Saturday 22 December 2012

Regular expressions in CFML (part 3: syntax - single characters)

G'day:
This is part three of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)".

Syntax

OK, so I'm now going to try to describe how all those concepts are actually reflected in regex syntax. This is where I am going to dispense with most of the keys on the keyboard, and just use the punctuation keys ;-)

(not really)

I'll go through each of those concepts in turn.


Matching a literal string

This is as simple as specifying the character to match, like I said in the previous article. So "a" will match "a" in "G'day mate." The only convolution here is that if the character you want to match is a meaningful one as far as the regex syntax goes, one needs to escape it. So to match the fullstop in "G'day mate.", one needs to do this: "\." (fullstop means "any character", so if you just use ".", it'll match the "G"). The backslash is the escape character.

Another thing to note is that the "reserved-ness" of a regular expression "reserved character" is contextual. So if the "." is alone, it means "any character"; however, if it's in a character set (eg: "[.]"), it just means "fullstop" and does not need to be escaped. A lot of the time people escape stuff they don't need to, and that makes an already-cluttered regular expression look even more cluttered and indecipherable.
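To make the escaping rules concrete, here's a minimal sketch. It's in Python rather than CFML (an assumption on my part that Python's `re` module is a fair stand-in for these basics), and note that Python positions count from 0, whereas CFML's reFind() counts from 1.

```python
import re

text = "G'day mate."

# An unescaped dot means "any character", so it matches the first one it sees.
unescaped = re.search(r".", text).group()

# Escaped, the dot only matches a literal fullstop.
escaped_pos = re.search(r"\.", text).start()

# Inside a character set the dot is already literal: no escape needed.
in_set_pos = re.search(r"[.]", text).start()

print(unescaped, escaped_pos, in_set_pos)
```

Both the escaped version and the character-set version land on the fullstop at the end of the string; the bare dot stops at the "G".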

A character

A character is the basic element of a regular expression pattern: a character is what's being matched; how many, what's nearby and whether it's at the beginning or end of the string are just modifications to this. There are several ways of representing a character:

[a literal character]

See above

. (dot / fullstop)

A dot matches any character. It can be a letter, a number, a non-printing control character (like ASCII 7, which used to be rendered via CTRL-G, and made the PC go beep. OK: I'm old), a line break... anything. Any one thing. In some regex dialects a dot does not match line breaks, but in ColdFusion it does. This is something to remember.

Example:
"b.t" will match "bat", "bet", "bit", "bot", "but". Or "b[tab]t", "b1t", "b[new line]t".

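Here's the dot in action as a rough Python sketch (not CFML). Python happens to be one of those dialects where the dot skips line breaks by default, so the `re.DOTALL` flag is used to mimic the ColdFusion behaviour described above; the variable names are just mine for illustration.

```python
import re

pattern = r"b.t"
candidates = ["bat", "bet", "b\tt", "b1t", "b\nt", "bt"]

# With DOTALL the dot matches a line break too, like ColdFusion's dot.
cf_like = [bool(re.fullmatch(pattern, c, re.DOTALL)) for c in candidates]

# Without it, Python's dot matches everything except "\n".
python_default = [bool(re.fullmatch(pattern, c)) for c in candidates]

print(cf_like)
print(python_default)
```

The last candidate, "bt", fails either way: the dot has to match exactly one thing, not zero things.
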
Character sets

A set of characters to match can be expressed with square brackets. Any one of the set of characters can match a single character in the target string, eg:

"dr[aiu]nk" will match "drink","drank","drunk" (that's in the order it usually happens to me).

Now... note that the characters in the character set are treated as individual characters, not a sequence of characters. This has two ramifications that I have seen people being caught out by:

"[cat]" will find a match in "fact" just as much as it will find a match in "catch". It's not looking for "cat", it's looking for "a c, an a or a t". And it'll find the "a" in "fact" and the first "c" in "catch".

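Equally, I see people doing this: "[c|a|t]" (the pipe means "or" in regex parlance). OR is implicit already in a character set, so it's not needed. Worse, inside a character set the pipe loses its special meaning and is just a literal character, so as well as matching "c", "a" or "t", that set will also match "|". So it's not just extra clutter: it's subtly wrong.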
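Both of those ramifications can be checked with a quick sketch. Again this is Python standing in for CFML, on the assumption that character sets behave the same way in both. One thing it shows up is that a pipe inside a character set is treated as a literal pipe.

```python
import re

# "[cat]" means "a c, an a or a t", so it finds the "a" in "fact"...
fact_hit = re.search(r"[cat]", "fact").group()

# ...and the first "c" in "catch".
catch_hit = re.search(r"[cat]", "catch").group()

# Inside a set the pipe is literal, so "[c|a|t]" also matches "|".
pipe_hit = re.search(r"[c|a|t]", "x|y")
plain_hit = re.search(r"[cat]", "x|y")

print(fact_hit, catch_hit, pipe_hit is not None, plain_hit is not None)
```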

Character set negation

One can create a character set which indicates matching anything but what's in the character set. This is achieved by using the ^ as a NOT operator:

"c[^ie]t" will match "cat", "cot" and "cut" (and stuff like "c_t"), but will specifically not match "cit" and "cet".

Like before, sometimes people try to do this: "This sentence is about dogs", "this other one is about cats". And they want a positive result only for the "dogs" one, so they try "[^cat]" thinking it means "not cat". No: it means "any one character that is not c, a or t". So it finds a match in both sentences: the "T" of "This" in the first one, and the "h" of "this" in the second.

Something else I just realised as I was writing this article, is one cannot do this:

[ab^cde]

Where the expectation is to match "a" or "b", but not "c", "d", or "e". For the "^" to mean "not", it must be the first character of the character set. Otherwise it just means, literally, "^". So that character set would find a match in the first position of "^ will be found".
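The three negation points above, as a small Python sketch (Python rather than CFML; I'm assuming negated sets work identically for these cases, and the matching here is case-sensitive).

```python
import re

negated = re.compile(r"c[^ie]t")
words = ["cat", "cot", "cut", "c_t", "cit", "cet"]
matched = [w for w in words if negated.fullmatch(w)]

# "[^cat]" is "one character that is not c, a or t", not "not the word cat".
first = re.search(r"[^cat]", "This sentence is about dogs").group()
second = re.search(r"[^cat]", "this other one is about cats").group()

# A caret that is not the first character of the set is a literal caret.
caret_pos = re.search(r"[ab^cde]", "^ will be found").start()

print(matched, first, second, caret_pos)
```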

Character set ranges

So far the sequences of characters I've been using in the character sets have been pretty small. But what if one wanted to match letters A through M? It'd be a bit rubbish to have to do this: "[ABCDEFGHIJKLM]". Fortunately one does not have to. One can specify a character range in the character set too. Like this: "[A-M]". Easy! The caveat that comes with this is that if one wants to actually match the "-" character, then it must be the last character, eg: "[A-]" matches either "A" or "-". The reason why one cannot simply escape it - eg: "\-" - must be 'cos it could be confused as a range between "\" and [whatever character is after the "-"], eg: "[\-`]" would match any one of "\", "]", "^", "_", "`".

I have to concede I am not sure what happens if the sequence does not reflect a range from low to high, eg: "Z-A"... I just tested and it isn't valid. Something to remember.
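A quick sketch of ranges, in Python rather than CFML. Two caveats: Python's error handling (`re.error`) is obviously its own, and Python does let one escape a hyphen inside a set, so that part of the discussion above is ColdFusion-specific and isn't shown here.

```python
import re

# A range matches any one character between its two ends, inclusive.
a_to_m = re.findall(r"[A-M]", "HELLO WORLD")

# A hyphen as the last character of the set is a literal hyphen.
trailing_hyphen = re.findall(r"[A-]", "A-B")

# A range running from high to low is rejected outright.
try:
    re.compile(r"[Z-A]")
    reversed_is_valid = True
except re.error:
    reversed_is_valid = False

print(a_to_m, trailing_hyphen, reversed_is_valid)
```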

Aside:
I've actually learned three new things in the course of writing this article so far! Cool.

POSIX character classes and escape sequences

Some valid characters don't mesh so well with textual representation, or can't be easily represented in a string. These are catered for in one of two ways with CFML regular expressions, both approaches having common ground, but both having their own specialities.

The first approach available is to use something called "POSIX character classes" (I had to look that term up). These are of the form "[:SOMESPECIALNAME:]", and they are used in a character set like so: "[[:SOMESPECIALNAME:]]", eg "[[:punct:]]". Note the double-up of the square brackets: the outer ones are just the usual character set delimiters, as per "[AEIOU]", the inner ones are part of the POSIX character class itself. "[:punct:]" matches a range of predefined ASCII punctuation characters. This saves one remembering what they all are, and having to write this: "[\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]" (and in CFML, you'd need to escape that "#" too). There's a dozen or so of these POSIX character thingeys, and they're in the docs.

The other thing to note here is that these character classes can be negated too, eg: "[^[:punct:]]".
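Python's `re` module doesn't support POSIX character classes at all, so the sketch below is only an approximation of "[[:punct:]]" and "[^[:punct:]]": it builds the equivalent sets by hand from `string.punctuation`, which holds the 32 ASCII punctuation characters. The variable names are hypothetical; in CFML one would just use the POSIX class directly.

```python
import re
import string

# re.escape makes every punctuation character safe inside a character set.
punct = "[" + re.escape(string.punctuation) + "]"
not_punct = "[^" + re.escape(string.punctuation) + "]"

text = "G'day, mate!"
found = re.findall(punct, text)
kept = "".join(re.findall(not_punct, text))

print(found, kept)
```

It also demonstrates why the named class is nicer: nobody wants to maintain that escaped list by hand.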

The other option is using escape sequences. These are much shorter than the POSIX ones, in that they are just like "\d" which means "any digit". These can be used in a character set, or by themselves, eg: "[A-F\d]" would match all (uppercase) hexadecimal digits (as would "[[:XDIGIT:]]", btw, which accepts lowercase a-f as well); whereas "\d" by itself would match one digit, 0-9 (so the same as "[\d]"). That's something to remember: if you just want digits (or any other sort of character expressed by a single escape sequence), don't bother making it into a character set.
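As a rough Python sketch of the same thing (not CFML, but "\d" means "a digit" in both, at least for ASCII input like this):

```python
import re

hex_upper = re.compile(r"[A-F\d]+")
upper_ok = hex_upper.fullmatch("CAFE42") is not None
lower_ok = hex_upper.fullmatch("cafe42") is not None

# "\d" on its own and "[\d]" find exactly the same things.
bare = re.findall(r"\d", "a1b22")
in_set = re.findall(r"[\d]", "a1b22")

print(upper_ok, lower_ok, bare, in_set)
```

Note that "[A-F\d]" only covers the uppercase hex letters; add "a-f" to the set if lowercase matters.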

Escape sequences can also be negated, but slightly differently. "\w" means "an alphanumeric character or underscore", whereas "\W" (notice the former is lowercase, the latter is uppercase) matches "any character that is not alphanumeric or an underscore". In a character set, one can still do this: "[^\w]" (same as "\W"). The reason why both are supported is - remember - the "^" thing only works in a character set (indeed: outside a character set it has an entirely different meaning, which I'll get to), and the escape sequences themselves don't need to be in a character set. Just to make that last point more clear, one might do this:

[a-z]+\s

This matches one or more letters, followed by a single whitespace character. I don't want the "\s" to be in the character set, because otherwise I could match "a b c " when I really only want "abc ". And, likewise, it's not part of its own "set", as there's only one of it. So I don't want to do this: "[a-z]+[\s]". I could do that, but I don't need to. Hopefully that makes sense.
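If it doesn't quite make sense, here's a sketch of the difference. It's Python rather than CFML, on the assumption that "\w", "\W" and "\s" behave the same for plain ASCII text like this.

```python
import re

# "\W" and "[^\w]" are two spellings of the same thing.
upper_w = re.findall(r"\W", "a_b c!")
negated_w = re.findall(r"[^\w]", "a_b c!")

# "\s" outside the set: letters, then one whitespace character.
outside = re.match(r"[a-z]+\s", "a b c ").group()

# "\s" inside the set: any mix of letters and whitespace.
inside = re.match(r"[a-z\s]+", "a b c ").group()

# Wrapping a lone "\s" in its own set changes nothing.
own_set = re.match(r"[a-z]+[\s]", "abc ").group()

print(upper_w, negated_w, repr(outside), repr(inside), repr(own_set))
```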

Escape sequences also represent things other than characters. For example a "\b" is something I use a lot, it means "word boundary", so for argument's sake I could have this "[a-z]+\b". This differs from "[a-z]+\s" in a coupla ways. First, if the string was "tahi rua toru wha", the "\b" version would match just "tahi", whereas the "\s" version would match "tahi " (see the space at the end there?). Also a word boundary can be something other than a whitespace character, eg: "rima, ono, whitu, waru". And what we want to match is [the word] not [the word and whatever comes next]. This sort of thought process is important to remember when dealing with regexes: "exactly what am I trying to match here?"
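The "\b" versus "\s" difference, sketched in Python (an assumption that word boundaries work the same way as in CFML, which for simple lowercase words like these I'd expect them to):

```python
import re

spaced = "tahi rua toru wha"
with_boundary = re.search(r"[a-z]+\b", spaced).group()
with_space = re.search(r"[a-z]+\s", spaced).group()

commas = "rima, ono, whitu, waru"
# A word boundary sits between a letter and a comma, or at the end.
words_boundary = re.findall(r"[a-z]+\b", commas)
# No word here is directly followed by whitespace, so nothing matches.
words_space = re.findall(r"[a-z]+\s", commas)

print(repr(with_boundary), repr(with_space), words_boundary, words_space)
```

The "\b" itself consumes nothing: it matches a position, not a character, which is why the matched text is just the word.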

Another thing escape sequences can do is to express a character by its octal (who the hell uses octal?) or hex sequence, eg: "\x41" is "A", as is "\101". The ColdFusion regex engine does not support anything other than ASCII as far as I know, so one cannot express UTF-8 or other multi-byte characters (someone correct me if I'm wrong... I'm not in front of a computer at the moment to check).
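A tiny Python sketch of the hex and octal forms (not CFML; Python accepts both spellings in a pattern, and hex 41 and octal 101 are both decimal 65):

```python
import re

hex_matches = re.fullmatch(r"\x41", "A") is not None
octal_matches = re.fullmatch(r"\101", "A") is not None

# The same code point written three ways.
same_code_point = 0x41 == 0o101 == ord("A")

print(hex_matches, octal_matches, same_code_point)
```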

All the escape sequences are in the docs. There's no point me repeating them all here: the explanations in the docs are about what I'd say about them too.

That's about it as far as matching a single character goes. How do we match more than one?


[to be continued...]

--
Adam