Thursday 20 December 2012

Regular expressions in CFML (part 2: concepts)

G'day:
This is part two of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)". Initially I set out to only write one article, but by the time it was over 8000 words long, I figured I should split it up and serialise it, otherwise no-one would be brave enough to read the whole thing. If you see references in the text to "see above" or "see below", it might refer to something in one of the other articles. I'll try to find them all and replace them with links, but no doubt I'll miss some.

Concepts / components of a regular expression

A regular expression is built out of all those seemingly random sequences of punctuation characters. Taken en masse, that's quite impenetrable, and it's good to take a step back to consider the various notions / components / building blocks that are important to regexes. This is just a narrative / conceptual discussion, rather than delving into syntax, just for the moment. Knowing about the concepts is more important than the minutiae of the syntax, and once we have a handle on the concepts, the syntactical vagaries will make more sense, and the expressions themselves will seem less daunting. In theory ;-)


Literal string

Anything that does not have special meaning in regular expression syntax is a literal string. As I mentioned in the previous article, this is loosely analogous to the relationship between CFML and other text within a CFM file.

A character

A character is exactly what it sounds like. It's something that is matched in the string being searched. It could be:

  • "any character", and by this I don't mean "an A, or a 6 or a # or something", I mean a match that could be [any character], if you see the subtle difference. If one has a pattern that matches two of [any character], it does not mean something like "AA" (where any character is matched, then a second one of those characters), it could be "A~": any two characters. This idea is important to latch on to;
  • "any of a subset of characters" (eg: vowels; the digits 1, 2 or 3; etc);
  • "anything that's not of a subset of characters" (eg: "not digits");
  • a "special" character like a carriage return or tab;
  • a "character class" like "digits" or "whitespace" (or, indeed, "not whitespace", etc). Can also be reflected by an "escape sequence";
  • a hex or octal representation of a character.
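For a quick peek at how the kinds of character listed above can look in practice, here is a minimal sketch. It assumes Python's re module rather than CFML's regex functions, purely so the snippet is runnable on its own; the variable names are made up for illustration, and the syntax itself is covered properly later in the series.

```python
import re

# "." is "any character" (in Python, anything except a newline by default)
any_two = re.findall(r"..", "A~")            # two arbitrary characters, not "AA"
vowels = re.findall(r"[aeiou]", "regex")     # any of a subset of characters
non_digits = re.findall(r"[^0-9]", "a1b2")   # anything that's not of a subset
digits = re.findall(r"\d", "a1b2")           # a character class, via an escape sequence
tab = re.search(r"\t", "a\tb")               # a "special" character (a tab)
hex_a = re.search(r"\x41", "CAT")            # hex representation of "A"
```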

Repetition

Part of the pattern can be how many times the [thing] is matched. EG: we could match "EE" (two Es), as opposed to just "an E", or "three Es". The types of plurality regexes have are:

  • zero times (the specific absence of a pattern);
  • zero or one time;
  • zero or more times;
  • exactly once;
  • one or more times;
  • a specific number of times;
  • between a minimum and maximum number of times;
  • at least a specific number of times (eg: "six or more" times).

To reiterate and expand on what I said above: an important thing to remember here is that it's the pattern that has repetition, not what's matched. I'm trying to avoid specific syntax here, but if your regex was to match "[any letter] [between two and three times]", this does not only mean "AA" and "BBB" etc, it means "NZ", "GBP" etc: so it's not "[match a character] [and that exact character 2-3 times]". I hope that makes sense. Obviously one can have a regular expression pattern that matches "AA" and not "NZ" (and, indeed, the other way around), but the thing to remember here is that it's the pattern that has repetition, not the result of matching the pattern.
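The "[any letter] [between two and three times]" idea can be sketched like this. Again, this assumes Python's re module as a stand-in (the article's own syntax discussion comes later), and the names pattern, results and six_or_more are just illustrative.

```python
import re

# [any capital letter] [between two and three times]
pattern = re.compile(r"[A-Z]{2,3}")

candidates = ("AA", "BBB", "NZ", "GBP", "A", "ABCD")
# fullmatch() requires the whole string to fit the pattern
results = [bool(pattern.fullmatch(s)) for s in candidates]

# "at least a specific number of times": six or more Es
six_or_more = bool(re.fullmatch(r"E{6,}", "EEEEEEE"))
```

"NZ" and "GBP" pass just as readily as "AA" and "BBB": the repetition applies to the pattern, not to whichever character happened to match first.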

Modifiers

Modifiers (more commonly called quantifiers) are used to express the repetition. There are specific characters that mean "zero or more", "one or more", etc.

Character set

These are used to express the idea I mentioned above of "vowels" ("vowels" means nothing to a regex, but a character set of "A, E, I, O, U" will reflect "vowels"), or "digits 1, 2, 3". One can define these. A character set will match one character, but it can be any character in the set. When you see the syntax, it will be easy to get confused and think a character set will match multiple characters.
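To make the "one character, but any character in the set" point concrete, here is a small sketch, assuming Python's re module for the sake of something runnable (the names are illustrative only):

```python
import re

# A character set standing in for "vowels"
vowel = re.compile(r"[AEIOU]")

# It matches ONE character, which can be any member of the set
first = vowel.search("QUEUE").group()

# It does not match two characters, even if both are in the set
whole = vowel.fullmatch("AE")
```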

Sub-expressions

A series of patterns can be grouped together, and treated as a unit, for then being modified or remembered for later use. EG: one can have a regex that says "[a letter] [a digit] [both of those, three times]", so that would match "A1A2C3", but not "AAA2C3".
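As a rough sketch of grouping, here is a letter followed by a digit, treated as a unit and repeated three times (which is the reading that fits the "A1A2C3" example). This assumes Python's re module; unit_three_times is a made-up name.

```python
import re

# ( ... ) groups [a letter][a digit] into one unit; {3} repeats the whole unit
unit_three_times = re.compile(r"([A-Z][0-9]){3}")

matches = bool(unit_three_times.fullmatch("A1A2C3"))
rejects = bool(unit_three_times.fullmatch("AAA2C3"))
```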

Back-references

The match of a sub-expression can then be referred to later in the regular expression (or in the replacement string, if doing a replacement). So one could match "<h[digit]>[any stuff here]</h[that same digit from before]>", so it would match "<h1>hello world</h1>", but not "<h1>Hello World</h2>" (which is obviously invalid anyhow, but you get the idea).
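The heading example might look something like this. It is a sketch assuming Python's re module, where \1 refers back to the first sub-expression; the names heading, good, bad and swapped are illustrative, and the word-swapping replacement is an extra example of my own rather than anything from the text above.

```python
import re

# (\d) remembers the digit; \1 insists on that same digit in the closing tag
heading = re.compile(r"<h(\d)>.*</h\1>")

good = heading.search("<h1>hello world</h1>")
bad = heading.search("<h1>Hello World</h2>")

# Back-references also work in the replacement string: swap two words
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
```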

Greediness

There's a concept of greediness. When matching "one or more As" in the string "AAAAAAAA", a greedy regular expression will match as many as it can (so: the whole lot -"AAAAAAAA"- in this example); a non-greedy one will match as few as it can (so just "A", which fulfills the "one or more" requirement). Note that the rule is actually "as many (greedy) or as few (non-greedy) characters that still allow the rest of the pattern to be matched". I'll get to the significance of this later.
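A minimal sketch of the difference, assuming Python's re module (where a trailing ? makes a repetition non-greedy); the variable names are illustrative:

```python
import re

# Greedy: "one or more As" takes as many as it can
greedy = re.search(r"A+", "AAAAAAAA").group()

# Non-greedy: "one or more As" takes as few as it can
lazy = re.search(r"A+?", "AAAAAAAA").group()

# "...that still allow the rest of the pattern to be matched":
# the non-greedy A+? has to keep expanding until the B can match
lazy_then_b = re.search(r"A+?B", "AAAB").group()
```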

Width

A pattern has a width. A character-matching pattern will have a positive width (eg: it's one character wide); some pattern elements have "zero width", for example one can have a pattern that looks for the word "Adam", but will not match "madam", because part of the pattern indicates where the word boundaries are (a word boundary is where the word ends, and something else - eg a space - begins). The boundary is not reflected by any characters, it's reflected by a change from "word characters" to "non-word characters" (eg: white space, or punctuation, etc), so its width is zero: it occupies the "space" between two characters, not a character itself.
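Here is a sketch of the "Adam" versus "madam" example, assuming Python's re module, where \b denotes a word boundary. The search is made case-insensitive (an assumption on my part) to show that it is the boundary, not the capital A, that rules out "madam".

```python
import re

# \b is zero-width: it matches a position, not a character
word = re.compile(r"\bAdam\b", re.IGNORECASE)

found = word.search("G'day Adam, how are you?")
not_found = word.search("Good evening, madam")

# A boundary on its own matches, but what it matches is empty
boundary_only = re.search(r"\b", "Adam").group()
```

The match for "Adam" is still only four characters wide: the two boundaries contribute nothing to it.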

Look-arounds

A regular expression can check to see if a pattern is there, without the match of the pattern progressing the point at which the next pattern is looked-for. This is hard to explain without working through an example (which I will do in a subsequent article, when I'm going through look-around syntax), but I'll try. One could have a pattern "[any letter] [which has a vowel after it]". All we are interested in is the letter, not the vowel that follows it. So if we have the string "ABCDEF" and used a "positive look-ahead" for that, the match would just be "D"; if we just had a normal regex, then the match would be "DE". There are look-aheads and look-behinds, and negative look-aheads and negative look-behinds. The ahead-ness and behind-ness are where the regex engine looks, relative to the current position in the string: so a look-behind could match "a digit that has a digit immediately before it". The negativity is just a way to check for the absence of things: "a vowel which isn't followed by another vowel". ColdFusion does not support look-behinds with its internal regex engine.

I mentioned "the point at which the next pattern is looked-for" above. This warrants further explanation: I will cover this in a later article when I'm demonstrating the syntax.
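In the meantime, the "ABCDEF" look-ahead example can be sketched as follows. This assumes Python's re module, which (unlike ColdFusion's internal engine, as noted above) also supports look-behinds, so the last line is Python-only; the test strings "AEBIC" and "a12b3" are made up for illustration.

```python
import re

# Positive look-ahead: a letter which has a vowel after it; the vowel is not consumed
look_ahead = re.search(r"[A-Z](?=[AEIOU])", "ABCDEF").group()

# A normal pattern consumes the vowel as part of the match
normal = re.search(r"[A-Z][AEIOU]", "ABCDEF").group()

# Negative look-ahead: a vowel which isn't followed by another vowel
lone_vowels = re.findall(r"[AEIOU](?![AEIOU])", "AEBIC")

# Positive look-behind: a digit that has a digit immediately before it
second_digits = re.findall(r"(?<=\d)\d", "a12b3")
```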

Flags

A regex can have a number of flags which alter how it is processed. These affect things like case-sensitivity; whether the string being checked is considered one long string, or line-by-line; or whether the regex has whitespace and comments in it. Yes, a regular expression can have comments in it.
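As a sketch of those three kinds of flag, here they are as Python's re module spells them (re.IGNORECASE, re.MULTILINE and re.VERBOSE); CFML expresses flags differently, so treat the spelling as an assumption and the idea as the point. The sample strings are made up.

```python
import re

# Case-insensitivity
case_insensitive = re.search(r"coldfusion", "ColdFusion", re.IGNORECASE)

# One long string vs line-by-line: ^ means "start of string" or "start of each line"
text = "one two\nthree four"
whole_string = re.findall(r"^\w+", text)
line_by_line = re.findall(r"^\w+", text, re.MULTILINE)

# Whitespace and comments inside the regex itself
commented = re.compile(r"""
    \d{4}   # a four-digit year
    -       # a literal hyphen
    \d{2}   # a two-digit month
""", re.VERBOSE)
year_month = commented.fullmatch("2012-12")
```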

Anchors

A regex can be anchored to the beginning of the string, or the end of the string, or both. This means one can have a pattern to search for "Once upon a time" at the beginning of the string, but not anywhere else within the string. Or match "And they all lived happily ever after." at the end. If both are used, then the pattern must match the entire string.
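A small sketch of anchoring, assuming Python's re module, where ^ anchors to the beginning and $ to the end; the story text and variable names are invented for the example.

```python
import re

story = "Once upon a time there was a regex. And they all lived happily ever after."

# Anchored to the beginning: matches here...
at_start = re.search(r"^Once upon a time", story)
# ...but not when the phrase appears anywhere else in the string
not_at_start = re.search(r"^Once upon a time", "So: Once upon a time")

# Anchored to the end
at_end = re.search(r"happily ever after\.$", story)

# Both anchors: the pattern must account for the entire string
all_digits = re.search(r"^\d+$", "12345")
not_all_digits = re.search(r"^\d+$", "123 45")
```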

Misc

There are another few odds 'n' sods that are vagaries of how those concepts work, and we'll just look at them individually when we cover their syntax.

Having teased you with the concepts, this is a fairly demotivating place to break off, but I'm up to about 1400 words already in this article, and the next logical section is quite long, so I'll stop at this narrative juncture.

[to be continued...]

--
Adam