Monday 24 December 2012

Regular expressions in CFML (part 4: syntax - repetition, sub-expressions and back-references)

G'day:
This is part four of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)".

Repetition (again)

In a previous article I listed the types of repetition one can express in a regex:
  • Zero times
  • Zero or one times
  • Zero our more times
  • One time
  • One our more times
  • An exact number of times
  • Between a specified minimum and maximum number of times
  • At least a minimum number of times


The thing to remember here - and I've already mentioned this - is that when specifying repetition it's the PATTERN that gets repeated, not the match resulting from the pattern. So "[ABC]{2}" matches any sequence of As, Bs or Cs, exactly two times. So not only would it find a match in "bluBBer", it'll also find numerous matches in "ABACus" (the AB, then potentially the AC; or the BA is also a possibility, depending on from where the matching exercise starts). So if one specifically wants to match runs of As, Bs etc, it's a different regex, eg: "A{2}|B{2}|C{2}" (the pipe means "OR"). OK, don't forget that.

How do we express that myriad of repetition options? They're done with modifiers which are appended to a pattern, eg: "[abc]?", "[abc]*", "[abc]+". The "?", "*", "+" are all modifiers.

As all the modifiers work in the same way, I'll just present then in a table.


ModifierMeaning
?Zero or one occurrences
*Zero or more occurrences
+One or more occurrences
{n}Exactly n occurrences
{m,n}Between m and n occurrences, inclusive
{n,}At least n occurrences

In the list at the top of this section, I also mentioned one can match zero occurrences of a pattern, and exactly one occurrence of a pattern. I left these out of the table as they are not handled with modifiers.

Exactly zero-occurrences are handled with negative look-aheads, which I'll cover in a later article as it's a complicated topic to get one's brain around.

Exactly one occurrence is the default behaviour of a pattern, so does - by its very nature - not take a modifier. IE: "[abc]" matches exactly one occurrence of "a", "b" or "c".

It's worth noting that the ones that match zero occurrences really do match zero occurrences.  So "A*" will match in "BBB", because at the first position of "BBB" there are indeed zero As. This makes this sort of modifier seem a bit useless, but when used in conjunction with others, it is very handy.

Greediness

The repetition modifiers can be further modified to make them either greedy (default) or non-greedy. One makes a repetition modifier "non-greedy" with the "?" (not the same one that itself is a modifier meaning "zero or once", as per above). Here's a live example as to why one might need to specify an non-greedy match:

<strong>First</strong>
<em>In between</em>
<strong>Last</strong>

Say you wanted to match whatever was in a <strong /> tags.  You could do this:

<strong>.*</strong>

Unfortunately ".*" will match as much as it can, so the match here will be:

<strong>First</strong>
<em>In between</em>
<strong>Last</strong>

If, however, we modify our * modifier to be not-greedy:

<strong>.*?</strong>

We get the match we want:

<strong>First</strong>
<em>In between</em>
<strong>Last</strong>

The non-greedy modifier can be used with the *, + and {n,} operators. Apparently it can also be used with the "?" modifier, meaning - I guess - the fewest of zero or one match. So "A??" would match zero or one As, matching zero if possible.

Anchors

One can anchor a pattern to need to match from the beginning of the string, or the end of the string, or be anchored to both (so in effect the pattern much match the entirety of the string).  The anchor for the beginning of the string is "^", and for the end of the string is "$". Note that one might think that we've already used "^" to mean "NOT" in the context of a character set, but that is only the case within a character set  (so within "[]"). Outside of a character set it means "anchor to the beginning of the string". In most situations, the "^" needs to be the first thing in the pattern, and the "$" the last thing. There are exceptions to this, but this is where you're generally see them.

What's the significance? I gave trite "^Once upon a time", "and they lived happily ever after.$" examples when describing "anchors" in an earlier article, but a more real world one would be validating that a string is - for example - of the format "XXX-NNN" where "X" is a letter, and "N" is a digit, eg: "ABC-123".

One could try to do this with a param statement:

param name="form.productId" type="regex" pattern="[A-Z]{3}-[0-9]{3}";

This looks fine, but it will quite happily pass "this is the product ID format: XXX-999, please remember this". It matches because there is indeed a match in there, as I have indicated. What we need to do is to make sure the pattern is anchored to both ends of the string: "^[A-Z]{3}-[0-9]{3}$". This means the string must start with XXX-999 and then end.

Note that one doesn't need to use both anchors, but I find myself doing so more often than not: I generally want to force a pattern to match the entirety of the string, rather than being interested in what's at the end or what's at the beginning, in and of themselves.

There are two more anchors: \A and \Z which work the same way. They differ from "^" and "$" in that a string can be processed in "multi-line mode" (see further down, under "Flags") meaning "^" and "$" denote the beginning and end of a line, whereas \A and \Z always mean the beginning/end of the entire string.I'll expand on this in a later article.

Sub-expressions

One can create a sub-expression out of one or more patterns by putting parentheses around them. There are different types and usages of sub-expressions: these are denoted by modifiers placed within the parentheses, at the beginning of the sub-expression (I'll explain and demonstrate below).

At its most basic, a sub-expression can be used just like parentheses are used in a mathematical expression to remove ambiguity, or force an order of operations.

Say one wanted to find matches of "swim" or "swam" (slightly contrived, but it's an easy one to demonstrate). One might try this as a pattern:

 swi|am

(Remembering that "|" means "OR"). Unfortunately that won't work as intended because it won't match "sw followed by either i or a, followed by m", it will match "swi or am".  This can be remediated by making the "i|a" into a subexpression:

sw(i|a)m

(one still would not do it this way, one would do sw[ai]m, but that is no good for an example in this case ;-)

So the parentheses here just group a pattern together, or more a case of keeping a pattern separate from adjacent patterns.

I'm going to digress for a moment and talk about back references, before returning to sub-expressions.

Back-references

Another side effect / usage of parentheses like this are that whatever is matched within the parentheses is "remembered" for future reference via a back-reference. Back-references are denoted in a pattern via the syntax "\n" where "n" is a number referring to which sub-expression one is meaning: a pattern can have multiple sub-expressions, so intrinsically multiple back-references.

Back references can be used in both a find pattern and a replace pattern.

We could find all strings that are delimited with either a <strong> or a <b> tag:

<(strong|b)>.*?</\1>

The \1 back-reference refers to what was matched, not the pattern. So the match here is only for strings with an opening and closing strong tag, or an opening and closing b tag, but not a string which starts with an opening strong tag and ends with a closing b tag.

Or we could reformat a date in mm/dd/yyyy format to be more sensibly-formatted in dd/mm/yyyy:

(\d{2})/(\d{2})/(\d{4})

Would be replaced with:

\2/\1/\3

One thing to consider here is that there can be as many sub-expressions - and therefore back-references - as one wants: the back-references can continue past \9 to \10... \99 etc. This is handy but can be flummoxing if one wishes to have a back-reference followed by a digit. Say for example we have a string which is a bunch of letters followed by a 1, and we want to change that to a bunch of letters followed by a two.  One might think we could use these two regexes:

Match: (.*)\d
Replace: \12

However you can see what's gone wrong here... the regular expression parser sees "\12" as the twelfth back reference, not the first one followed by a two. This is a daft example and it might sound like a contrived case, but things like this do crop up.

I never worked this out (nor found a solution online), but my colleague Simon came to the rescue with this technique:

Match: (.*)\d
Replace: \1\E2

"\E" is an end marker for when doing case conversion (again, this is not covered until a later article: I'll link to it when it's published)... one delimits a sequence of characters starting with "\U" or "\L" (upper or lower case), and ending with a "\E", eg: "\Ushouting\E". However in the context above, \E is a) meaningless and b) zero-width, so it simply acts to clarify the separation between the back reference and the digit. Nice one.

Sub-expressions again

One doesn't always want to "remember" a grouping for a back-reference, so one can tell the regular expression processor not to (to save resources, I guess).  This is done like this (recycling an earlier example):

sw(?:i|a)m

Here we are using the sub-expression just to clarify what's having the OR operator applied, but we don't need to know whether the subexpression matched "i" or "a", so we use the "?:" modifier to tell the engine not to remember it. I am split as to whether the improved resource usage is worth the extra clutter in the pattern. I have never encountered any issues with regular expression operations running out of memory or slowing down or anything, so I'm not sure what sort of saving one gets with not capturing the sub-expression for a back-reference. I guess one "self-documentation" benefit is that if one's got a complex pattern with lots of sub-expressions, it's helpful to know which are used for back-references and which are not.


Right. The next section of the "sub-expressions" coverage is "look-arounds", which is fairly complicated and long-winded (especially the way I describe it ;-), so I'm stopping here, and will publish that article after Xmas. I really don't expect anyone will be reading this stuff until the new year anyways, so this won't be too much of a let down.  And it makes this article end on a nice Scheherazade-esque cliff-hanger. Of sorts. If, like, a discussion on regular expressions can have a "cliff hanger" ;-)

Anyway... if you're reading this on Xmas Eve / Day, you're a weirdo (almost as much of a weirdo as I am for writing it...). Go wrap or open a gift, or eat too much food or drink too much drink or just treat it like any other day or whatever. Basically whatever your solstice-observance activities usually entail, I hope you have a good one.

[to be continued]

--
Adam