Thursday, 3 January 2013

Regular expressions in CFML (part 6: syntax - flags and the odds 'n' sods that are left)

This is part six of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with:
"Regular expressions in ColdFusion (part 3: syntax - single characters)";
"Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)";
"Regular expressions in ColdFusion (part 5: syntax - look-arounds, and how the engine parses the string it's matching)".


There are a few different "modes" the regex engine can use when processing a string. These are summarised as follows:

(?x) This flag tells the engine to ignore whitespace within the pattern, so one can split it across multiple lines and add comments (for clarity).
(?m) This specifies that the string being matched should be treated as multi-line, so the ^ and $ anchors can be used to denote the beginning and end of a line. The \A and \Z anchors still denote the beginning and end of the entire string.
(?i) This specifies the pattern should be considered case-insensitive. This is somewhat redundant in CFML, as there are separate functions for case-sensitive and case-insensitive operations. It still works, though.


If you've read Ben Nadel's blog, you'll know he's mad for the whitespace (I mean "mad" as in "he's a loony", not simply "keen on"). And he uses the (?x) flag extensively in his examples. I actually agree with his use of whitespace with regex patterns, as they're bloody hard to read sometimes. Say I wanted to match a UUID/GUID, I could just have this:

^[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-?[0-9a-f]{12}$

Which is gibberish. I could clarify that as:

(?x)                ## allow whitespace and comments
                    ## This is a regex which matches:
^                   ## from the beginning
[0-9a-f]{8}         ## eight hex digits
(?:                 ## just a grouping for repetition
    -[0-9a-f]{4}    ## a dash followed by four hex digits
){3}                ## so three groups of that
-?                  ## an optional dash (GUID has one, UUID does not)
[0-9a-f]{12}        ## the last 12 hex digits
$                   ## right to the end
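
To try the commented pattern out, here is a minimal sketch (not necessarily how the original example was run). The variable names are mine, and I'm assuming each comment is ended by a line feed (chr(10)). Note that "##" in a CFML string literal yields a single "#", which is what starts a comment in (?x) mode.

```cfml
<cfscript>
// Hypothetical variable names; the pattern itself is the commented one above.
nl = chr(10); // comments in (?x) mode run to the end of the line

uuidPattern = "(?x)" & nl
    & "^              ## from the beginning" & nl
    & "[0-9a-f]{8}    ## eight hex digits" & nl
    & "(?:            ## just a grouping for repetition" & nl
    & "  -[0-9a-f]{4} ## a dash followed by four hex digits" & nl
    & "){3}           ## so three groups of that" & nl
    & "-?             ## an optional dash" & nl
    & "[0-9a-f]{12}   ## the last 12 hex digits" & nl
    & "$              ## right to the end";

// createUUID() returns upper-case hex digits, hence the NoCase variant
looksLikeUuid = reFindNoCase(uuidPattern, createUUID()) GT 0;
writeOutput(looksLikeUuid);
</cfscript>
```

Building the string line by line keeps each comment beside the piece of pattern it explains, which is the whole point of the flag.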


The (?m) flag enables the "^" and "$" anchors to work at the beginning and end of each line, not just of the entire string. We could match lines that have single words in them in this William Carlos Williams poem:

so much depends
upon

a red wheel
barrow

glazed with rain
water

beside the white
chickens.
(which I "studied" in my brief foray into English Lit. at university before being kicked out for not going to lectures or tutorials, instead just going to the pub. This poem stuck with me though).

So we could use "(?m)^\S+$" as a pattern to match "upon", "barrow", "water" and "chickens." If that's what you wanted to do.
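
As a rough sketch of that in CFML, assuming the poem is held in a variable called poem (my name, not anything built in) and that its lines are separated by bare line feeds:

```cfml
<cfscript>
// Hypothetical setup: the poem as a string with LF-only line breaks
nl = chr(10);
poem = "so much depends" & nl & "upon" & nl & nl
     & "a red wheel" & nl & "barrow" & nl & nl
     & "glazed with rain" & nl & "water" & nl & nl
     & "beside the white" & nl & "chickens.";

// reMatch() returns an array of every substring the pattern matches
singleWordLines = reMatch("(?m)^\S+$", poem);
writeDump(singleWordLines);
</cfscript>
```

The line-break assumption matters: as far as I know, with CRLF line endings the carriage return sits between the word and the line feed, so "$" would not match straight after the word.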


As previously stated, the (?i) flag just makes the pattern case-insensitive.
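
A small sketch comparing the flag with the NoCase function (the variable names are just for illustration):

```cfml
<cfscript>
line = "beside the white";

// Case-sensitive function, case-insensitive pattern
viaFlag = reFind("(?i)WHITE", line);
// Case-insensitive function, plain pattern
viaFunction = reFindNoCase("WHITE", line);
// Case-sensitive function, plain pattern
caseSensitive = reFind("WHITE", line);

writeOutput("#viaFlag# #viaFunction# #caseSensitive#");
</cfscript>
```

The first two calls should report the same position, and the third should report 0 because the case differs.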


If you're familiar with JavaScript regexes, you're probably aware of the "g" flag, which is appended after the closing slash of a regex literal.


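This tells the engine to perform the match globally, not just find the first match and then quit. This isn't supported in ColdFusion; instead, use the "ALL" scope when calling reReplace().

Case can also be converted in the replacement string: "\U" starts an upper-cased run and "\E" ends it. For example, this upper-cases every four-letter word in the poem:

fourLetterWords = reReplace(poem, "(\b\w{4}\b)", "\U\1\E", "ALL");
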
One can also do single-character case conversion using "\u" and "\l". We could make Williams turn in his grave by capitalising the first word of each line:

capitalised = reReplaceNoCase(poem, "(?m)^(\w+\b)", "\u\1", "ALL")
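
To see the scope argument and the case-conversion escapes side by side, here is a minimal sketch on a single line of the poem. The variable names are mine, and I'm relying on reReplace() defaulting to replacing only the first match when no scope is given.

```cfml
<cfscript>
line = "so much depends";

// No scope argument: only the first match is replaced
firstOnly = reReplace(line, "\b(\w)", "\u\1");
// "ALL" scope: every match is replaced, much like JavaScript's "g" flag
everyWord = reReplace(line, "\b(\w)", "\u\1", "ALL");
// \U ... \E upper-cases the whole back-reference rather than one character
shouted = reReplace(line, "\b(\w{4})\b", "\U\1\E", "ALL");

writeOutput(firstOnly & " | " & everyWord & " | " & shouted);
</cfscript>
```

Only "much" is exactly four word characters long, so it is the only word the last call should change.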

And that's that really as far as the syntax goes... this was a bit of a short one. I've now caught up with the stuff I wrote before I started publishing these articles, but I still have a bit more to cover. I'm about to start writing an article on how all this fits into CFML: the functions and tags which use regular expressions; plus I want to write something on Java's regex implementation, and possibly JavaScript's too (given we all (?) use JS too, on a daily basis, so it's at least tangential for ColdFusion developers).

[to be continued]