Wednesday, 26 December 2012

Regular expressions in CFML (part 5: syntax - look-arounds, and how the engine parses the string it's matching)

G'day:
This is part five of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)" and "Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)".

Please note that this article - more so than the other ones - does not stand-alone. It's more a "part 2" of the preceding article. I advise reading at least that one before reading this one. But, really, one should read the whole lot in order. But bring food and water with you before starting.

Look-arounds

Another type of sub-expression is a look-around. Look-arounds do what the name suggests: they look around to see if there's a match (or there specifically isn't a match). They can look ahead from the current position in the string being processed, or they can look behind. So we have four different sorts of look-around:
  • positive look-ahead;
  • negative look-ahead;
  • positive look-behind;
  • negative look-behind.
ColdFusion's regex engine only supports look-aheads (both positive and negative ones).

Before I explain how these work, let's back up a bit and examine what goes on when a regular expression pattern matching exercise takes place on a given string.


How is a pattern matched

Let's say we're looking for words which begin with a double vowel:

\b([aeiou])\1.+\b

So to recap:
\b matches the beginning or end of a word;
([aeiou])\1 matches a vowel which occurs twice in a row (so "ee" but not "ea");
.+ matches any letter, one or more times;
\b matches the beginning or end of a word.

Now let's say we're using that regular expression pattern to find a match in this string:

rubber aardvark


The engine starts at the beginning of the string, first looks for the boundary at the start or end of a word, and finds one:

|rubber aardvark

(I'm gonna use the pipe character to show the word boundaries... just imagine it's the position I'm indicating, rather than the actual character. So it's a zero-width match).

So it skips the width of that match (which is zero), and looks from the next character for a match of two vowels:

|rubber aardvark

No good. It definitely can't find any more of the pattern from this position, so it starts looking again, this time starting from the next character in the string:

rubber aardvark

It was looking for a word boundary again, and failed, so the pattern can't be found, so it starts again from the third (fail), fourth (fail), etc until it gets past the end of "rubber", and it finds a word boundary again:

rubber| aardvark

But the following characters are not two vowels:

rubber| aardvark

So it starts again on the other side of the space (again, it's skipping one character over every time it restarts trying to find the pattern). Now we have some luck:

rubber |aardvark

And some more luck:

rubber |aardvark

etc, until we match the entire word (and the word delimiters at either end, making sure it is a word):

rubber |aardvark|

At this point the engine will report back "got a match of that at position 8" (there's no subexpression capturing the match, so it does not report back "and the match was aardvark" or "and the match was 8 characters long" or anything like that... it just reports back "8".

That's reasonably easy to follow, and the two things to note here are these:
  • Every time an attempt to match an atomic part of the pattern fails, the process starts again, but shifted over by one character.
  • Every time an atomic part of the pattern is matched, the next atom is looked for over by the width of the previous atom match (eg: after matching "aa" at postion 8 & 9, the engine looks for the "\w*" starting from the 10th position.
Why am I telling you this, which - really - is pretty obvious? Well because look-arounds are zero-width. What this means is that the engine looks for the pattern in the look-around sub-expression, but if it's matched, the next atom is looked for from the same position. The ramifications of this are critical to how look-arounds are useful, and I'll explain why as we go...

Back to look-arounds: look-aheads

To demonstrate how look-arounds work, and demonstrate the significance of this whole "zero-width" thing, I'm gonna use a real-world example that you'll probably encounter (in case extracting the word "aardvark" from "rubber aardvark" wasn't real-world enough for you ;-). The regex pattern in this example verifies that a password string contains mixed-case letters, plus at least one digit and one other character (like some punctuation). Unfortunately it's one if those patterns that looks like a punctuation convention is taking place on your screen, but hopefully that's a less daunting prospect now, and with deft use of garish colours, I should be able to help you deconstruct it in your head into understandable chunks. Here goes:

^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9]).{8,}$

OK, "eek" is an entirely appropriate reaction to that. But let's break it down.

First things first, the syntax for a positive-lookahead is done via a modified subexpression. Remember a sub-expression is denoted by parentheses: "(A|B)" is a subexpression representing a match of A or B. And one can modify a sub-expression to make it not "remember" what it matched (just remembering that it did match): "(?:A|B)". A positive look-ahead uses a different modifier within the first parenthesis: "(?=A|B)". And whilst we're about it, a negative one is thus: "(?!A|B)". We'll talk about negative look-aheads further down. Let's focus on positive ones to start with.

Back to breaking down this pattern into understandable (hopefully) chunks:

^ means "from the beginning of the string";
(?=.*[A-Z]) this is a positive look-ahead which looks for zero-or-more of any character, followed by a capital letter;
(?=.*[a-z]) as above, except looks for a  lower case letter;
(?=.*[0-9]) samesame, except a digit;
(?=.*[^a-zA-Z0-9]) and finally some character other than a letter (either case) or a letter. EG: some punctuation or something;
.{8,} now we're actually matching characters, but it can be any character as long as there's eight or more of them;
$ and this matches the end of the string.

That's all well and good, but why are we doing this? And why do we need all this look-ahead nonsense instead of just matching the actual characters? And we seem to be matching the characters we want in the string, and then matching another eight characters after that. Trust me: one needs to do it this way. Here's why...

Let's rewrite that regex without the look-aheads:

^([A-Z]+[a-z]+[0-9]+[^a-zA-Z0-9]+]){8,}$

At initially glance, that seems to match an eight-or-more-character string made up of at least a capital letter, a lowercase letter, a digit, and some other "special" character.

However if you desk-test this (go on, it's a good exercise), does it work (hint: no)? This pattern only works if the capital letters come first, then the lowercase ones, then the digits, then the punctuation. Can you see that?

Right, so we decide to try to come up with some sort of pattern to get around this by looking for zero-or-more of the other character type requirements before and each one we're checking for, and start writing something the size of War and Peace in regex syntax before realising that that won't work either because it might match actual matches, but it will also match a lot of false positives. The only way I'm aware of doing this is with positive look-aheads, and it works nicely.

So what does the look-ahead do that helps us here? Here's a section of the desk-test of the look-ahead version. I'm using the string "!1Aaaaaaa" (which has a capital, a lowercase letter, a digit and some punctuation. Secure as a secure thing).

First we look from the beginning of the string ("^") and look-ahead for .*[A-Z], which is zero or more of any character, followed by a capital letter:

!1Aaaaaaa

Bingo. Now onto the next atom of the regex: (?=.*[a-z]). This is zero or more of any character followed by a lowercase letter. The difference is because the first match was a look-ahead (and is zero width), we don't resume looks from the first "a", we go back to where that previous look-ahead started from - just after the beginning of the string in this case, and start from there. This next look-ahead finds a match thus:


!1Aaaaaaa

And then the next one - (?=.*[0-9]) - does much the same, again from the beginning of the string:

!1Aaaaaaa

Then the last look-ahead - (?=.*[^a-zA-Z0-9]) which is looking for the "punctuation" character" does the same:

!1Aaaaaaa

(note that there's a zero-length match for ".*" at the beginning of that).

So thus far we have "looked-ahead" and verified we have matched at least one each of the specific characters we need to fulfill the password restrictions, we just need to know it's the correct length. Because all of the sub-expressions so far have been look-aheads (and zero length), the "start position" is still just after the beginning of the string. So starting from there, we look for the last bit of the pattern: eight or more of any character , through to the end of the string (".{8,}$"):

!1Aaaaaaa

Does that all make sense? Hopefully with all the colours and stuff it does. The key point to remember here is the notion of look-arounds being zero-width, so the regular expression engines goes and makes sure the match is there, but otherwise doesn't include it when processing the rest of the pattern.

Negative look-aheads

It took me an age to get negative look-aheads straight in my small brain. This was during a time I could kinda make look-aheads work, but hadn't really worked out why they worked. Or indeed how.  But I once one gets a handle on how look-arounds work, then positive... negative... it's all the same process.  A negative look-ahead simply does exactly the same thing a positive one does, except confirms the absence of the pattern. Remember I said before that one cannot do this to confirm a string does not contain "CAT": "^.*[^CAT].*$". No, it checks the next character is not a C, A or T. But only checks one character. One can also not get clever and do this: "^.*[^C][^A][^T].*$" because the "[^C][^A][^T]" will match any other three characters in the string.  EG, the analysis of the match is as follows:



|Will you find a CAT in this sentence.|

^.*[^C][^A][^T].*$

Note that the second ".*" indeed matches zero characters there, but "*" is "zero or more", so that's A-OK.

A negative look-ahead does the trick here. They work exactly the same as the positive ones, except makes sure the pattern doesn't get matched. The solution to the cat dilemma is this:

^(?!.*CAT).*$

This looks from the start to the end of the string, within that it looks to make sure there is no "zero or more of any character followed by a CAT", and if there isn't, any number of any characters is fine.

So using the same sentence as before, we get this:

|Will you find a CAT in this sentence.|

At which point the negative look-ahead has failed, so there's no match.

What's a practical use for this? I was scratching my head for a good use case for this (one not involving rubber aardvarks or cats), and my colleague Duncan came up with a good one which we needed to use to check some of our code. He'd spotted that we had a coupla CFC methods which - for some reason, probably a leftover from some debugging - had OUTPUT="TRUE" on them. He wanted to check which methods had that, and also - being diligent -  also wanted to check which ones had no OUTPUT attribute specified at all.  For the latter, he used a negative look-ahead. I don't have his exact pattern to hand, but this works (it could be finetuned, yes, before anyone mentions this... I'm trying to keep the examples simple):

<cffunction(?![^>]+output\s*=\s*")[^>]+>

This can be used in ColdFusion Builder or Eclipse to check the codebase for any functions without an OUTPUT attribute. Handy.

To break this down, I'm gonna first extract the negative look-ahead bit and see what's left:

<cffunction[^>]+>

This matched the opening of a "<cffunction>" tag, followed by any characters that are not a ">", then followed by a ">". This is a usual sort of approach to matching a angle-bracket tag in a string. All we do then is to stick the negative look-ahead into it, which - in effect - says "that don't have "output=" in them, at any stage before we get to the closing ">". Note I've done the pattern with a "\s*" in there, because we want to (not) match both of these:

<cffunction name="f" output="false">
<cffunction name = "f" output = "false">

Indeed this'll also match a Nadel-esque style of trying to use up all the world's whitespace in one statement:

<cffunction
    name    = "f"
    output  = "false"
>

(TBH, if I found code like that in our codebase there'd be more trouble than simply a missing OUTPUT attribute. How to get the blood out of the carpet for starters... ;-)

Anyway, the "\s*" just means "any whitespace (or none)".

I'm not gonna step through that one because it's the same as the previous example, and you should be getting it by now (and if you're not, another example won't help, I think. I don't mean that in a bad way, but it means my way of explaining 'em ain't working!).

Look-behinds

My exposure to regular expressions is limited to ColdFusion's implementation - which doesn't support look-behinds - so I have no expertise or experience with them. I am presuming they work the same as look-aheads, but look backwards in the string before the current position. That would make sense. I'm not going to waffle too much about something I've not used, so I'll talk about something else now.

[well in the next article, anyhow. TBC...]

--
Adam