Thursday, 10 January 2013

Regular expressions in CFML (part 7: reFind())

G'day:
The previous six entries in this series discuss what a regular expression is, and the syntax of the regex engine ColdFusion uses.
So far, I've not really mentioned CFML in any of my examples: I've just dealt with the regex syntax. In this article (update: and the subsequent few) I'll look at the CFML tags and functions which use regular expressions.

The most basic operation one can perform with a regular expression in CFML is to simply check to see if the regex matches anything in a string. This is done with reFind() and its case-insensitive counterpart reFindNoCase(). These functions work exactly the same except for the case-sensitivity, so I'll just dwell on reFind(). Similarly for reMatch() and reReplace() later on: their ~NoCase() counterparts work the same, so other than identifying they exist, I'll comment no further on them.

Apologies for my idiosyncratic way of lower-camel-casing the "re" bit on the functions... I cannot bring myself to write a function with an initial capital letter. I know it's a bit odd.

Basics of reFind()

reFind() (see even at the beginning of a sentence... no cap... ;-) works in one of two ways. The most basic is just to work like find() does: return the index of where a match was found:

idx = reFind("\d", "ABC123");

This returns 4 (the position the first digit (\d) - "1" - is at). This operation is not much use beyond that, to be honest. One cannot reliably tell what was matched, because obviously it's a pattern being matched, but also because the length of a match cannot necessarily be inferred from the pattern, so one can't necessarily use mid() to extract the match like one might be able to with a straight string find (one knows the length of the substring being sought, after all). For example, there's no way to infer from a match of "\d*" whether the match was one digit or ten digits. Or indeed zero digits. So it's only good for a throwaway "was it there?" check.

Note that it returns 0 if there was no match (even a zero-width match before the first character of the string results in "1" being returned).
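
To make those return values concrete, here's a quick sketch of the three cases just described. The sample strings are just ones I've made up for illustration.

```cfml
// a match: returns the position of the first digit
idx = reFind("\d", "ABC123");      // 4

// no match: returns 0
idx = reFind("\d", "ABCDEF");      // 0

// \d* can match zero digits, so it "matches" before the first character
idx = reFind("\d*", "ABC123");     // 1

// so the only really safe use of the simple form is a boolean check
hasDigit = reFind("\d", "ABC123") > 0;
```

Note how the last-but-one case gives no indication that nothing was actually consumed: the 1 looks just like a "real" match at the start of the string.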

Sub-expressions

Fortunately there's another argument reFind() can take to make it actually useful.

match = reFind("\d{2}", "ABC123", 1, true);

This time, we get back more info:

struct
    LEN: array
        1: 2
    POS: array
        1: 4

We get back a struct with keys pos and len, and each of those is an array. The values in the array enable us to extract what exactly was matched, eg:

s = "ABC123";
matches = reFind("\d{2}", s, 1, true);

match = mid(s, matches.pos[1], matches.len[1]);


This results in match having a value of "12": this was the two-digit substring matched by "\d{2}".

The third argument to reFind() is the position in which to start the match (same as the equivalent for find()), and it's not actually relevant here, but because CFML only supports named arguments on a few functions (I'm not sure why some do and some don't?), we need to specify it so as to be able to also specify the argument we're interested in: the true. The fourth argument is a flag as to whether to return subexpression info, and using true means one gets back that struct rather than just the integer index.

On initial take, one might wonder why the pos and len values are arrays, when there's only one element in the array. Well cast your mind back to the discussion on sub-expressions: a sub-expression's match can be "remembered"... remembered for both a back-reference later in the pattern or in a replacement pattern, but also it can be returned from a find operation too. Here's an example.

numbers = "one two three four five six seven eight nine ten";
regex = "\b\w([aeiou]{2})\w\b";

matches = reFind(regex, numbers, 1, true);
match = mid(numbers, matches.pos[1], matches.len[1]);
subexpression = mid(numbers, matches.pos[2], matches.len[2]);

writeDump(variables);

(That pattern matches a four-letter word with two vowels in the middle.)

struct
    MATCH: four
    MATCHES: struct
        LEN: array
            1: 4
            2: 2
        POS: array
            1: 15
            2: 16
    NUMBERS: one two three four five six seven eight nine ten
    REGEX: \b\w([aeiou]{2})\w\b
    SUBEXPRESSION: ou

See that the arrays now have two elements:
  • the first element is the overall match;
  • the second element is the match of the sub-expression.
There will be as many elements in the array as there are sub-expressions (plus the first one which is the entire match).
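
As a further sketch of that "one element per sub-expression, plus one for the whole match" rule, here's a pattern with three sub-expressions, giving four-element arrays. The date string and the variable names are just ones I've picked for the example. All three sub-expressions are mandatory here, so every pos value will be greater than zero if there's a match at all.

```cfml
dateString = "Thursday, 10 January 2013";
matches = reFind("(\d{1,2}) (\w+) (\d{4})", dateString, 1, true);

// element 1 is the whole match; elements 2-4 are the three sub-expressions
parts = [];
for (i=1; i <= arrayLen(matches.pos); i++){
    arrayAppend(parts, mid(dateString, matches.pos[i], matches.len[i]));
}
writeDump(parts);
```

The dumped array should contain "10 January 2013", "10", "January" and "2013", in that order.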

One thing to bear in mind with the "there will be as many elements in the array as there are sub-expressions" is that there will be an element in the array even if the sub-expression could possibly not be matched. Here's an example:

skuPattern = "^(\d{5})([A-Z]{2})?$";

skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);
writeDump(matches);

writeOutput("<hr />");

skuVariant = "12345AB";
matches = reFind(skuPattern, skuVariant, 1, true);
writeDump(matches);

Here an SKU can take one of two patterns: nnnnn or nnnnnxx where n is a digit and x is a letter. The pattern matches both of those, and in the case of the nnnnnxx version, it also grabs the xx as a sub-expression. We can see this in the results:

struct
    LEN: array
        1: 5
        2: 5
        3: 0
    POS: array
        1: 1
        2: 1
        3: 0

struct
    LEN: array
        1: 7
        2: 5
        3: 2
    POS: array
        1: 1
        2: 1
        3: 6

Note how for the "basic" SKU, there's still an array element for the xx part of the pattern, even though it wasn't matched. So bear in mind to check that the pos value is >0 before trying to extract a substring for the sub-expression. Otherwise you'll get this (which will become a very familiar error for you):

The 2 parameter of the Mid function, which is now 0, must be a non-negative integer

The error occurred in D:\websites\www.scribble.local\junk\junk.cfm: line 8
6 : writeDump(matches);
7 : 
8 : variant = mid(skuBasic, matches.pos[3], matches.len[3]);

(nice pidgin English from Adobe there with the error message btw: "the 2 parameter"??)
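
A minimal sketch of the guard that avoids that error, reusing the SKU pattern from above. Defaulting variant to an empty string is just my choice for the example; use whatever "not there" value suits your code.

```cfml
skuPattern = "^(\d{5})([A-Z]{2})?$";
skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);

// only call mid() if the optional sub-expression actually took part in the match
if (matches.pos[3] > 0){
    variant = mid(skuBasic, matches.pos[3], matches.len[3]);
}else{
    variant = "";
}
```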

A variation on this is that a match can be made, but it can be zero length. One needs to be mindful of this too. Consider this example, which shows all four of the possibilities I mentioned:

struct function reFindMatches(required string regex, required string string){
    var result = reFind(regex, string, 1, true);

    result._match = [];
    for (var i=1; i <= arrayLen(result.pos); i++){
        if (result.pos[i] == 0){
            arrayAppend(result._match, "NO MATCH");
        }else if (result.len[i] == 0){
            arrayAppend(result._match, "ZERO-LENGTH MATCH");
        }else{
            arrayAppend(result._match, mid(string, result.pos[i], result.len[i]));
        }
    }
    return result;
}


skuPattern = "^([A-Z]{3})(?:-(?:DEFAULT|(\d*)))?$";

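skuVariant = "123";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);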

writeOutput("<hr />");

skuVariant = "ABC-DEFAULT";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-123";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

Here I've added a function reFindMatches() that improves on reFind(): as well as returning the pos and len of what was matched, it also returns what was actually matched. Which reFind() ought to do off the bat, as all one would ever want pos / len for is to extract the matched substring. Indeed I have raised an E/R to augment reFind() and deprecate the mostly useless reMatch(): 3321666.

I've also used a pattern that matches four variations of SKU:
  • Just the basic "three letters" version;
  • Three letters with the "default" variant;
  • Three letters with a custom numeric variant;
  • Three letters with a zero-length numeric variant (this one stretches credulity, I know, sorry).
Here are the results:

123 - struct
    LEN: array
        1: 0
    POS: array
        1: 0
    _MATCH: array
        1: NO MATCH

ABC - struct
    LEN: array
        1: 3
        2: 3
        3: 0
    POS: array
        1: 1
        2: 1
        3: 0
    _MATCH: array
        1: ABC
        2: ABC
        3: NO MATCH

ABC-DEFAULT - struct
    LEN: array
        1: 11
        2: 3
        3: 0
    POS: array
        1: 1
        2: 1
        3: 0
    _MATCH: array
        1: ABC-DEFAULT
        2: ABC
        3: NO MATCH

ABC-123 - struct
    LEN: array
        1: 7
        2: 3
        3: 3
    POS: array
        1: 1
        2: 1
        3: 5
    _MATCH: array
        1: ABC-123
        2: ABC
        3: 123

ABC- - struct
    LEN: array
        1: 4
        2: 3
        3: 0
    POS: array
        1: 1
        2: 1
        3: 5
    _MATCH: array
        1: ABC-
        2: ABC
        3: ZERO-LENGTH MATCH

This demonstrates various permutations of matches:
  • Zero pos and zero len in the first element: no match at all. Equivalent to a 0 result from a standard find() operation.
  • Zero pos and zero len in one of the other elements: no match for that specific sub-expression, but overall there was still a match. This implies the sub-expression was optional.
  • A positive value for both pos and len: there was a substring match.
  • A positive value for pos, but a zero len: there was a match, but it was zero-length.
Hopefully that range of unduly repetitious examples demonstrates the types of matches (or lack thereof) one can expect from reFind().

Starting position

The other argument I glossed over was the "starting position" one (the third argument). There's not much to say about this other than to be aware of it: it works the same as with find(). One thing to note, which also kind of demonstrates why calling reFind() without the fourth argument set to true isn't very useful, is that if one wants to cycle through all the matches in a string, the normal approach is to do this sort of thing:

// pseudo code
set findStartPos = 1
while (findOperation matches something){
    process the match
    set findStartPos = after last char of this match
}

This step is not really possible with a lot of patterns (ones that don't necessarily match a fixed length substring), unless one returns sub-expressions. Without the sub-expression data being returned, one has only the start position of the match, but one doesn't know how long it is, so one can't skip past it.
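
Here's a rough CFML rendering of that pseudo-code, using the sub-expression data to work out where to resume. The sample string and the variable names are mine, and I've picked a pattern (\d+) that can never match zero characters. With a pattern that can match zero-length, one would need to bump startPos by at least one each time, or the loop would never move on.

```cfml
s = "10 apples, 200 pears and 3000 plums";
found = [];
startPos = 1;

while (true){
    m = reFind("\d+", s, startPos, true);
    if (m.pos[1] == 0){
        break;    // no further matches
    }
    arrayAppend(found, mid(s, m.pos[1], m.len[1]));

    // resume immediately after the last character of this match
    startPos = m.pos[1] + m.len[1];
}
writeDump(found);
```

This should leave found containing "10", "200" and "3000". Note that the pos values are positions in the whole string, not offsets from startPos, which is what makes the "start + length" arithmetic work.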

reFindNoCase()

This function is kinda redundant, given one can simply specify (?i) in the regex to make it case-insensitive anyhow.
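
A quick sketch of the equivalence, with a made-up test string:

```cfml
idx = reFindNoCase("abc", "xxABCxx");    // 3
idx = reFind("(?i)abc", "xxABCxx");      // 3: same thing, via the (?i) flag
idx = reFind("abc", "xxABCxx");          // 0: case-sensitive, so no match
```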



Blimey. I was hoping to cover all the CFML regex functionality in one article, but I'm up to 1800 words on just reFind(). So I'm gonna stop here, before you nod off (if you haven't already), and continue with reReplace() in the next article.

Until the next time...

--
Scheherazade