G'day:
The previous six entries in this series discuss what a regular expression is:
And the syntax of the regex engine ColdFusion uses:
So far, I've not really mentioned CFML in any of my examples: I've just dealt with the regex syntax. In this article (update: and the subsequent few) I'll look at the CFML tags and functions which use regular expressions.
The most basic operation one can perform with a regular expression in CFML is to simply check to see if the regex matches anything in a string. This is done with reFind() and its case-insensitive counterpart reFindNoCase(). These functions work exactly the same except for the case-sensitivity, so I'll just dwell on reFind(). Similarly for reMatch() and reReplace() later on: their ~NoCase() counterparts work the same, so other than identifying they exist, I'll comment no further on them.
Apologies for my idiosyncratic way of lower-camel-casing the "re" bit on the functions... I cannot bring myself to write a function with an initial capital letter. I know it's a bit odd.
Basics of reFind()
reFind() (see even at the beginning of a sentence... no cap... ;-) works in one of two ways. The most basic is just to work like find() does: return the index of where a match was found:
idx = reFind("\d", "ABC123");
This returns 4 (the position the first digit (\d) - "1" - is at). This operation is not much use beyond that, to be honest. One cannot reliably tell what was matched, because obviously it's a pattern being matched, but also because the length of a match cannot necessarily be inferred from the pattern, so one can't necessarily use mid() to extract the match like one might be able to with a straight string find (one knows the length of the substring being sought, after all). For example, there's no way to infer from a match of "\d*" whether the match was one digit or ten digits. Or indeed zero digits. So it's only good for a throwaway "was it there?" check.
Note that it returns 0 if there was no match (even a zero-width match before the first character of the string results in "1" being returned).
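As a minimal sketch of that throwaway check (the sample strings and variable names here are mine, purely for illustration), the integer result can simply be compared to zero:

```cfml
<cfscript>
// Illustrative only: the strings and variable names are assumptions.
hasDigit = reFind("\d", "ABC123") > 0; // a digit exists, so the index is positive
noDigit  = reFind("\d", "ABCDEF") > 0; // no digit: reFind() returns 0

writeOutput("hasDigit: " & hasDigit & "<br />");
writeOutput("noDigit: " & noDigit & "<br />");
</cfscript>
```

That's about all the integer form is good for: it says whether and where a match starts, not what the match was.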
Sub-expressions
Fortunately there's another argument reFind() can take to make it actually useful.
match = reFind("\d{2}", "ABC123", 1, true);
This time, we get back more info:
We get back a struct with keys pos and len, and each of those is an array. The values in the array enable us to extract what exactly was matched, eg:
s = "ABC123";
matches = reFind("\d{2}", s, 1, true);
match = mid(s, matches.pos[1], matches.len[1]);
This results in match having a value of "12": this was the two-digit substring matched by "\d{2}".
The third argument to reFind() is the position in which to start the match (same as the equivalent for find()), and it's not actually relevant here, but because CFML only supports named arguments on a few functions (I'm not sure why some do and some don't?), we need to specify it so as to be able to also specify the argument we're interested in: the true. The fourth argument is a flag as to whether to return subexpression info, and using true means one gets back that struct rather than just the integer index.
On initial take, one might wonder why the pos and len values are arrays, when there's only one element in the array. Well cast your mind back to the discussion on sub-expressions: a sub-expression's match can be "remembered"... remembered for both a back-reference later in the pattern or in a replacement pattern, but also it can be returned from a find operation too. Here's an example.
numbers = "one two three four five six seven eight nine ten";
regex = "\b\w([aeiou]{2})\w\b";
matches = reFind(regex, numbers, 1, true);
match = mid(numbers, matches.pos[1], matches.len[1]);
subexpression = mid(numbers, matches.pos[2], matches.len[2]);
writeDump(variables);
(That pattern matches a four letter word with two vowels in the middle)
struct |
MATCH | four |
MATCHES |
|
NUMBERS | one two three four five six seven eight nine ten |
REGEX | \b\w([aeiou]{2})\w\b |
SUBEXPRESSION | ou |
See that the arrays now have two elements:
- the first element is the overall match;
- the second element is the match of the sub-expression.
There will be as many elements in the array as there are sub-expressions (plus the first one which is the entire match).
One thing to bear in mind with the "there will be as many elements in the array as there are sub-expressions" is that there will be an element in the array for every sub-expression in the pattern, even an optional one that didn't take part in the match. Here's an example:
skuPattern = "^(\d{5})([A-Z]{2})?$";
skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);
writeDump(matches);
writeOutput("<hr />");
skuVariant = "12345AB";
matches = reFind(skuPattern, skuVariant, 1, true);
writeDump(matches);
Here an SKU can take one of two patterns: nnnnn or nnnnnxx where n is a digit and x is a letter. The pattern matches both of those, and in the case of the nnnnnxx version, it also grabs the xx as a sub-expression. We can see this in the results:
Note how for the "basic" SKU, there's still an array element for the xx part of the pattern, even though it wasn't matched. So bear in mind to check that the pos value is >0 before trying to extract a substring for the sub-expression. Otherwise you'll get this (which will become a very familiar error for you):
The 2 parameter of the Mid function, which is now 0, must be a non-negative integer
The error occurred in D:\websites\www.scribble.local\junk\junk.cfm: line 8
6 : writeDump(matches);
7 :
8 : variant = mid(skuBasic, matches.pos[3], matches.len[3]);
(nice pidgin English from Adobe there with the error message btw: "the 2 parameter"??)
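A minimal sketch of the guard described above, reusing the skuPattern and skuBasic values from the example. The variant variable and the defensive arrayLen() check are my own additions rather than anything the engine requires:

```cfml
<cfscript>
skuPattern = "^(\d{5})([A-Z]{2})?$";
skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);

// Element 1 is the whole match, 2 is (\d{5}), 3 is the optional ([A-Z]{2}).
// Only call mid() when the sub-expression actually matched (pos > 0).
variant = ""; // hypothetical default for "no variant present"
if (arrayLen(matches.pos) >= 3 && matches.pos[3] > 0) {
    variant = mid(skuBasic, matches.pos[3], matches.len[3]);
}
writeOutput("variant: [" & variant & "]");
</cfscript>
```

With the basic SKU the mid() call is skipped entirely, so the error above never gets a chance to fire.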
A variation on this is that a match can be made, but it can be zero length. One needs to be mindful of this too. Consider this example, which shows all four of the possibilities I mentioned:
struct function reFindMatches(required string regex, required string string){
    var result = reFind(regex, string, 1, true);
    result._match = [];
    for (var i=1; i <= arrayLen(result.pos); i++){
        if (result.pos[i] == 0){
            arrayAppend(result._match, "NO MATCH");
        }else if (result.len[i] == 0){
            arrayAppend(result._match, "ZERO-LENGTH MATCH");
        }else{
            arrayAppend(result._match, mid(string, result.pos[i], result.len[i]));
        }
    }
    return result;
}
skuPattern = "^([A-Z]{3})(?:-(?:DEFAULT|(\d*)))?$";
skuVariant = "ABC";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);
writeOutput("<hr />");
skuVariant = "ABC-DEFAULT";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);
writeOutput("<hr />");
skuVariant = "ABC-123";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);
writeOutput("<hr />");
skuVariant = "ABC-";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);
Here I've added a function reFindMatches() that improves reFind() in that as well as returning the pos and len of what was matched, it also returns what actually was matched. Which it ought to do off the bat, as all one would ever want pos/len for was to extract the matched substring. Indeed I have raised an E/R to augment reFind() and deprecate the mostly useless reMatch(): 3321666.
I've also used a pattern that matches four variations of SKU:
- Just the basic "three letters" version;
- Three letters with the "default" variant;
- Three letters with a custom numeric variant;
- Three letters with a zero-length numeric variant (this one stretches credulity, I know, sorry).
Here are the results:
123 - struct |
LEN |
|
POS |
|
_MATCH |
|
ABC - struct |
LEN |
|
POS |
|
_MATCH |
ABC - array |
1 | ABC |
2 | ABC |
3 | NO MATCH |
|
ABC-DEFAULT - struct |
LEN |
ABC-DEFAULT - array |
1 | 11 |
2 | 3 |
3 | 0 |
|
POS |
ABC-DEFAULT - array |
1 | 1 |
2 | 1 |
3 | 0 |
|
_MATCH |
ABC-DEFAULT - array |
1 | ABC-DEFAULT |
2 | ABC |
3 | NO MATCH |
|
ABC-123 - struct |
LEN |
|
POS |
|
_MATCH |
ABC-123 - array |
1 | ABC-123 |
2 | ABC |
3 | 123 |
|
ABC- - struct |
LEN |
|
POS |
|
_MATCH |
ABC- - array |
1 | ABC- |
2 | ABC |
3 | ZERO-LENGTH MATCH |
|
This demonstrates various permutations of matches:
- Zero pos and zero len in the first element: no match at all. Equivalent to a 0 result from a standard find() operation.
- Zero pos and len in one of the other elements: no match for that specific sub-expression, but overall there was still a match. This implies the sub-expression was optional.
- A positive value for both pos and len: there was a substring match.
- A positive value for pos, but a zero len: there was a match, but it was zero-length.
Hopefully that range of unduly repetitious examples demonstrates the types of matches (or lack thereof) one can expect from reFind().
Starting position
The other argument I glossed over was the "starting position" one - the third argument - there's not much to say about this other than to be aware of it. It works the same as with find(). One thing to note, which also kind of demonstrates why reFind() isn't very useful without the fourth argument set to true, is that if one wants to cycle through all the matches in a string, the normal approach is to do this sort of thing:
// pseudo code
set findStartPos = 1
while (findOperation matches something){
    process the match
    set findStartPos = after last char of this match
}
This step is not really possible with a lot of patterns (ones that don't necessarily match a fixed length substring), unless one returns sub-expressions. Without the sub-expression data being returned, one has only the start position of the match, but one doesn't know how long it is, so one can't skip past it.
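Here's a rough sketch of that loop in actual CFML, using the sub-expression data to work out where the next search should start. The sample string, the "\d+" pattern and the variable names are all just assumptions for the sake of the example:

```cfml
<cfscript>
// Illustrative values only
s = "a1 bb22 ccc333";
regex = "\d+";

findStartPos = 1;
found = [];
while (findStartPos <= len(s)) {
    m = reFind(regex, s, findStartPos, true);
    if (m.pos[1] == 0) {
        break; // no further matches
    }
    // process the match: here we just collect the matched substring
    arrayAppend(found, mid(s, m.pos[1], m.len[1]));
    // resume after the last char of this match; always advance at least one
    // character so a pattern that can match zero-length cannot loop forever
    findStartPos = m.pos[1] + max(m.len[1], 1);
}
writeDump(found);
</cfscript>
```

With that sample string, found should end up holding "1", "22" and "333". Note it's the len value that makes the "skip past this match" step possible at all.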
reFindNoCase()
This function is kinda redundant, given one can simply specify (?i) in the regex to make it case-insensitive anyhow.
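A quick sketch comparing the two approaches (the sample string and variable names are mine); on the engines I'd expect this to run on, both calls should report the same position:

```cfml
<cfscript>
// Illustrative only
s = "Hello World";
viaFunction = reFindNoCase("world", s);  // case-insensitive via the function
viaFlag     = reFind("(?i)world", s);    // case-insensitive via the (?i) flag in the pattern

writeOutput("reFindNoCase(): " & viaFunction & "<br />");
writeOutput("reFind() with (?i): " & viaFlag & "<br />");
</cfscript>
```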
Blimey. I was hoping to cover all the CFML regex functionality in one article, but I'm up to 1800 words on just reFind(). So I'm gonna stop here, before you nod off (if you haven't already), and continue with reReplace() in the next article.
Until the next time...
--
Scheherazade