Thursday, 17 January 2013

Regular expressions in CFML (part 8: the rest of CFML's support for regular expressions)

G'day:

[Copy-and-paste-from-the-previous-article alert!]

The previous seven entries in this series discussed what a regular expression is, and the syntax of the regex engine ColdFusion uses. I have since moved on to the specifics of how CFML implements regular expression technology in the language.
This article will cover the rest of the functions / tags in CFML which utilise regular expressions.

reMatch()

ColdFusion 8 added reMatch() and reMatchNoCase() to CFML. I have to say I think their utility is lacking. On one hand reMatch() is handy in that it returns all matches for the pattern from within the given string (whereas reFind() only matches the first); on the other hand it doesn't return any sub-expression info, which is kinda integral to using regular expressions. I would say more than 50% of my regex work requires isolating sub-expressions as well as the main match, so reMatch() is really no use to me. Equally, I can't help but wonder whether adding a new argument to reFind() might have been a better way to go? I guess that can be argued either way. What we've ended up with is two functions with a reasonable amount of overlap, but neither of them doing the required job thoroughly. I've raised an E/R to deprecate reMatch() and shift its functionality over to reFind(): 3321666.

OK, so what does reMatch() do? Well it takes a regular expression pattern, and returns all the matches of it from within a string.  EG:

seuss    = "ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika tawhito, ika hou";
regex    = "(?i)\b(\w+)\s+(whero|kikorangi|mangu)(?=,|$)";

colouredThings = reMatch(regex, seuss);
writeDump(colouredThings);

This yields:
 
array
1    ika whero
2    ika kikorangi
3    ika mangu
4    ika kikorangi

That is superficially useful: we've found the overall matches. But note my regex also had two sub-expressions - which CF should have returned - and these have been completely omitted from the result.

Beer
If you tell me what that Maori in the code example says, I'll buy you a beer next time I see you. The offer does not extend to the guys at work who I've already mentioned this to.


A more useful approach is demonstrated by this function:

array function reFindMatches(required string regex, required string string){
    var startPos = 1;
    var results  = [];

    do {
        // returnsubexpressions=true: get pos/len arrays covering the overall match and each sub-expression
        var matches = reFind(regex, string, startPos, true);
        if (matches.pos[1]){
            var match = [];
            for (var i=1; i <= arrayLen(matches.pos); i++){
                var result = {};
                result.pos = matches.pos[i];
                result.len = matches.len[i];
                if (matches.pos[i] == 0){
                    // this sub-expression took no part in the match
                    result.match = javacast("null", "");
                }else if (matches.len[i] == 0){
                    // it took part, but matched an empty string
                    result.match = "";
                }else{
                    result.match = mid(string, matches.pos[i], matches.len[i]);
                }
                arrayAppend(match, result);
            }
            arrayAppend(results, match);
            // resume after this match; move at least one character so a zero-length match cannot loop forever
            startPos = matches.pos[1] + max(matches.len[1], 1);
        }
    } while(matches.pos[1]);

    return results;
}

The result of using this with the same string / regex would be:

array
1    array
     1    struct    LEN: 9     MATCH: ika whero        POS: 20
     2    struct    LEN: 3     MATCH: ika              POS: 20
     3    struct    LEN: 5     MATCH: whero            POS: 24
2    array
     1    struct    LEN: 13    MATCH: ika kikorangi    POS: 31
     2    struct    LEN: 3     MATCH: ika              POS: 31
     3    struct    LEN: 9     MATCH: kikorangi        POS: 35
3    array
     1    struct    LEN: 9     MATCH: ika mangu        POS: 46
     2    struct    LEN: 3     MATCH: ika              POS: 46
     3    struct    LEN: 5     MATCH: mangu            POS: 50
4    array
     1    struct    LEN: 13    MATCH: ika kikorangi    POS: 57
     2    struct    LEN: 3     MATCH: ika              POS: 57
     3    struct    LEN: 9     MATCH: kikorangi        POS: 61

This still returns an array element for each of the matches, but it returns the sub-expression details too (pos, len, match for each). I've also organised the len/pos/match bits more sensibly: as they relate to each match, rather than separate arrays of pos, len and match data (which has always struck me as an odd way for CF to present the results of an reFind() operation).

I would say that in the majority of situations where I have a regex to match, the regex also has sub-expressions which are actually the important bit of the match, rather than the overall match. Mileage might vary, I guess.
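As a rough sketch of why that matters, here's how one might pull out just the colours (the second sub-expression) from those results. It assumes the reFindMatches() UDF and the seuss and regex variables from the listings above are in scope; the allMatches and colours variables are just names made up for this illustration:

```cfml
<cfscript>
// assumes reFindMatches(), seuss and regex are defined as per the listings above
allMatches = reFindMatches(regex, seuss);
colours    = [];

for (i=1; i <= arrayLen(allMatches); i++){
    // element 1 is the overall match; elements 2 and 3 are the two sub-expressions
    arrayAppend(colours, allMatches[i][3].match);
}

writeDump(colours); // whero, kikorangi, mangu, kikorangi
</cfscript>
```

Doing the same thing with reMatch() would mean running a second regex over each of the strings it returns.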

reMatch() doesn't include an argument for starting position either, which is a bit at odds with how reFind() works. Although I guess the reason reFind() has a start position is so one can do what I've done in that UDF: shift along the string to after a match to start looking for the next match. This is not necessary for reMatch(), I guess. Fair enough.

That's about it as far as reMatch() goes. It's not really very useful or well-thought-out, I think.

reReplace()

As well as locating stuff in a string, regexes can also be used to make replacements in strings. At its most basic level reReplace() works analogously to replace(): it matches something, and replaces it:

id            = createUuid();
matchRegex    = "[[:xdigit:]]";
maskChar    = "X";

mask = reReplace(id, matchRegex, maskChar, "ALL");

writeDump({id=id, mask=mask});

Result:
struct
ID      3DA0EA99-D067-E5E6-F12EE3ECDDBFF653
MASK    XXXXXXXX-XXXX-XXXX-XXXXXXXXXXXXXXXX

Aside: one thing I noticed here is that those POSIX character classes are case-sensitive. I initially wrote [[:XDIGIT:]], and that didn't work.

Better than that, replacements can use sub-expressions too:

usDateMatchPattern            = "^(\d{2})/(\d{2})/(\d{4})$";
us2IsoDateReplacePattern    = "\3-\1-\2";

dateUsFormat                = "03/24/2010";

dateIsoFormat                = reReplace(dateUsFormat, usDateMatchPattern, us2IsoDateReplacePattern, "ONE");

writeDump({us=dateUsFormat, iso=dateIsoFormat});

Result:
struct
ISO    2010-03-24
US     03/24/2010

Note the "trick" here (not really a trick, as it's standard regex practice, and in the docs ;-) is that one references the sub-expressions via the "\n" back-reference syntax, where n is the number of the sub-expression from the match pattern. One thing to note from this is that the number a back-reference gets is based on where the sub-expression's opening parenthesis occurs, not its closing one. An argument could be made either way here, but the numbering follows where each sub-expression starts, not where it's completed. EG:

datePattern                    = "(?i)^(\d{1,2}) (([a-z]{3})\S*) (\d{2}(\d{2}))$";
extractedDatePartPattern    = "d:\1; mmmm:\2; mmm:\3; yyyy:\4; yy:\5";

date                        = "3 March 2010";

parts                        = reReplace(date, datePattern, extractedDatePartPattern, "ONE");

writeDump({date=date, parts=parts});

struct
DATE     3 March 2010
PARTS    d:3; mmmm:March; mmm:Mar; yyyy:2010; yy:10

The other thing to remember with referencing sub-expressions by back-reference is something I mentioned in a previous article: if one wishes to distinguish between the nmth back reference and the nth back reference followed by an m, one can separate n from m using \E. EG:
  • \12 refers to the 12th sub-expression;
  • \1\E2 refers to the first sub-expression, followed by a 2.

reEscape()

This was added in CF10. There's not much to say here, I'll just summarise the docs (knowing me... my summary will end up being longer than what the docs have, I betcha ;-). As you know, some characters have special meaning in a regular expression, eg: [], {}, etc. This function takes a string and escapes any regex-meaningful characters it contains. Why would one need to do this? Well it doesn't crop up very often, and I've drawn a blank when it came to coming up with a real-world example of where it might be useful. This says something about either:
  • my imagination;
  • how often one needs to do this sort of thing.
I have needed to escape regexes in the past, but it predates reEscape() existing, and I cannot find the code now.

Basically sometimes you might not simply be hard-coding your regular expression pattern, it might be built using some values sourced from elsewhere (user input, a DB call etc), and there will be no guarantee that the dynamic value will be "regex-friendly". For example if it had an opening parenthesis in it but no closing one, that'll cause an error if it's passed to the regex engine. Or even if it doesn't actually error, having a character like "." in a normal string means "full-stop" or "a decimal point", whereas if that's passed into the regex engine, it'll mean "any one character", so any matches might not be what was intended (finding "image.jpg" will also match "imagexjpg"... this is unlikely to ever happen, but it demonstrates the case). So to prevent the single parenthesis or the dot breaking things or acting in an unexpected fashion, use reEscape() on the string first, to turn any regex-meaningful character in it into its literal counterpart.

If anyone comes up with a good real-world example I can use for this, it'd be great if you could flick me some demo code and I'll include it here. The docs (linked to above) have a basic example, but it's not in-context, which is not so useful as a learning experience.
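In lieu of a real-world case, here's a contrived sketch of the "image.jpg" situation described above. The fileName and haystack variables are made up for the illustration, and it assumes CF10 or later, given that's when reEscape() arrived:

```cfml
<cfscript>
fileName = "image.jpg";    // pretend this came from user input or a DB call
haystack = "imagexjpg";    // does not contain the literal file name

// unescaped: the "." means "any one character", so this finds a match at position 1
writeOutput(reFind(fileName, haystack));

// escaped: the "." is now a literal full stop, so nothing is found and this outputs 0
writeOutput(reFind(reEscape(fileName), haystack));
</cfscript>
```

The point is simply that the escaping happens on the dynamic value before it goes anywhere near the regex engine; any hard-coded parts of the pattern are left alone.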

<cfparam> / param

(no link for param, sorry... there's no coherent docs for it. What Adobe have said is basically limited to "it exists", which is not very helpful).

Anyway, <cfparam> is mostly used to default variables with a value if they don't already have one, but it's also really useful for validating that the variable value is of the intended type. And one of the validation options is to check the value against a regular expression pattern, eg:

<cfparam name="URL.id" type="regex" pattern="[[:xdigit:]]{8}(?:-[[:xdigit:]]{4}){3}-?[[:xdigit:]]{12}">

This ensures the ID parameter is a UUID or GUID, and if it isn't, it raises an "Expression" exception, with the message/detail of "Invalid parameter type. The value does not match the regular expression pattern provided.". It'd be really nice if CF gave it a better exception type than just "Expression", but that's a minor gripe.


There's two things to note here:
  • Despite not having to specify as much, there's an implicit "^" and "$" around the regex used. IE: the entire parameter value must match the pattern.
  • It's a case-sensitive pattern. So [A-Z]{4} - for example - won't match "tahi", but will match "TORU". So if you want a case-insensitive check, remember to start the pattern with (?i).
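Those two points can be sketched in script syntax along these lines. This is a minimal illustration only: URL.code is a made-up parameter, hard-coded so the example stands alone, and the catch is for "any" purely for brevity:

```cfml
<cfscript>
URL.code = "tahi"; // hypothetical value, hard-coded here purely so the sketch runs stand-alone

try {
    // (?i) makes the check case-insensitive; no "^" or "$" needed as the whole value must match
    param name="URL.code" type="regex" pattern="(?i)[A-Z]{4}";
    writeOutput("URL.code passed validation");
} catch (any e){
    // per the above, the exception raised on a failed match is of type "Expression"
    writeOutput("URL.code failed validation: #e.message#");
}
</cfscript>
```

Take the (?i) out and "tahi" should end up in the catch block instead.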

isValid()

This works much the same as <cfparam> in an equivalent situation, but doesn't raise an exception, it returns a boolean:

<cfoutput>#isValid("regex", URL.id, "[[:xdigit:]]{8}(?:-[[:xdigit:]]{4}){3}-?[[:xdigit:]]{12}")#</cfoutput>

Again, this is case-sensitive, so use the flag to make it insensitive if you need to.
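A quick sketch of that, using a made-up value variable and the same four-letter pattern as the <cfparam> example:

```cfml
<cfscript>
value = "tahi"; // hypothetical test value

writeOutput(isValid("regex", value, "[A-Z]{4}"));     // case-sensitive, so this one is false
writeOutput(isValid("regex", value, "(?i)[A-Z]{4}")); // case-insensitive, so this one is true
</cfscript>
```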

<cfinput>

<cfinput> is a funny old thing, innit? And investigating how its regex support works has reinforced that notion for me.

With <cfinput>, one can have this sort of thing:

<cfif structKeyExists(form, "btnSubmit")>
    <cfdump var="#form#">
</cfif>
<cfform method="POST" action="#CGI.script_name#">
    SKU (XXX-999): <cfinput type="text" name="sku" required="true" pattern="^[A-Z]{3}-\d{3}$" validate="regular_expression" />
    <input type="submit" name="btnSubmit" value="Submit" />
</cfform>

Here we accept an SKU which must be three capital letters, a dash, then three digits. This works adequately, but there's a number of idiosyncrasies to note:
  • in the default situation, the regular expression processing is done via Javascript on the browser, so the regex needs to follow Javascript's syntax rules, not ColdFusion's (there are a few differences, but nothing major that I'm aware of);
  • the regex is case-sensitive, and one cannot use (?i) to make it case-insensitive because that's one of the things Javascript doesn't support, so one's gotta hard-code a case-insensitive pattern if that's what one needs;
  • the regex doesn't automatically match the whole string the user has entered, so if that's what one wants, one needs to put the "^" and "$" in. Note that if I didn't have that in my regex above, then "XXXX-9999" would pass validation too.
Also note that one can opt to use validateat="onserver", in which case the regex is processed by ColdFusion.
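As an illustration of the second point in that list, a case-insensitive version of the SKU check might look something like this. It's just a variation on the form above, with both cases spelt out in the character class:

```cfml
<cfform method="POST" action="#CGI.script_name#">
    <!--- no (?i) available here, so upper and lower case are both listed in the character class --->
    SKU (xxx-999 or XXX-999): <cfinput type="text" name="sku" required="true" pattern="^[A-Za-z]{3}-\d{3}$" validate="regular_expression" />
    <input type="submit" name="btnSubmit" value="Submit" />
</cfform>
```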

About "server-side" validation via <cfinput>
Don't be lulled into a false sense of security by thinking using validateat="onserver" gives you valid server-side validation here. It does not. All this option does is to add some hidden form fields to your form which are then detected by ColdFusion when the submitted form request arrives at the server, and used to perform validation. So: server-side, right? Well yes, but the whole point of doing server-side validation over client-side is that the server-side validation cannot be circumvented on the client. However because the control of the validation is embedded in the client-side mark-up, it is very easy to remove it (like with Firebug), or simply perform the POST request from one's own form which doesn't have these form fields. So this does not really count as "server-side validation".

What's more, I really question the point of even doing this for any reason. I cannot see a way to intercept the submitted form and then interact with this validation... ColdFusion seems to do it all outside of the normal request lifecycle. onError() will trap the exception raised by a validation failure I guess, but the exception that's thrown is not a "FormValidationException" or something useful like that, it's just of type "Application". "Interestingly" this validation check is done before onRequestStart() is called, too. I dunno how much of an issue that is, but it all makes the whole thing seem:
  • half-baked;
  • a security risk (if people think it's doing proper server-side validation);
  • just a waste of time.
I say: don't use this feature. And if you have already: remove it, and do proper server-side validation.
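By way of contrast, a bare-bones sketch of "proper" server-side validation for the SKU field might look like this. It assumes the same form field names as the example above (sku and btnSubmit); the skuIsValid variable and the messages are made up for the illustration:

```cfml
<cfscript>
if (structKeyExists(form, "btnSubmit")){
    // the pattern lives in server-side code only, so nothing on the client can remove or alter it
    skuIsValid = structKeyExists(form, "sku") && isValid("regex", form.sku, "^[A-Z]{3}-\d{3}$");

    if (skuIsValid){
        writeOutput("SKU accepted");
    }else{
        writeOutput("SKU must be three capital letters, a dash, then three digits");
    }
}
</cfscript>
```

Because the check runs in one's own code during the normal request, one also gets to decide what happens on failure, rather than having ColdFusion throw an exception before onRequestStart() has even run.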

I don't think there's much left to say about <cfinput> and regular expressions. Personally, as with all the rest of the client-side validation <cfinput> offers, I'd not use it: it's a bit shonky. And it's just much easier, and more polished and professional-looking, to effect the same functionality using a proper JS library / framework rather than these after-thoughts in CFML.



OK, I think that covers all the CFML functionality that uses regular expressions? I asked the rest of the team here, and no-one could think of anything else (and most of us - meself included - didn't even know about reEscape()!). If anyone can think of anything else in CFML that uses 'em: sing out and I'll update the article.

This concludes the "ColdFusion" part of this series, but I've decided that having a look at Javascript regular expressions and also Java ones would be useful knowledge to share with CF developers, so I'll write another article or two on those. I need to get up to speed with them first, though!

Also, if there's any Railo- or OpenBD-specific stuff that leverages regexes which ColdFusion doesn't: let me know, and I'll write an annexe covering that too.

Cheers for now.

--
Adam