Thursday, 22 May 2014

CFML: Regex to the rescue again

G'day:
Once again (prev: "Regex for simplifying string manipulation logic") I found myself able to slough off a coupla dozen lines of code in a CFLib.org UDF, by using a regex (well: two) in place of a bunch of looping and branching logic.

There's nothing new in this article, but it's a good real-world demonstration of where regexes can replace logic.


This UDF (titleCaseList()) got on my radar because someone mentioned a bug it had today, and looking at the comments, there was an earlier bug still outstanding. So I decided to quickly fix it. Here's the code for the previous version:


function TitleCaseListOriginal( list, delimiters ) {

    var returnString = "";
    var isFirstLetter = true;
    
    // Loop through each character in list
    for ( i = 1; i LTE Len( list ); i = i + 1 ) {
    
        // Check if character is a delimiter
        if ( Find( Mid(list, i, 1 ), delimiters, 1 ) ) {
            
            //    Add character to variable returnString unchanged
            returnString = returnString & Mid(list, i, 1 );
            isFirstLetter = true;
                
        } else {
        
            if ( isFirstLetter ) {
            
                // Uppercase
                 returnString = returnString & UCase(Mid(list, i, 1 ) );
                isFirstLetter = false;
                    
            } else {
                
                // Lowercase
                returnString = returnString & LCase(Mid(list, i, 1 ) );
                
            }
            
        }
        
    }
    
    return returnString;
}

The two bugs are:

  • the delimiters argument is supposedly optional, but it's actually not defaulted, so it's required;
  • i is not VARed.
Both easy mistakes to make.

I fixed those, but then I started to look at the code. It'd been written with CF5 in mind, but the comments said it was only stable on MX (with no qualified version, so I presume 6.1. No-one used 6.0, right. Right?). Anyway, those're way too old versions to still be writing code for, so I decided to at least upgrade it to use CF9-ready code, things like:

  • a proper function definition, with return type and argument types (and requiredness/optionality of arguments!);
  • ++, &= operators
Stuff like that.

I also put some early-continues in to flatten out the branching logic:

string function titleCaseListModern(required string list, string delimiters=" ") {
    var returnString = "";
    var isFirstLetter = true;
    
    for (var i=1; i <= len(list); i++) {
        if (find(mid(list, i, 1), delimiters, 1)) {
            returnString &= mid(list, i, 1);
            isFirstLetter = true;
            continue;
        }        
        if (isFirstLetter) {
            returnString &= uCase(mid(list, i, 1));
            isFirstLetter = false;
            continue;
        }
        returnString &= lCase(mid(list, i, 1));
    }
    return returnString;
}

(I also ditched the comments which were simply stating the obvious). So that's far more succinct.

But then I came to test it. It worked for the single sample in CFLib, but that's hardly "unit testing", so I started to look at the code to see what I'd need to test (and, yes, I know I should have done the tests first. Shush). Initially I went "how is this thing actually working?", and then after a while of not quite getting it, went "ballocks to that". And then I thought... it's just a variation on the perennial "capFirst()" function, which should be doable with a regex.

Initially I thought "OK, so what I need to do is do a look-behind for either the start of the string or one of the delims, and then capitalise the next character." CFML regexes don't support look-behinds (nor do a coupla testing engines I looked at, actually), so I decided to do it with Java:

string function titleCaseListJava(required string list, string delimiters=" ") {
    return createObject("java", "java.util.regex.Pattern").compile("(?<=^|[#delimiters#])(\w)").matcher(list).replaceAll("\u\1");
}

That's well cool, but... it doesn't work. My test case is this:

<cfset myString = "a.christopher lynch-smith">

<cfoutput>
Before: #myString#<br>
After: #TitleCaseListOriginal(myString, ".- ")#<br>
After: #titleCaseListModern(myString, ".- ")#<br>
After: #titleCaseListJava(myString, "-. ")#<br>
</cfoutput>

And the results are:

Before: a.christopher lynch-smith
After: A.Christopher Lynch-Smith
After: A.Christopher Lynch-Smith
After: u1.u1hristopher u1ynch-u1mith


I was pleased and disappointed both, here. I was pleased because I wrote the look-behind pattern without reference, and it worked first time (this is only like the second or third time ever I've used one), so I am figuring I might be getting confident with regular expression patterns now.

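But I was disappointed because the case-control replacement didn't work. Then I remembered I knew that this didn't work in Java. Well: not "didn't work", just "isn't implemented". The \u case-conversion escape is a vagary of CFML's Apache ORO Perl-compatible regexes. Java's replacement strings have no case-conversion escapes at all (and they reference groups as $1, not \1: a backslash there just makes the next character literal, which is where the "u1" comes from). Dammit.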

So CF can't do look-behinds... Java can't do case conversion. Grumble.
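
For the record, the case conversion is still doable on the Java side, just not inside the replacement string. What follows is only a minimal sketch in plain Java, assuming the same look-behind pattern as above and assuming the delimiters are already safe inside [...] (e.g. "-. " with the hyphen first). The class and method names are made up for illustration. It walks the matches itself and upper-cases each captured character via Matcher.appendReplacement():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleCaseLookBehind {

    // Assumption: delimiters is already valid inside a character class, e.g. "-. "
    static String titleCaseList(String list, String delimiters) {
        Pattern pattern = Pattern.compile("(?<=^|[" + delimiters + "])(\\w)");
        Matcher matcher = pattern.matcher(list);
        StringBuffer result = new StringBuffer();

        while (matcher.find()) {
            // Do the case conversion in code, as the replacement string can't do it
            String upper = matcher.group(1).toUpperCase();
            matcher.appendReplacement(result, Matcher.quoteReplacement(upper));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // prints "A.Christopher Lynch-Smith"
        System.out.println(titleCaseList("a.christopher lynch-smith", "-. "));
    }
}
```

Like the regex versions in this article, this only upper-cases the first letter of each item; it does not lower-case the rest the way the original loop did.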

I scratched my head for a bit and decided I could perhaps make it all work with just standard subexpression captures, and came up with something like this:

string function titleCaseListDraft(required string list, string delimiters=" ") {
    return reReplace(list, "(^|[#delimiters#])(\w)", "\1\u\2", "ALL");
}

This worked well in my test rig, but I quickly discovered a failing in how I was handling the character set... The original "delimiters" argument was ".- ". If I put this straight into a regex pattern, I get an error:

Malformed regular expression "(^|[.- ])(/w)".

Reason: Invalid [] range in expression..


This is true. Because of the -, the regex engine sees [.- ] as a range between dot and space. However dot comes after space in ASCII, so that's invalid. And not what I want, either. And also there are other cases in which valid delimiter characters would be invalid in a pattern, so this approach wouldn't float.
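
The range problem is easy to reproduce outside CFML. This is a small sketch using Java's regex engine (not the ORO one CFML uses, but it treats this case the same way): a hyphen between two characters is read as a range, whereas a hyphen at the start or end of the set is a literal. The class and helper names are just illustrative.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class CharClassRangeDemo {

    // Returns true if the pattern compiles, false if the engine rejects it
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("[.- ]")); // false: range from dot (46) down to space (32)
        System.out.println(compiles("[-. ]")); // true: leading hyphen is a literal
        System.out.println(compiles("[. -]")); // true: trailing hyphen is a literal
    }
}
```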

ColdFusion 10 has a function reEscape() which might have worked here, but I'm writing this to CF9, so that's no help. I also suspect it'd not help anyhow, as it's not quite designed for what I need here.

More head scratching (and asking on IRC).

Then I thought I'd cracked it. I've ended up using a construct I usually pooh-pooh, but it seems a good fit for this situation.

People sometimes write their character sets like this:

[c|a|t]

But the OR is implicit in a character set (and a | in there is just a literal pipe), so I generally suggest getting rid.

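However in my case, if I simply represent each character of the delimiter string separated with an OR, then the hyphen no longer sits between the dot and the space, so [.|-| ] compiles without the range error. (One caveat: the |-| in there is still read as a range, from pipe to pipe, so a hyphen delimiter drops out of the set and a pipe sneaks in. That wants a test case of its own.) So now I have this: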

string function titleCaseList(required string list, string delimiters=" ") {
    var regexSafeDelims =  reReplace(delimiters, "(.)(?=.)", "\1|", "ALL");
    return reReplace(list, "(^|[#regexSafeDelims#])(\w)", "\1\u\2", "ALL");
}
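
It's worth checking what a pipe-separated set like [.|-| ] actually matches. The sketch below uses Java's regex engine, on the assumption that it behaves like CFML's for this case. It also shows one possible alternative, which is not what the UDF above does: backslash-escape every delimiter that is not a letter or digit before dropping it into the set. escapeForCharClass() and inSet() are hypothetical helpers for illustration only.

```java
import java.util.regex.Pattern;

final class DelimiterClassCheck {

    // Hypothetical helper: backslash-escape anything that is not a letter or digit,
    // so characters like - ] ^ \ lose their special meaning inside [...]
    static String escapeForCharClass(String delimiters) {
        StringBuilder escaped = new StringBuilder();
        for (char c : delimiters.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    // True if the single-character string s is matched by [charClassBody]
    static boolean inSet(String charClassBody, String s) {
        return Pattern.matches("[" + charClassBody + "]", s);
    }

    public static void main(String[] args) {
        String piped = ".|-| ";                       // what the lookahead replace yields for ".- "
        System.out.println(inSet(piped, "."));        // true
        System.out.println(inSet(piped, "-"));        // false: |-| is a range from pipe to pipe
        System.out.println(inSet(piped, "|"));        // true: the pipe is a literal in a set

        String escaped = escapeForCharClass(".- ");
        System.out.println(inSet(escaped, "-"));      // true
        System.out.println(inSet(escaped, "|"));      // false
    }
}
```

The escaping route has its own assumption baked in: the engine has to accept a backslash before any non-alphanumeric character, which Java's does.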

Two lines of code instead of 30-odd.

I quite like that... it seems to work... and I definitely learned more about regex patterns today.

Win!

Now it's 8:15pm and I'm still in the office. I'll leave you here and get cracking, home.

Righto.

--
Adam