Once again (prev: "Regex for simplifying string manipulation logic") I found myself able to slough off a coupla dozen lines of code in a CFLib.org UDF, by using a regex (well: two) in place of a bunch of looping and branching logic.
There's nothing new in this article, but it's a good real-world demonstration of where regexes can replace logic.
This UDF (titleCaseList()) got on my radar because someone mentioned a bug it had today, and looking at the comments, there was an earlier bug still outstanding. So I decided to quickly fix it. Here's the code for the previous version:
function TitleCaseListOriginal( list, delimiters ) {
    var returnString = "";
    var isFirstLetter = true;
    // Loop through each character in list
    for ( i = 1; i LTE Len( list ); i = i + 1 ) {
        // Check if character is a delimiter
        if ( Find( Mid(list, i, 1 ), delimiters, 1 ) ) {
            // Add character to variable returnString unchanged
            returnString = returnString & Mid(list, i, 1 );
            isFirstLetter = true;
        } else {
            if ( isFirstLetter ) {
                // Uppercase
                returnString = returnString & UCase(Mid(list, i, 1 ) );
                isFirstLetter = false;
            } else {
                // Lowercase
                returnString = returnString & LCase(Mid(list, i, 1 ) );
            }
        }
    }
    return returnString;
}
The two bugs are:
- the delimiters argument is supposedly optional, but it's actually not defaulted, so it's required;
- i is not VARed.
I fixed those, but I started to look at the code. It'd been written with CF5 in mind, but the comments said it was only stable on MX (with no qualified version, so I presume 6.1. No-one used 6.0, right. Right?). Anyway, those're way too old versions to still be writing code for, so I at least decided to upgrade it to use CF9-ready code, things like:
- a proper function definition, with return type and argument types (and requiredness/optionality of arguments!);
- ++, &= operators
I had also put some early-continues in to flatten out the branching logic:
string function titleCaseListModern(required string list, string delimiters=" ") {
    var returnString = "";
    var isFirstLetter = true;
    for (var i=1; i <= len(list); i++) {
        if (find(mid(list, i, 1), delimiters, 1)) {
            returnString &= mid(list, i, 1);
            isFirstLetter = true;
            continue;
        }
        if (isFirstLetter) {
            returnString &= uCase(mid(list, i, 1));
            isFirstLetter = false;
            continue;
        }
        returnString &= lCase(mid(list, i, 1));
    }
    return returnString;
}
(I also ditched the comments which were simply stating the obvious). So that's far more succinct.
But then I came to test it. It worked for the single sample in CFLib, but that's hardly "unit testing", and I started to look at the code to see what I'd need to test (and, yes, I know I should have done the tests first. Shsh), and initially went "how is this thing actually working?" and then after a while of not quite getting it, went "ballocks to that". And then I thought... it's just a variation on the perennial "capFirst()" function which should be doable with a regex.
Initially I thought "OK, so what I need to do is do a look-behind for either the start of the string or one of the delims, and then capitalise the next character". CFML regexes don't support look-behinds (nor do a coupla testing engines I looked at, actually), so I decided to do it with Java:
string function titleCaseListJava(required string list, string delimiters=" ") {
    return createObject("java", "java.util.regex.Pattern").compile("(?<=^|[#delimiters#])(\w)").matcher(list).replaceAll("\u\1");
}
That's well cool, but... it doesn't work. My test case is this:
<cfset myString = "a.christopher lynch-smith">
<cfoutput>
Before: #myString#<br>
After: #TitleCaseListOriginal(myString, ".- ")#<br>
After: #titleCaseListModern(myString, ".- ")#<br>
After: #titleCaseListJava(myString, "-. ")#<br>
</cfoutput>
And the results are:
Before: a.christopher lynch-smith
After: A.Christopher Lynch-Smith
After: A.Christopher Lynch-Smith
After: u1.u1hristopher u1ynch-u1mith
I was pleased and disappointed both, here. I was pleased because I wrote the look-behind pattern without reference, and it worked first time (this is only like the second or third time ever I've used one), so I am figuring I might be getting confident with regular expression patterns now.
But I was disappointed because the case-control replacement didn't work. Then I remembered I knew that this didn't work in Java. Well: not "didn't work", just "isn't implemented". It's a vagary of CFML's Apache Oro PCREs. Java can't do case-conversion in a regex. Dammit.
So CF can't do look-behinds... Java can't do case conversion. Grumble.
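For what it's worth, the look-behind approach can be made to work on the Java side if the upper-casing is done in code rather than in the replacement string. What follows is only a sketch under assumptions: the class name and helper are hypothetical, it is plain Java rather than something wired into the CFML UDF, and it assumes the delimiters are already safe inside a character class (hence the hyphen-first "-. " from the test above).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleCaseLookBehind {
    // Hypothetical helper: upper-cases the first word character found at the
    // start of the string or straight after one of the delimiters.
    // Assumes `delimiters` needs no escaping inside [...], e.g. "-. ".
    static String titleCaseList(String list, String delimiters) {
        Pattern p = Pattern.compile("(?<=^|[" + delimiters + "])\\w");
        Matcher m = p.matcher(list);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Java replacement strings have no \u case operator, so the
            // conversion happens here; quoteReplacement keeps the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: A.Christopher Lynch-Smith
        System.out.println(titleCaseList("a.christopher lynch-smith", "-. "));
    }
}
```

Like the regex versions in this article, this only upper-cases the first letters; it does not lower-case the rest of each word the way the original loop does.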
I scratched my head for a bit and decided I could perhaps make it all work with just standard subexpression captures, and came up with something like this:
string function titleCaseListDraft(required string list, string delimiters=" ") {
    return reReplace(list, "(^|[#delimiters#])(\w)", "\1\u\2", "ALL");
}
This worked well in my test rig, but I quickly discovered a failing in how I was handling the character set... The original "delimiters" argument was ".- ". If I put this straight into a regex pattern, I get an error:
Malformed regular expression "(^|[.- ])(/w)".
Reason: Invalid [] range in expression..
This is true. Because of the -, the regex engine sees [.- ] as a range between dot and space. However dot comes after space in ASCII, so that's invalid. And not what I want, either. And also there are other cases in which valid delimiter characters would be invalid in a pattern, so this approach wouldn't float.
ColdFusion 10 has a function reEscape() which might have worked here, but I'm writing this to CF9, so that's no help. I also suspect it'd not help anyhow, due to it not quite being for what I needed it to be.
More head scratching (and asking on IRC).
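The range problem described above is not specific to CFML. As a rough illustration, assuming Java's java.util.regex as a stand-in (it is a different engine, so the error wording differs), the same class fails to compile with the hyphen in the middle and compiles fine with the hyphen first. The class and method names here are made up for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeDemo {
    // Hypothetical helper: reports whether Java's engine accepts the pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ".- " is read as a range from '.' (0x2E) down to ' ' (0x20): rejected.
        System.out.println(compiles("(^|[.- ])(\\w)")); // false
        // With the hyphen first it is a literal, so the class is valid.
        System.out.println(compiles("(^|[-. ])(\\w)")); // true
    }
}
```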
Then I think I have cracked it. I've ended up using a construct I usually poopoo, but it seems a good fit for this situation.
People sometimes write their character sets like this:
[c|a|t]
But the OR is implicit here, so I generally suggest getting rid.
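A quick way to see why the pipes are redundant, sketched in Java on the assumption that its engine is representative here: inside a character class the pipe is not an OR at all, it is just one more member of the set. The helper name is hypothetical.

```java
import java.util.regex.Pattern;

class PipeInClass {
    // Hypothetical helper: true if the one-character string `ch`
    // is matched by the character class [body].
    static boolean inClass(String body, String ch) {
        return Pattern.matches("[" + body + "]", ch);
    }

    public static void main(String[] args) {
        System.out.println(inClass("cat", "a"));   // true
        System.out.println(inClass("cat", "|"));   // false
        System.out.println(inClass("c|a|t", "a")); // true
        System.out.println(inClass("c|a|t", "|")); // true: the pipe is a member too
    }
}
```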
However in my case, if I simply represent each character of the delimiter string separated with an OR, then this defuses their "special meaning", so [.|-| ] will mean exactly what I want. So now I have this:
string function titleCaseList(required string list, string delimiters=" ") {
    var regexSafeDelims = reReplace(delimiters, "(.)(?=.)", "\1|", "ALL");
    return reReplace(list, "(^|[#regexSafeDelims#])(\w)", "\1\u\2", "ALL");
}
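One thing that may be worth testing before relying on the pipe trick: in Java's engine, at least, the |-| in the middle of [.|-| ] parses as a range from pipe to pipe, so the hyphen stops being a delimiter and the pipe becomes one. Whether CFML's engine reads it the same way is not verified here, so a test string with a hyphen in it would settle that. The sketch below shows the Java behaviour alongside one possible alternative, backslash-escaping every non-word character in the delimiter string. The class and helper names are hypothetical, and the escaping rule is an assumption about Java's syntax, not a claim about reEscape() or CFML.

```java
import java.util.regex.Pattern;

class DelimiterClass {
    // Hypothetical helper: prefixes every non-word character with a backslash
    // so it is taken literally inside a character class. Letters, digits and
    // underscore are left alone, because escaping those can change meaning.
    static String escapeForClass(String delimiters) {
        return delimiters.replaceAll("(\\W)", "\\\\$1");
    }

    // True if the one-character string `ch` is matched by the class [body].
    static boolean inClass(String body, String ch) {
        return Pattern.matches("[" + body + "]", ch);
    }

    public static void main(String[] args) {
        String piped = ".|-| ";                   // what the pipe trick builds from ".- "
        System.out.println(inClass(piped, "."));  // true
        System.out.println(inClass(piped, "-"));  // false: |-| is a one-character range
        System.out.println(inClass(piped, "|"));  // true

        String escaped = escapeForClass(".- ");   // backslash before each character
        System.out.println(inClass(escaped, "-")); // true
        System.out.println(inClass(escaped, "|")); // false
    }
}
```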
Two lines of code instead of 30-odd.
I quite like that... it seems to work... and I definitely learned more about regex patterns today.
Win!
Now it's 8:15pm and I'm still in the office. I'll leave you here and get cracking, home.
Righto.
--
Adam