Adam Cameron's Dev Blog: Regex for simplifying string manipulation logic

Monday, 21 April 2014

G'day:
An interesting blog article fell in front of me this morning: "Capitalization for us Mc’s and Mac’s!", by Brian McGarvie. It mentions a UDF on CFLib.org which handles... well, as per his blog title: capitalising his name as "McGarvie" rather than "Mcgarvie", as other capitalise() functions might do.

The UDF is thus:

function celticMcCaps(lastName) {
    var capLastName = lCase(lastName);
    if (left(lastName,2) eq "Mc") {
        capLastName = uCase(left(lastName,1)) & lCase(mid(lastName,2,1)) & uCase(mid(lastName,3,1)) & lCase(right(lastName,len(lastName)-3));
        return capLastName;
    }
    else if (left(lastName,3) eq "Mac") {
        capLastName = uCase(left(lastName,1)) & lCase(mid(lastName,2,1)) & lCase(mid(lastName,3,1)) & uCase(mid(lastName,4,1)) & lCase(right(lastName,len(lastName)-4));
        return capLastName;
    }
    else if (left(lastName,2) eq "O'") {
        capLastName = uCase(left(lastName,1)) & "'" & uCase(mid(lastName,3,1)) & lCase(right(lastName,len(lastName)-3));
        return capLastName;
    }
    else return lastName;
}

(thanks to Kyle MacNamara for submitting it, btw).

I had a look at that, and thought "that's a lot of logic when all we're doing is string manipulation".

I have to admit I didn't spot the fact it handles the "O'" prefix at first, and very quickly came out with this:

function celticMcCaps(name){
    return reReplaceNoCase(name, "^([M])([a]?c)([a-z])(.*)$", "\U\1\E\L\2\E\U\3\E\L\4\E", "ONE");
}

Which does two-thirds of the trick. Then when writing this article I spotted the "O'" handling, so revised it to this:

function celticMcCapsRevised(name){
    return reReplaceNoCase(name, "^([MO])((?:[a]?c)|')([a-z])(.*)$", "\u\1\L\2\E\u\3\L\4\E", "ONE");
}

The trick to all this is that regular expression replacements can perform case conversion. \u and \l convert the next character to upper and lower case respectively; \U and \L convert all subsequent characters to upper and lower case respectively, until a \E is encountered. So I use \u to upper-case the first letter and the one after the prefix, and \L to lower-case the rest.
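To see each escape in isolation, here is a minimal sketch using plain reReplace(). The sample strings are made up for illustration, and the expected values in the comments are what the documented behaviour of these escapes should produce:

```cfml
// \u upper-cases only the next character of the replacement
writeOutput(reReplace("cameron", "^(.)", "\u\1") & "<br>");                      // expected: Cameron

// \l lower-cases only the next character of the replacement
writeOutput(reReplace("CAMERON", "^(.)", "\l\1") & "<br>");                      // expected: cAMERON

// \U upper-cases everything that follows, up to the \E
writeOutput(reReplace("adam cameron", "^([a-z]+)(.*)$", "\U\1\E\2") & "<br>");   // expected: ADAM cameron

// \L lower-cases everything that follows, up to the \E
writeOutput(reReplace("ADAM CAMERON", "^([A-Z]+)(.*)$", "\L\1\E\2") & "<br>");   // expected: adam CAMERON
```

The backslashes need no doubling here because CFML string literals do not treat the backslash as an escape character.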

Running a comparison test of this against the original suggests it covers the same ground:

writeOutput('<table border="1"><thead><tr><th>Value</th><th>Original function</th><th>Revised function</th></tr></thead><tbody>');
for (name in [
    "cameron",    // control
    "CAMERON",    // control
    "Cameron",    // control
    "Oswald",     // control
    "oswald",     // control
    "OSWALD",     // control
    "McGarvie",   // already OK
    "MacDonald",  // already OK
    "O'Shea",     // already OK
    "Mcgarvie",   // should change
    "Macdonald",  // should change
    "O'shea",     // should change
    "mcgarvie",   // should change
    "macdonald",  // should change
    "o'shea",     // should change
    "MCGARVIE",   // should change
    "MACDONALD",  // should change
    "O'SHEA"      // should change
]){
    writeOutput("<tr><td>#name#</td><td>#celticMcCaps(name)#</td><td>#celticMcCapsRevised(name)#</td></tr>");
}
writeOutput("</tbody></table>");

(I'm in a rush today, so didn't bother with TDD... oops!)

This outputs:

Value	Original function	Revised function
cameron	cameron	cameron
CAMERON	CAMERON	CAMERON
Cameron	Cameron	Cameron
Oswald	Oswald	Oswald
oswald	oswald	oswald
OSWALD	OSWALD	OSWALD
McGarvie	McGarvie	McGarvie
MacDonald	MacDonald	MacDonald
O'Shea	O'Shea	O'Shea
Mcgarvie	McGarvie	McGarvie
Macdonald	MacDonald	MacDonald
O'shea	O'Shea	O'Shea
mcgarvie	McGarvie	McGarvie
macdonald	MacDonald	MacDonald
o'shea	O'Shea	O'Shea
MCGARVIE	McGarvie	McGarvie
MACDONALD	MacDonald	MacDonald
O'SHEA	O'Shea	O'Shea

All good?

This just demonstrates that when one is manipulating text... using regular expressions is probably the place to start, before writing a bunch of string-manipulation logic.

And also - from a TDD perspective - this would cut down the number of tests from four (one for each branch of the logic) to one. Obviously I'd still run an "eyeball" test like the one I wrote above.
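As a rough idea of what that single test could look like, here is a sketch that does not assume any particular test framework: it loops over input/expected pairs and throws on the first mismatch. The testCases array and the "AssertionFailed" exception type are hypothetical names chosen for illustration; the expected values are taken from the results table above.

```cfml
// Copied from the article so the sketch is self-contained
function celticMcCapsRevised(name){
    return reReplaceNoCase(name, "^([MO])((?:[a]?c)|')([a-z])(.*)$", "\u\1\L\2\E\u\3\L\4\E", "ONE");
}

// Hypothetical test data: an array of structs, because struct keys are
// case-insensitive and "mcgarvie" / "MCGARVIE" would collide as keys
testCases = [
    {input="cameron",   expected="cameron"},
    {input="OSWALD",    expected="OSWALD"},
    {input="mcgarvie",  expected="McGarvie"},
    {input="MACDONALD", expected="MacDonald"},
    {input="o'shea",    expected="O'Shea"}
];

for (testCase in testCases) {
    actual = celticMcCapsRevised(testCase.input);
    // compare() is case-sensitive; eq is not, so "Mcgarvie" eq "McGarvie" would wrongly pass
    if (compare(actual, testCase.expected) != 0) {
        throw(type="AssertionFailed", message="Expected #testCase.expected# for #testCase.input# but got #actual#");
    }
}
writeOutput("All cases passed");
```

If this runs on a page that already declares celticMcCapsRevised(), drop the copy at the top; declaring the same function twice in one template is an error. The case-sensitive compare() is the important design choice, since the whole point of the function is letter case.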

Anyway... that's it. Unless anyone spots any shortfalls in the revised approach, I might update the UDF on CFLib.
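One candidate shortfall, offered tentatively: because [MO] and the prefix group are matched independently, the revised pattern also accepts the cross-combinations "Oc", "Oac" and "M'". A surname such as "Ocampo" would therefore come back as "OcAmpo", whereas the original UDF leaves it alone. A possible tightening is to tie each initial to its own prefix with a lookahead. The function name celticMcCapsStrict is hypothetical, and this is an untested sketch that assumes the regex engine supports positive lookahead:

```cfml
function celticMcCapsStrict(name){
    // M must be followed by "c" or "ac"; O must be followed by an apostrophe.
    // The four capture groups line up with those in celticMcCapsRevised(),
    // so the replacement string is unchanged.
    return reReplaceNoCase(name, "^(M(?=a?c)|O(?='))(a?c|')([a-z])(.*)$", "\u\1\L\2\E\u\3\L\4\E", "ONE");
}

writeOutput(celticMcCapsStrict("ocampo") & "<br>");    // no match intended, so the input is returned as-is
writeOutput(celticMcCapsStrict("o'shea") & "<br>");    // intended result: O'Shea
writeOutput(celticMcCapsStrict("MCGARVIE") & "<br>");  // intended result: McGarvie
```

Names like "Mack" or "Macy" would still become "MacK" and "MacY", but that matches the behaviour of the original UDF, so it is a question of scope rather than a regression.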

I'm hoping Peter Boughton reads this and sets me straight about any dodginess in my regex. If you think Ben Nadel knows a thing or two about regular expressions (and, hey, he does), then he seems like a journeyman compared to Peter, who is a true regex guru.

--
Adam