Tuesday 15 October 2013

CFCamp: Quick unicode regex code snippet

G'day:
A question just came up in Kai's "Regular Expression Clinic" at CFCamp about unicode support in regular expressions. This is important in countries like Germany where they have a lot of non-ASCII characters in regular use in everyday words, eg: letters with umlauts above them ("ö"), or double-S characters ("ß"), etc. I didn't quite know the answer, so I had a look and came up with some code...

Firstly: ColdFusion's inbuilt regex engine is too old to support unicode, so that's a non-starter.

However Java's Pattern class does have unicode support, and as of Java 7 that includes binary properties such as \p{IsAlphabetic}.

Here's some quick code that demonstrates this:

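// tell the compiler the source file is UTF-8, so the literal below is read correctly
processingdirective pageencoding="UTF-8";
s = "löwenbräu";

// a CFML string is a java.lang.String under the hood, so we can call its matches() method
asciiResult = s.matches("[a-z]{9}"); // ASCII letters only
unicodeResult = s.matches("\p{IsAlphabetic}{9}"); // any alphabetic character

writeDump([asciiResult,unicodeResult]);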

Note that there are two non-ASCII characters in there. This code outputs:

Array
1    boolean    false
2    boolean    true

(I've only got Railo running at the mo').

So the vanilla ASCII pattern is not matched, but the unicode-aware pattern does indeed work.
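
For anyone who wants to try the same thing outside CFML, here's a rough sketch of the equivalent in plain Java, using Pattern directly. The class name UnicodeRegexDemo and the helper fullyMatches() are just names I've made up for illustration, and the third pattern (\p{L}, the "letter" general category) is an extra I've thrown in for comparison: it's not part of the CFML test above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the names here are illustrative only
class UnicodeRegexDemo {

    // true only if the ENTIRE input matches the regex (same semantics as String.matches())
    static boolean fullyMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "löwenbräu", written with escapes so the source file's encoding doesn't matter
        String s = "l\u00F6wenbr\u00E4u";

        // ASCII-only character class: the ö and ä are not in a-z
        System.out.println(fullyMatches("[a-z]{9}", s));

        // unicode binary property (needs Java 7 or later)
        System.out.println(fullyMatches("\\p{IsAlphabetic}{9}", s));

        // unicode general category "Letter" (assumption: added for comparison only)
        System.out.println(fullyMatches("\\p{L}{9}", s));
    }
}
```

Note the doubled backslashes: in a Java string literal the backslash needs escaping, whereas in the CFML version it doesn't.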

That's all I've got on this: I just wanted to post something to answer the question for the bloke in the audience. But if you're interested in the topic, I've got a bunch of articles tagged as "Regular Expressions": check 'em out.

--
Adam