A question just came up in Kai's "Regular Expression Clinic" at CFCamp about Unicode support in regular expressions. This is important in countries like Germany where they have a lot of non-ASCII characters in regular use in everyday words, e.g. letters with umlauts above them ("ö"), or double-S characters ("ß"), etc. I didn't quite know the answer, so I had a look and came up with some code...
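Firstly: ColdFusion's inbuilt regex engine is too old to support Unicode, so that's a non-starter.
However, Java's Pattern class does support Unicode, and as of Java 7 that includes properties like \p{IsAlphabetic}. A CFML string is a Java string under the hood, so we can call its matches() method directly and get Java's regex engine instead.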
Here's some quick code that demonstrates this:
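processingdirective pageencoding="UTF-8";
s = "löwenbräu";
// nine characters, each from a-z only
asciiResult = s.matches("[a-z]{9}");
// nine characters, each with the Unicode Alphabetic property
unicodeResult = s.matches("\p{IsAlphabetic}{9}");
writeDump([asciiResult,unicodeResult]);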
Note that there are two non-ASCII characters in there. This code outputs:
Array
1 | false
2 | true
(I've only got Railo running at the mo').
So the vanilla ASCII pattern is not matched, but the unicode-aware pattern does indeed work.
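For anyone who wants to poke at this outside of CFML, here's a rough sketch of the same check in plain Java. The class and method names are mine, purely for illustration. Beyond the two patterns above I've also tried \p{L} (the Unicode "letter" category) and \w with the Pattern.UNICODE_CHARACTER_CLASS flag switched on; those two are additions of mine, not something that came up in the session.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the names here are illustrative only.
class UnicodeRegexDemo {
    // "löwenbräu", written with escapes so the source file's encoding doesn't matter.
    static final String S = "l\u00f6wenbr\u00e4u";

    // ASCII-only character class: fails on the two umlauted letters.
    static boolean asciiMatch(String s) {
        return s.matches("[a-z]{9}");
    }

    // Unicode Alphabetic binary property, as used in the CFML above.
    static boolean alphabeticMatch(String s) {
        return s.matches("\\p{IsAlphabetic}{9}");
    }

    // Unicode general category "Letter".
    static boolean letterMatch(String s) {
        return s.matches("\\p{L}{9}");
    }

    // With this flag, predefined classes such as \w follow the Unicode definitions.
    static boolean wordCharMatch(String s) {
        return Pattern.compile("\\w{9}", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s)
                      .matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiMatch(S));      // false
        System.out.println(alphabeticMatch(S)); // true
        System.out.println(letterMatch(S));     // true
        System.out.println(wordCharMatch(S));   // true
    }
}
```

Note the doubled backslashes: Java string literals need them escaped, whereas CFML strings don't.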
That's all I've got on this: I just wanted to post something to answer the question for the bloke in the audience. But if you're interested in the topic, I've got a bunch of articles tagged as "Regular Expressions": check 'em out.
--
Adam