Tuesday 5 March 2013

Regular expressions in CFML (part 9: Java support for regular expressions (1/3))

G'day:
Jetlag and stupidity are a fine combination of things. I'm off to Ireland today to see my son, so am currently on the Tube heading to LHR. It's just after 6am, and not being a Saturday morning person, I generally pass the time on this 1.5h journey (which I do every 2-3wks) by dozing and listening to a podcast. Two flaws with this plan are that I am still not readjusted to being back on GMT as my brain thinks I am still in NZ, so I am wide awake, and have been since 4am; secondly I (this is the stupidity part) have left my headphones at home. So... what to do? Well: continue this series of articles on regular expressions (which has been on hiatus, as you might have noticed). I guess it's good cos I'll get about 1h of productive typing during the journey.


The previous eight entries in this series discuss what a regular expression is and the syntax of the regex engine ColdFusion uses, and then move on to the specifics of how CFML implements regular expression technology in the language.
So... in those eight articles I'd covered everything I can think of as far as ColdFusion's support for regular expressions goes, and the next significant consideration regarding regular expressions in CFML is that CF runs atop Java, CFML strings are actually java.lang.Strings, and one can call/use Java classes directly in CFML code. Which is very handy because it gives us a second regex engine to use in our code, and the Java regex engine is far better than the native CF one.

Firstly, I hope you already knew all that stuff I said about using Java directly in CFML? If not, here's a quick example:

numberList = "tahi,rua,toru,wha";
numberArray = numberList.split(",");     // split() is a method of java.lang.String
className = numberArray.getClass().getName();   // getClass() is a method of java.lang.Object, and getName() one of java.lang.Class
writeDump(variables);

// arrayDeleteAt(numberArray, 1); // errors because numberArray is a Java array (which is fixed-length), not a CF array

The output of this is:


struct
CLASSNAME[Ljava.lang.String;
NUMBERARRAY
array
1tahi
2rua
3toru
4wha
NUMBERLISTtahi,rua,toru,wha

Here we're using a number of Java method calls. Firstly we call java.lang.String's split() method, which is kinda like listToArray() in CFML, except it takes a regex rather than a set of chars to split the string into an array. We then use a coupla methods of java.lang.Object and java.lang.Class to extract the name of the class that numberArray is, and we see it's an array of strings. It's important to note that Java methods return Java data types, so whilst the dump here makes numberArray look like a CF array, it's actually not. A CF "array" is actually a java.util.Vector. We can still mostly use a Java array for CFML array operations thanks to CFML's loose typing, but one needs to be mindful that we're still constrained by the actual data type. An example here is that one cannot use arrayDeleteAt() on a Java array, because Java arrays are fixed-length (one cannot add or remove elements, only reassign them).
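For comparison, here's a rough sketch of the same calls in plain Java, which is where the CFML calls end up anyway. The class and method names (SplitDemo, splitNumbers, toResizableList) are hypothetical, just for illustration. A Java array has a fixed length: its elements can be reassigned, but to remove one you need to copy it into something resizable first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Splits a comma-delimited list; String.split() treats its argument as a regex.
    static String[] splitNumbers(String list) {
        return list.split(",");
    }

    // Copies the fixed-length array into a resizable list so elements can be removed.
    static List<String> toResizableList(String[] array) {
        return new ArrayList<>(Arrays.asList(array));
    }

    public static void main(String[] args) {
        String[] numbers = splitNumbers("tahi,rua,toru,wha");
        System.out.println(numbers.getClass().getName()); // [Ljava.lang.String;

        numbers[0] = "one";                 // reassigning an element is fine
        List<String> resizable = toResizableList(numbers);
        resizable.remove(0);                // removal needs a resizable collection
        System.out.println(resizable);      // [rua, toru, wha]
    }
}
```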

Anyway, that was a bit of a digression.

java.lang.String


The most basic regex operations in Java can be performed directly on a Java String. There are a few methods which take a regular expression as an argument. They're all well documented in the Java docs, but I'll give a few quick examples of using them in CFML.

matches()

This checks to see whether the regex passed to matches() matches the string. I'd never used this method before, and assumed it worked like reFind() in that it'd return true if the regex pattern was matched anywhere in the string. This isn't quite right: the regex has to match the entire string. EG:

numbers = "tahi,rua,toru,wha";
hasTwo = numbers.matches(".*(?:two|rua).*");
incorrect = numbers.matches("two|rua");
writeDump(variables);

This outputs:


struct
HASTWOYES
INCORRECTNO
NUMBERStahi,rua,toru,wha

The "incorrect" line was my first attempt here, and I was bamboozled until I actually RTFMed. Hopefully I've saved you some time there.
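If what you actually want is the reFind()-style "does it match anywhere?" behaviour, one option is to go via a Matcher and call find() instead. A minimal sketch in plain Java, with hypothetical helper names (matchesWhole, matchesAnywhere):

```java
import java.util.regex.Pattern;

class MatchesDemo {
    // True only if the regex matches the entire string (String.matches() semantics).
    static boolean matchesWhole(String input, String regex) {
        return input.matches(regex);
    }

    // True if the regex matches anywhere in the string.
    static boolean matchesAnywhere(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String numbers = "tahi,rua,toru,wha";
        System.out.println(matchesWhole(numbers, "two|rua"));         // false
        System.out.println(matchesWhole(numbers, ".*(?:two|rua).*")); // true
        System.out.println(matchesAnywhere(numbers, "two|rua"));      // true
    }
}
```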

split()

You've already seen this one in my first example above. It's a quick way to turn a string into an array, split around a match of the passed-in regex pattern. Other than the example earlier, the other option here is to use the limit argument to limit the size of the resultant array, eg:

numberList = "tahi,rua,toru,wha";
numberArray = numberList.split(",", 3);
writeDump(variables);

Results in:

struct
NUMBERARRAY
array
1tahi
2rua
3toru,wha
NUMBERLISTtahi,rua,toru,wha

Note there are only three elements in the resultant array, with the last element containing everything that was left in the string. The merits of this are dubious to me, giving it only 30sec thought (ie: I cannot see why one would want to do this). Still: there it is.
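One place the limit argument might earn its keep (this is my guess at a use case, not something from the docs) is splitting a key/value pair where the value can itself contain the delimiter. A sketch in plain Java, with a hypothetical splitKeyValue() helper:

```java
class SplitLimitDemo {
    // Splits on the first "=" only, so any later "=" stays in the value.
    static String[] splitKeyValue(String line) {
        return line.split("=", 2);
    }

    public static void main(String[] args) {
        String[] parts = splitKeyValue("query=a=1&b=2");
        System.out.println(parts[0]); // query
        System.out.println(parts[1]); // a=1&b=2
    }
}
```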

replaceFirst() / replaceAll()

These two are pretty self-explanatory, but here's an example:


numberList = "tahi,rua,toru,wha";
first = numberList.replaceFirst( "rua|wha", "-" );
all = numberList.replaceAll( "rua|wha", "-" );
writeDump(variables);

Result:

struct
ALLtahi,-,toru,-
FIRSTtahi,-,toru,wha
NUMBERLISTtahi,rua,toru,wha

Note how first just has the first match of rua|wha replaced; all has both of them replaced.
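One thing worth keeping in mind is that the first argument of both methods is a regex, not a literal string, so metacharacters need escaping. A small plain-Java sketch (the wrapper names replaceFirstMatch and replaceEveryMatch are hypothetical):

```java
class ReplaceDemo {
    // Replaces only the first match of the regex.
    static String replaceFirstMatch(String input, String regex, String replacement) {
        return input.replaceFirst(regex, replacement);
    }

    // Replaces every match of the regex.
    static String replaceEveryMatch(String input, String regex, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        String numberList = "tahi,rua,toru,wha";
        System.out.println(replaceFirstMatch(numberList, "rua|wha", "-")); // tahi,-,toru,wha
        System.out.println(replaceEveryMatch(numberList, "rua|wha", "-")); // tahi,-,toru,-

        // An unescaped "." is a regex wildcard, so it matches every character
        System.out.println(replaceEveryMatch("1.2.3", ".", "-"));   // -----
        System.out.println(replaceEveryMatch("1.2.3", "\\.", "-")); // 1-2-3
    }
}
```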

java.util.regex.Pattern


The main support for regular expression operations in Java is in the java.util.regex.* package. The first relevant class is the Pattern class, which represents a compiled regular expression. The docs for this class are thorough, so I'm not going to repeat them here, I'm just going to show general usage.

The first thing I'll observe - and this will possibly betray my reasonable level of ignorance when it comes to Java - is that the class doesn't have any public constructors; instead one just calls the static compile() method on it, which returns a Pattern object. I would have thought the way of creating a Pattern object would have been like this:

myPattern = createObject("java", "java.util.regex.Pattern").init(myRegex);

But actually it's this:

myPattern = createObject("java", "java.util.regex.Pattern").compile(myRegex);

I'm sure - like I said - it's my ignorance of Java at play here, but I'd like to know what the reason for that approach is if anyone knows.

compile()

As I noted above, this is how one gets one's regex string and turns it into something one can use in Java to perform regex operations with. In CFML a regex is always just a string, and whatever compilation needs doing gets done under the hood for us (which is CF's way: hide the technical fiddly stuff as much as possible). Now I wonder, though, what the overhead of doing this compilation is? I presume Java does this compile process instead of just using the string directly for a reason: I mean bear in mind that there are methods (even in the Pattern class, see below) which still just use a string directly. That might be worth testing out. If there is a significant overhead, it might not be so good the way ColdFusion handles regex patterns.
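I haven't measured the overhead, but the Java docs for Pattern.matches() do say that if a pattern is to be used multiple times, compiling it once and reusing it will be more efficient than invoking that method each time. A sketch of the two approaches in plain Java (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class CompileOnceDemo {
    // Compiled once when the class loads, then reused for every call.
    private static final Pattern TWO = Pattern.compile(".*(?:two|rua).*");

    // Reuses the precompiled Pattern.
    static boolean mentionsTwo(String input) {
        return TWO.matcher(input).matches();
    }

    // Pattern.matches(regex, input) compiles the regex afresh on every call.
    static boolean mentionsTwoRecompiling(String input) {
        return Pattern.matches(".*(?:two|rua).*", input);
    }

    public static void main(String[] args) {
        System.out.println(mentionsTwo("tahi,rua,toru,wha"));            // true
        System.out.println(mentionsTwoRecompiling("tahi,rua,toru,wha")); // true
        System.out.println(mentionsTwo("tahi,toru,wha"));                // false
    }
}
```

Both give the same answers; the only difference is how often the regex string gets compiled.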

Anyway, compile() has two overloaded methods: the one I demonstrate above, and another one (linked to in the subheading above) which also takes an integer representing various optional flags which alter the way the ensuing pattern will work. The flags here are like the ones CF has, as mentioned in my earlier article: stuff like setting the pattern to be case-insensitive, or multi-line, or permitting comments, etc. There's a list of flags in the docs linked to above, I'll not repeat them here.

To use these flags, one adds them together, like this:

ones = "
tahi    ## maori
|
one        ## english
";

pattern = createObject("java", "java.util.regex.Pattern");

onesPattern = pattern.compile(ones, pattern.CASE_INSENSITIVE + pattern.COMMENTS);

onesMatcher = onesPattern.matcher("ONE=tahi");

while (onesMatcher.find()){
    writeOutput(onesMatcher.group() & "<br>");
}

Don't worry so much about the rest of the code: I'll get to all that. I just figured a complete example would contextualise things better.
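For anyone more at home in Java, here's roughly the same example as plain Java. Treat it as a sketch: the OnesDemo class and findOnes() method are hypothetical names, and I've used bitwise OR to combine the flags, which for these values gives the same result as adding them. Note the regex uses a single # here; the doubled ## in the CFML version is just CFML's way of escaping a pound sign in a string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OnesDemo {
    // In COMMENTS mode whitespace is ignored and "#" starts a comment that runs to end of line.
    private static final String ONES =
        "\n" +
        "tahi    # maori\n" +
        "|\n" +
        "one     # english\n";

    // Returns every match of the "ones" pattern found in the input, in order.
    static List<String> findOnes(String input) {
        Pattern onesPattern = Pattern.compile(ONES, Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
        Matcher onesMatcher = onesPattern.matcher(input);
        List<String> matches = new ArrayList<>();
        while (onesMatcher.find()) {
            matches.add(onesMatcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findOnes("ONE=tahi")); // [ONE, tahi]
    }
}
```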

The flags can be added together like that as they all have values that are powers of two, so the sum can be bit-ANDed against each flag to extract it separately again. Here are the values:

pattern = createObject("java", "java.util.regex.Pattern");
flags = [];
for (flag in ["UNIX_LINES", "CASE_INSENSITIVE", "COMMENTS", "MULTILINE", "LITERAL", "DOTALL", "UNICODE_CASE", "CANON_EQ"]){
    arrayAppend(flags, {flag=flag, value=pattern[flag]});
}
writeDump(flags);

output:

array
1
struct
FLAGUNIX_LINES
VALUE1
2
struct
FLAGCASE_INSENSITIVE
VALUE2
3
struct
FLAGCOMMENTS
VALUE4
4
struct
FLAGMULTILINE
VALUE8
5
struct
FLAGLITERAL
VALUE16
6
struct
FLAGDOTALL
VALUE32
7
struct
FLAGUNICODE_CASE
VALUE64
8
struct
FLAGCANON_EQ
VALUE128

(I haven't included the UNICODE_CHARACTER_CLASS flag in this example as it's new to Java 1.7, and I only have 1.6 running on my CF instance here. I presume its value is 256).

And the compile() method returns a Pattern, as alluded to above.

flags()

Speaking of flags, the flags() method returns an int representing which flags are set, eg:

ones = "
tahi    ## maori
|
one        ## english
";

pattern = createObject("java", "java.util.regex.Pattern");

onesPattern = pattern.compile(ones, pattern.CASE_INSENSITIVE + pattern.COMMENTS);

flags = onesPattern.flags();

has = {};

for (flag in ["UNIX_LINES", "CASE_INSENSITIVE", "COMMENTS", "MULTILINE", "LITERAL", "DOTALL", "UNICODE_CASE", "CANON_EQ"]){
    has[flag] = yesNoFormat(bitAnd(flags, pattern[flag]));
}
writeOutput("flags() value: #flags#<br />");
writeDump(has);

Returns:

flags() value: 6
struct
CANON_EQNo
CASE_INSENSITIVEYes
COMMENTSYes
DOTALLNo
LITERALNo
MULTILINENo
UNICODE_CASENo
UNIX_LINESNo
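The same bit-twiddling in plain Java looks something like this. It's a sketch only: FlagsDemo, combined() and hasFlag() are hypothetical names. The UNICODE_CHARACTER_CLASS constant needs Java 7 or later.

```java
import java.util.regex.Pattern;

class FlagsDemo {
    // Combines two single-bit flags with bitwise OR (same result as adding them).
    static int combined() {
        return Pattern.CASE_INSENSITIVE | Pattern.COMMENTS;
    }

    // Tests one flag by bitwise-ANDing it against the pattern's flags() value.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern onesPattern = Pattern.compile("tahi|one", combined());
        System.out.println(onesPattern.flags());                    // 6
        System.out.println(hasFlag(onesPattern, Pattern.COMMENTS)); // true
        System.out.println(hasFlag(onesPattern, Pattern.MULTILINE)); // false
        System.out.println(Pattern.UNICODE_CHARACTER_CLASS);        // 256
    }
}
```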


pattern() / toString() / quote()


The reverse of compile() is pattern(). It returns the original regex string you passed-in to compile():

ones = "
tahi    ## maori
|
one        ## english
";

pattern = createObject("java", "java.util.regex.Pattern");

onesPattern = pattern.compile(ones, pattern.CASE_INSENSITIVE + pattern.COMMENTS);

regexPattern = onesPattern.pattern();

writeOutput('<pre style="tab-size:4">#regexPattern#</pre>');
writeOutput("<hr>");

regexString = onesPattern.toString();

writeOutput('<pre style="tab-size:4">#regexString#</pre>');
writeOutput("<hr>");

regexQuote = pattern.quote(ones);

writeOutput('<pre style="tab-size:4">#regexQuote#</pre>');
writeOutput("<hr>");

Pleasingly, it even preserves the indentation/line breaks:

tahi # maori
|
one # english

tahi # maori
|
one # english

\Q
tahi # maori
|
one # english
\E

Also note that toString() does exactly the same thing.

I've also included quote() in this example. Note it works slightly differently from pattern() and toString(). It doesn't work on the compiled pattern, but instead takes the regular expression string as an argument:


// pattern() acts on the compiled regular expression
regexPattern = onesPattern.pattern();

// quote() takes the regex string as an argument
regexQuote = pattern.quote(ones);

Also note how easy it is to quote a regular expression using Java's engine: all one needs to do is wrap it in \Q and \E (I'll cover the syntactical differences between CF's and Java's regex dialects in a subsequent article). In CF, one needs to escape each individual element of the regex, or - since CF10 - use reEscape():

ones = "
tahi    ## maori
|
one        ## english
";
writeOutput('<pre style="tab-size:4">#reEscape(ones)#</pre>');

Result:
tahi # maori
\|
one # english

(thanks to Ben K for helping me run that code... I don't have CF10 with me at the moment!)
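To show why quoting matters, here's a small plain-Java sketch that searches for a literal string containing a regex metacharacter. The containsLiteral() helper and the sample strings are hypothetical, just for illustration:

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if the literal text occurs in the input; quote() neutralises any metacharacters.
    static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("1+1=2")); // \Q1+1=2\E

        String sentence = "so 1+1=2, then";
        System.out.println(containsLiteral(sentence, "1+1=2")); // true

        // Unquoted, "+" is a quantifier, so the pattern looks for "11=2", "111=2", etc.
        System.out.println(Pattern.compile("1+1=2").matcher(sentence).find()); // false
    }
}
```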

split()

split() works much the same as the equivalent String method I mentioned above; the difference being this acts on the Pattern and takes a String to split, whereas the String version acts on a String and takes a regex string to split on. They're functionally equivalent other than that.
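A quick plain-Java sketch of the Pattern version, with and without the limit argument (PatternSplitDemo, splitAll and splitLimited are hypothetical names):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternSplitDemo {
    // The delimiter regex is compiled once and reused for each split.
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits around every comma.
    static String[] splitAll(String list) {
        return COMMA.split(list);
    }

    // Splits into at most `limit` elements; the last one keeps the remainder.
    static String[] splitLimited(String list, int limit) {
        return COMMA.split(list, limit);
    }

    public static void main(String[] args) {
        String numberList = "tahi,rua,toru,wha";
        System.out.println(Arrays.toString(splitAll(numberList)));        // [tahi, rua, toru, wha]
        System.out.println(Arrays.toString(splitLimited(numberList, 3))); // [tahi, rua, toru,wha]
    }
}
```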

matcher()

Finally we've got the matcher() method. This returns a Matcher, which is the key to more complex regular expression operations such as returning each match from a string which has multiple matches (eg: like how reMatch() works in CFML, except working in a more useful way), doing replacements and the like. However there's a lot to cover on how a Matcher works, so that will be the topic for the next article. This one, I think we can all agree, is long enough already!

Stay tuned...

--
Adam