Monday 1 April 2013

Regular expressions in CFML (part 10: Java support for regular expressions (2/3))

G'day:

Mon 18 March 2013
Once again - for reasons better described over a few pints at the pub rather than in a blog article - I find myself winging southwards from London to Auckland. This time not for a holiday stint of a coupla weeks, but for over a month whilst I finalise some paperwork which requires me to be in NZ. Or more to the point: requires me to be not in the UK. I left the safety of London today having read in the news this morning that Auckland had two earthquakes yesterday. Only baby ones, but still... I think NZers are a bit hesitant about their nation living up to its nickname of the Shaky Isles these days after Christchurch was flattened a coupla years ago. And Auckland doesn't traditionally have earthquakes, so to get any - even if small ones - is a bit concerning. In the back of my mind I am (melodramatically, and unnecessarily) concerned as to whether there'll still be something to land on when I get there. So, yeah... I am hoping these earthquakes in Auckland stay around the 3 / 4 level on the MMS.

None of this has anything to do with regular expressions in Java, sorry.


Matcher objects

OK, so last time I covered Java's regex support via the String and the Pattern classes, and ended up with my version of a "cliffhanger", mentioning the Matcher class, but leaving it at that. So this article is about the Matcher class.

A Matcher is used to process multiple sequential matches within a string. And it has a bunch of methods which implement/effect a variation on that notion.

(Actually it's quite interesting writing this, as I currently have no idea what the Matcher class does, and I forgot to download the Java documentation before jumping on the plane, so I'm just working out what this thing does by trying random code, and inferring the meaning of the results. I'll go back and correct misconceptions and errors and stuff I could not work out before I press "publish", once I get to either Kuala Lumpur or Auckland (depending on how big a job I have...))

find()

The most basic example I can think of that demonstrates how a Matcher can work is via this code:

numberList = "tahi,rua,toru,wha";
regex = "\w+";
pattern = createObject("java", "java.util.regex.Pattern").compile(regex);
matcher = pattern.matcher(numberList);

while (matcher.find()){
    writeOutput(matcher.group() & "<br>");
} 

The Pattern here matches a word (eg: "tahi", "rua", etc), and the Matcher is used to look through the passed-in string (numberList), and extracts each match in turn, returning it, via the find() method.

The output is this:

tahi
rua
toru
wha

Note that the code here uses a loop to repeatedly call find(), and after there's no more to find, stops. find() returns a boolean: true if there was another match in the string, otherwise false. Whilst the loop condition - the find() - returns true, we then use group() to extract the substring matched by the pattern.
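
For comparison, here's roughly what that loop looks like in plain Java rather than CFML. It's only a sketch: the FindLoop class and its allMatches() method are names I've made up for illustration, they're not part of the Java API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoop {
    // Collects every match of regex within input, in the order find() returns them.
    static List<String> allMatches(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {        // find() returns false once nothing more matches
            found.add(matcher.group()); // group() is the text of the match just found
        }
        return found;
    }

    public static void main(String[] args) {
        for (String word : allMatches("tahi,rua,toru,wha", "\\w+")) {
            System.out.println(word);
        }
    }
}
```

The only real difference from the CFML version is that the backslash in the regex needs doubling inside a Java string literal.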

group() / group(int)

Superficially, calling the method that returns the matched substring "group()" might seem less than ideal, until one considers that it's also used to extract each subexpression match ("group", I guess) from the pattern. Consider this code, which is a fairly contrived variation of part of the previous code:


// the regex in this example is \w+,(\w+)

while (matcher.find()){
    writeOutput("Whole group: " & matcher.group() & "<br>");
    writeOutput("group(1): " & matcher.group(javaCast("int", 1)) & "<br>");
} 

(All the stuff before the loop is the same, I've omitted it purely to save space).

In this case, we have this output:

Whole group: tahi,rua
group(1): rua
Whole group: toru,wha
group(1): wha

We're finding two words, and specifically capturing the second one via a subexpression.


There's a coupla things of note here:
  1. I did not output group(0), although there is one. group(0) is the same as group(): the whole match.
  2. I have made a point of using javaCast() here to force the "1" to be an int. It is tragically sad that ColdFusion handles its data-typing so badly that it cannot even decide that 1 is an int, not a string. I get that CF is loosely-typed, but this ought to be implemented in such a way that if I say 1 then CF works out I mean it as a number (even if it got confused and decided it was a float rather than an int, that'd be better than deciding it's a string!), as distinct from "1" which is clearly a string. Equally loose-typefulness should work. If it decides 1 is a string, but then I use it where the code requires an int, it should try to convert it to an int before simply erroring. Because if I don't use javaCast(), CF errors here. Railo, I hasten to add, does not (error, I mean. It works fine). So it's not something intrinsically to do with how loose-typing works. CF simply bites here.
Andrew Scott picked up a potential for code to break here if you shift from Java 6 to Java 7. In Java 6 the only version of group() that takes an argument takes an int (Java 6 only supports numbered sub-expressions). In Java 7 one can also have named sub-expressions, so group() is overloaded to take a string as well, which means a 1 that CF decides is a string could end up being treated as a group name rather than a group number.

This is similar to what one might do with a call to reFind(), and setting the "return subexpressions" argument to true, then looping over the returned arrays. Although this approach is a bit nicer, I think.

groupCount()

I already knew I had only one subexpression here, so could blithely call group(1). If I did not know how many groups there were, I can use the groupCount() method to find out how many groups I've got:

writeOutput("groupCount(): " & matcher.groupCount() & "<br>" );

For my example, this outputs 1.

If I adjust my regular expression slightly, I can see this more comprehensively, and with more robust code:

// the regex in this example is (\w+),(\w+)

while (matcher.find()){
    writeOutput("Whole group: " & matcher.group() & "<br>");
    writeOutput("groupCount(): " & matcher.groupCount() & "<br>" );
    for (i=1; i <= matcher.groupCount(); i++){
        writeOutput("group(#i#): " & matcher.group(i *1 ) & "<br>");
    }
}

I've worked around the javaCast() thing here by multiplying i by 1. Even ColdFusion can work out the result of a multiplication expression is a number, not a string. This outputs:

Whole group: tahi,rua
groupCount(): 2
group(1): tahi
group(2): rua
Whole group: toru,wha
groupCount(): 2
group(1): toru
group(2): wha
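
In plain Java the int-vs-string wrangling goes away, because the loop counter is already an int. Here's a minimal sketch of the same idea (the GroupDump class and describe() method are hypothetical names of mine); unlike the CFML above it starts from 0, so the whole match is listed too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // For each match, lists group(0) (the whole match) followed by each numbered sub-expression.
    static List<String> describe(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> lines = new ArrayList<>();
        while (matcher.find()) {
            // groupCount() does not count group 0, hence the <= in the loop condition
            for (int i = 0; i <= matcher.groupCount(); i++) {
                lines.add("group(" + i + "): " + matcher.group(i));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe("tahi,rua,toru,wha", "(\\w+),(\\w+)")) {
            System.out.println(line);
        }
    }
}
```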

start() / end()

One can get the positions of the groups via the start() and end() methods:

numberList = "tahi,rua,toru,wha";

// show a table of where each character is at, to make the ensuing numbers easier to follow
charRow = "";
indexRow = "";
for (i=0; i < len(numberList); i++){
    charRow &= "<td>#mid(numberList, i+1, 1)#</td>";
    indexRow &= "<td>#i#</td>";
}
writeOutput('<table border="1" cellspacing="0"><tr>#charRow#</tr><tr>#indexRow#</tr></table>');


matcher = createObject("java", "java.util.regex.Pattern").compile("(\w+),(\w+)"). matcher(numberList);

while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.start()#; end: #matcher.end()#)<br>");
    for (i=1; i <= matcher.groupCount(); i++){
        ndx = i * 1;
        writeOutput("Group #ndx#: #matcher.group(ndx)# (start: #matcher.start(ndx)#; end: #matcher.end(ndx)#)<br>");
    }
}

Here I've added in a table to display the index of each character in the string, plus revised some of the other code. I've rolled the call to Pattern into a single statement with chained methods which returns the Matcher without intermediary values. I've also abstracted the integer index into a separate variable ndx instead of repeating i*1 for each place I need it. The output is:


t  a  h  i  ,  r  u  a  ,  t  o  r  u  ,  w  h  a
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Whole group: tahi,rua (start: 0; end: 8)
Group 1: tahi (start: 0; end: 4)
Group 2: rua (start: 5; end: 8)
Whole group: toru,wha (start: 9; end: 17)
Group 1: toru (start: 9; end: 13)
Group 2: wha (start: 14; end: 17)

One thing I find curious in this is that start() returns the zero-based index of the first character of the group, whereas end() returns the zero-based index of the character after the last one in the group (which is also the position at which the next find() operation might start). So the two form a half-open range, much like the arguments to Java's String.substring(), and end() minus start() gives the length of the match. It looks odd at first, and it would make more sense to me if end() returned the index of the last character, kinda like how the method name suggests, but at least all the results use a zero-based index.

Also note that start() / end() work the same as group(): if one doesn't specify an argument, it's the start/end of the whole match; if an integer argument is used, then it's the specific group's start/end. Not shown here is that - also like group - one can use 0 as the argument for start()/end(), and this too references the entire match.
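
One way to convince oneself of how start() and end() fit together is to check that they slot straight into String.substring(). This is a sketch in plain Java, with the Offsets class and spansAgree() method being names I've invented for the purpose. It assumes every group takes part in every match (true for the pattern used here); a group that didn't participate reports -1 for both offsets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Offsets {
    // True if, for every match and every group, substring(start, end) equals group().
    static boolean spansAgree(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        boolean sawMatch = false;
        while (matcher.find()) {
            sawMatch = true;
            for (int i = 0; i <= matcher.groupCount(); i++) {
                // start(i) is inclusive and end(i) is exclusive, exactly as substring() expects
                String viaOffsets = input.substring(matcher.start(i), matcher.end(i));
                if (!viaOffsets.equals(matcher.group(i))) {
                    return false;
                }
            }
        }
        return sawMatch;
    }

    public static void main(String[] args) {
        System.out.println(spansAgree("tahi,rua,toru,wha", "(\\w+),(\\w+)"));
    }
}
```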

region() / regionStart() / regionEnd()

The region() method tells the Matcher where in the string it should be finding matches. This demonstrates it:

The regex for this example is (\w+)(?=,)

while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.regionStart()#; end: #matcher.regionEnd()#)<br>");
}
writeOutput("<hr>");

matcher.region(5, 14);
while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.regionStart()#; end: #matcher.regionEnd()#)<br>");
}

This time the regular expression pattern is looking for a sequence of word characters, looking ahead to see they're followed by a comma, but not actually capturing the comma in the returned match. The output is:


t  a  h  i  ,  r  u  a  ,  t  o  r  u  ,  w  h  a
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

Whole group: tahi (start: 0; end: 17)
Whole group: rua (start: 0; end: 17)
Whole group: toru (start: 0; end: 17)

Whole group: rua (start: 5; end: 14)
Whole group: toru (start: 5; end: 14)

Note that when we set the region to lie between 5 and 14, the Matcher only finds rua and toru, because tahi lies outside the specified region. (wha was never matched in the first place: there's no comma after it for the look-ahead to find.)
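
Here's the same experiment as a plain Java sketch, in case it's easier to poke at that way. RegionDemo and matchesInRegion() are hypothetical names of mine; region(), find() and group() are the real Matcher methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Finds all matches of regex, but only within input[start, end).
    static List<String> matchesInRegion(String input, String regex, int start, int end) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        matcher.region(start, end); // start is inclusive, end is exclusive
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String numberList = "tahi,rua,toru,wha";
        String wordBeforeComma = "(\\w+)(?=,)";
        System.out.println(matchesInRegion(numberList, wordBeforeComma, 0, numberList.length()));
        System.out.println(matchesInRegion(numberList, wordBeforeComma, 5, 14));
    }
}
```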

useTransparentBounds()

This is getting to the point at which I am needing to guess about exactly what's going on a bit. I've read all the docs for the Matcher class, but that was a coupla weeks ago, and I've never actually used any of this until writing this article, so it's definitely untrod ground here. It'll be interesting to see how much of it I get right!

One thing that seems quite cool with these region boundaries is that they've got a sense of "transparency". Looking at the previous example, the Matcher only returned groups that had the entire pattern matched within the region. This sounds like stating the obvious, but consider this tweak to the region I used (this is using the same regular expression pattern as last time):

matcher.region(5, 13) ;

while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.regionStart()#; end: #matcher.regionEnd()#)<br>") ;
}

All I've done is set the end of the boundary right after "toru", not including the comma. So predictably the look-ahead doesn't see the comma, and "toru" is not matched, so we only get this result:

Whole group: rua (start: 5; end: 13)

However really we were wanting words, and the look-ahead for the comma was just verifying one was there. It wasn't ever going to be part of the group returned by the Matcher, so it could be argued that "toru" should be returned as well, if only the Matcher could "see" the comma for the look-ahead.

Well we can make the region boundaries "transparent", so the Matcher can see anything it needs to look-ahead for. Here's a further tweak to that code. Again, it's the same regex pattern:

matcher.region(5, 13) ;
matcher.useTransparentBounds(1);

while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.regionStart()#; end: #matcher.regionEnd()#)<br>");
}

This outputs:

Whole group: rua (start: 5; end: 13)
Whole group: toru (start: 5; end: 13)

The Matcher could see the comma lying beyond the specified region, so it considers "toru" to be a match.

Just to demonstrate this is not the same as extending the region, consider this variation:

matcher.region(5, 12) ;
matcher.useTransparentBounds(1) ;

while (matcher.find()){
    writeOutput("Whole group: #matcher.group()# (start: #matcher.regionStart()#; end: #matcher.regionEnd()#)<br>") ;
}

Here I've pulled the region back so that the "u" of "toru" is after the end of it, so  the look-ahead - looking for just a comma - won't match, and I'm back to only matching "rua":

Whole group: rua (start: 5; end: 12)

Conversely, I could adjust the pattern we're matching to be this: (\w+)(?=\w*,), and now the results would be like this:

Whole group: rua (start: 5; end: 12)
Whole group: tor (start: 5; end: 12)

Here we've adjusted the look-ahead to check for zero or more word characters followed by a comma, so we get a match thus:

tahi,[rua,tor]u,wha

Here the square brackets mark the region (indexes 5 to 11). The second matched group is "tor", which sits inside the region; the look-ahead then matches the "u," which lies outside the region.
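
To pull the transparent-bounds variations together in one place, here's a plain Java sketch that runs the same pattern with the setting on and off. The TransparentBoundsDemo class and its method are names I've made up; useTransparentBounds() takes a proper boolean in Java, rather than the 1 I passed from CFML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TransparentBoundsDemo {
    static final String NUMBER_LIST = "tahi,rua,toru,wha";

    // Matches regex within NUMBER_LIST[start, end), optionally letting
    // look-aheads and boundary checks see text outside the region.
    static List<String> matchesInRegion(String regex, int start, int end, boolean transparent) {
        Matcher matcher = Pattern.compile(regex).matcher(NUMBER_LIST);
        matcher.region(start, end);
        matcher.useTransparentBounds(transparent);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matchesInRegion("(\\w+)(?=,)", 5, 13, false));
        System.out.println(matchesInRegion("(\\w+)(?=,)", 5, 13, true));
        System.out.println(matchesInRegion("(\\w+)(?=\\w*,)", 5, 12, true));
    }
}
```

Note that the matched text itself never extends past the region: only the look-ahead gets to peek beyond it.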

reset()

Using the reset method, one can reset the pointer for where the next find() operation will start back to the beginning of the string. It also seems to clear any region that had previously been set too. Here's some code:

numberList = "tahi,rua,toru,wha";

matcher = createObject("java", "java.util.regex.Pattern").compile("\b\w+\b").matcher(numberList);

matcher.region(5, 13);

matcher.find();
writeOutput("Matched: #matcher.group()#<br>");
writeOutput("<hr>");

matcher.reset();

matcher.find();
writeOutput("Matched: #matcher.group()#<br>");

And this outputs:

Matched: rua

Matched: tahi

Initially for the first find(), we're looking within the region we'd set, so we find "rua". After the reset() we do another find(), and not only has the position to start finding at been reset, but the region is gone too. It might have been nice to only reset the start position within the region (leaving the region intact). Oh well, it's easy enough to work around.

One can also reset the string being inspected for matches. For example if we tweak the code above to specify this reset() method call:

matcher.reset("one,two,three,four");

Then the output is:

Matched: rua

Matched: one

So not only has the index point to start find()-ing from been reset, and the region removed; we're now looking in a completely different string.
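
As for resetting the search position whilst keeping the region: one possible work-around is to re-apply the current region, because region() itself resets the Matcher before setting the new limits. The sketch below (plain Java; the ResetDemo class and its methods are hypothetical names of mine) shows that alongside the two reset() variants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {
    // Returns the next match, or null if find() fails.
    static String next(Matcher matcher) {
        return matcher.find() ? matcher.group() : null;
    }

    static List<String> run() {
        List<String> seen = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\b\\w+\\b").matcher("tahi,rua,toru,wha");

        matcher.region(5, 13);
        seen.add(next(matcher)); // first word inside the region

        // Work-around: re-applying the same region restarts the search but keeps the region.
        matcher.region(matcher.regionStart(), matcher.regionEnd());
        seen.add(next(matcher));

        matcher.reset();         // search position and region both go back to the defaults
        seen.add(next(matcher));

        matcher.reset("one,two,three,four"); // same again, with a different input string
        seen.add(next(matcher));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```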

usePattern()

This method allows one to change the Pattern the Matcher is using, on the fly:

numberList = "tahi,rua,toru,wha";

Pattern = createObject("java", "java.util.regex.Pattern");

matcher = Pattern.compile("\b\w{3}\b" ).matcher(numberList);
matcher.find();
writeOutput("Matched: #matcher.group()#<br>") ;
writeOutput("<hr>") ;

matcher.usePattern(Pattern.compile( "\b\w{4}\b"));
matcher.find();
writeOutput("Matched: #matcher.group()#<br>");

The output:

Matched: rua

Matched: toru

Here I'm demonstrating finding a match of a three letter word ("rua"), and then changing the pattern we're using to find a four-letter word. Note that the find() operation continues from where it left off, but looked for the new Pattern.

useAnchoringBounds()

It looks like by default, with a region the ^ and $ anchors mean "the beginning and end of the region", as opposed to the beginning and end of the whole string. One can switch this behaviour off.

numberList = "tahi,rua,toru,wha";

Pattern = createObject("java", "java.util.regex.Pattern");

matcher = Pattern. compile("^.*$").matcher(numberList);

matcher.region(5, 13);

matcher.find();
writeOutput("Matched: #matcher.group()#<br>");
writeOutput("<hr>");

matcher.reset();
matcher.region(5, 13);

matcher.useAnchoringBounds(0);
matcher.find();
writeOutput("Matched: #matcher.group()#<br>");

This time we're matching the entirety of the string, from start ("^") to finish ("$"). Output:

Matched: rua,toru



The web site you are accessing has experienced an unexpected error.
Please contact the website administrator.

The following information is meant for the website developer for debugging purposes.
Error Occurred While Processing Request

No match found

The error occurred in C:/ColdFusion10/cfusion/wwwroot/shared/blog/javaRegex/matcher_useAnchoringBounds.cfm: line 19
17 : matcher.useAnchoringBounds(0);
18 : matcher.find();
19 : writeOutput("Matched: #matcher.group()#<br>");


My code is a bit sloppy here in that I am not checking what find() returns, just assuming it'll find something. However having switched useAnchoringBounds() off, the ^ and $ characters no longer match the boundaries of the region, so there's no match. And, accordingly, the error when I try to reference a group that was never find()-ed (!).
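
A less sloppy version would check what find() returns before touching group(). Here's a minimal plain Java sketch of that, with AnchoringDemo and findInRegion() being names I've made up; it returns null rather than erroring when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoringDemo {
    // Looks for ^.*$ within the region 5-13 of the test string,
    // with anchoring bounds switched on or off. Returns null if there is no match.
    static String findInRegion(boolean anchoring) {
        Matcher matcher = Pattern.compile("^.*$").matcher("tahi,rua,toru,wha");
        matcher.region(5, 13);
        matcher.useAnchoringBounds(anchoring);
        // Only call group() once find() has reported a match
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("anchoring on:  " + findInRegion(true));
        System.out.println("anchoring off: " + findInRegion(false));
    }
}
```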

hasTransparentBounds() / hasAnchoringBounds()

These two return the current state of their respective settings, eg:

matcher.useAnchoringBounds(0);
writeOutput("hasAnchoringBounds: #matcher.hasAnchoringBounds()#<br>");
matcher.useAnchoringBounds(1);
writeOutput("hasAnchoringBounds: #matcher.hasAnchoringBounds()#<br>");

matcher.useTransparentBounds(0);
writeOutput("hasTransparentBounds: #matcher.hasTransparentBounds()#<br>");
matcher.useTransparentBounds(1);
writeOutput("hasTransparentBounds: #matcher.hasTransparentBounds()#<br>");

Output:

hasAnchoringBounds: NO
hasAnchoringBounds: YES
hasTransparentBounds: NO
hasTransparentBounds: YES

There's not much else to say about this. Except I'm sure as hell those Java methods are not returning "YES" and "NO", they are returning true and false. I can abide by ColdFusion's poor decision of returning YES/NO for its own boolean functions, but it should not be messing with the return values from Java methods. Railo does the right thing here: the values bubbled back to CFML are true and false. As they were intended.

hitEnd()

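This reports whether the last match operation hit the end of the input whilst it was searching, so it can serve as a rough check for whether there's anything more to find. It's the sort of check I should have used in that code further up that errored (although checking what find() returned would have been better still). Here's a quick example: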

numberList = "tahi,rua,toru,wha";

matcher = createObject("java", "java.util.regex.Pattern").compile("\w+").matcher(numberList);

while (!matcher.hitEnd()){
    matcher.find();
    writeOutput(matcher.group() & "<br>");
}

Output:

tahi
rua
toru
wha

You'll recall previously I'd been controlling the loop like this:

numberList = "tahi,rua,toru,wha";

matcher = createObject("java", "java.util.regex.Pattern").compile("\w+").matcher(numberList);

while (matcher.find()){
    writeOutput(matcher.group() & "<br>");
}

For this input the two loops produce the same output, but only because the last match ("wha") runs right up to the end of the string. hitEnd() reports whether the regex engine reached the end of the input during the last match operation, not whether there are more matches to come, so I'd err towards just using find() in this sort of situation. hitEnd() is better suited to situations in which one wants to know whether more input could change the result of the last match.
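
To show where a hitEnd()-controlled loop can come unstuck, here's a plain Java sketch comparing the two approaches on a string that ends with a delimiter. HitEndDemo and its two methods are hypothetical names of mine. With "tahi,rua," the match for "rua" stops at the comma without reaching the end of the input, so hitEnd() is still false, the next find() fails, and group() then throws.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitEndDemo {
    // Loop controlled by hitEnd(): assumes find() succeeds whenever hitEnd() is still false.
    static List<String> hitEndLoop(String input) {
        Matcher matcher = Pattern.compile("\\w+").matcher(input);
        List<String> found = new ArrayList<>();
        while (!matcher.hitEnd()) {
            matcher.find();
            found.add(matcher.group()); // throws IllegalStateException if that find() failed
        }
        return found;
    }

    // Loop controlled by find(): group() is only called after a successful match.
    static List<String> findLoop(String input) {
        Matcher matcher = Pattern.compile("\\w+").matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findLoop("tahi,rua,"));
        try {
            System.out.println(hitEndLoop("tahi,rua,"));
        } catch (IllegalStateException e) {
            System.out.println("hitEnd() loop failed: " + e.getMessage());
        }
    }
}
```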

toString()

toString() methods are usually pretty dull, but I was interested in what a string version of a Matcher would look like, so I had a look:

matcher = createObject("java", "java.util.regex.Pattern" ). compile("\w+" ). matcher("tahi,rua,toru,wha" );
writeOutput(matcher.toString());

And what we get is this:

java.util.regex.Matcher[pattern=\w+ region=0,17 lastmatch=]

So I guess that'd be quite useful if debugging. I'll tweak the code and do a find(), and see what we get for that "lastmatch":

matcher = createObject("java", "java.util.regex.Pattern" ).compile("\w+" ).matcher("tahi,rua,toru,wha");
matcher.find();
writeOutput(matcher.toString());

Result:

java.util.regex.Matcher[pattern=\w+ region=0,17 lastmatch=tahi]

Cool.

matches()

This one, I have to concede, I cannot work out without consulting the docs. It seems to just return false (well, OK, "NO" by the time ColdFusion finishes with it). I guess it returns true in some situations.

[I find that I actually have got the docs saved locally. Sigh]

OK, so this does a match of the entire pattern to the entire region, as demonstrated below:

Pattern = createObject("java", "java.util.regex.Pattern");
matcher = Pattern. compile(".*\w{3}.*" ).matcher("tahi,rua,toru,wha");
writeOutput("Matches a three-letter word: " & matcher.matches() & "<br>");

matcher.reset();
matcher.region(5, 13) ;      // rua,toru
matcher.usePattern(Pattern.compile( ".*\bt\w+\b.*"));
writeOutput("Matches a word starting with T: " & matcher.matches() & "<br>"); // yep: toru

matcher.reset();
matcher.region(5, 13) ;      // rua,toru
matcher.usePattern(Pattern.compile( ".*\bw\w+\b.*"));
writeOutput("Matches a word starting with W: " & matcher.matches() & "<br>" ); // nope: wha is outside the region

This outputs:

Matches a three-letter word: YES
Matches a word starting with T: YES
Matches a word starting with W: NO


In hindsight, this is the same as the equivalent String method (String.matches()), just with the option of restricting the match to a region.


Monday 1 April 2013
There was a bit of a gap there (two weeks!). I've been tied up with immigration stuff, family stuff, and work stuff. Also transferring all the above from Evernote to a blog article seemed like something that would not be very fun, so I shelved it in favour of easier articles for a while. But it's all transferred now, so on with the show. I've got half-a-dozen or so more methods to talk about.

You know how I said I was learning a lot of this stuff as I went before, because I'd never used a Matcher before, and didn't even have the docs in front of me? Well the only change is that I now have the docs in front of me. I will need to read them before writing this lot up.

appendReplacement() / appendTail()

Bleah. I have to concede I had a coupla beers at lunchtime today, and perhaps that has impacted my ability to read and understand the Javadocs for appendReplacement(). This is what they say:

Implements a non-terminal append-and-replace step.
This method performs the following actions:

  1. It reads characters from the input sequence, starting at the append position, and appends them to the given string buffer. It stops after reading the last character preceding the previous match, that is, the character at index start() - 1.
  2. It appends the given replacement string to the string buffer.
  3. It sets the append position of this matcher to the index of the last character matched, plus one, that is, to end().
I don't know what the hell a "non-terminal append-and-replace step" is, so they kinda lost me right there. But I threw some code together (based on the example in the docs, but CFML-ified), and had a look at what happened:

source    = "[fish] tahi, [fish] rua, [fish] whero, [fish] kikorangi, [fish] mangu, [fish] kikorangi, [fish] tawhito, [fish] hou";
replace = "\[fish]";
with    = "ika";

matcher = createObject("java", "java.util.regex.Pattern" ).compile(replace).matcher(source);
result = createObject("java", "java.lang.StringBuffer").init();
writeOutput("<pre>");
writeOutput("Before:     " & source & "<br>");
while (matcher.find()){
    matcher.appendReplacement(result, with);
    writeOutput("In loop:    " & result.toString() & "<br>");
}
writeOutput("After loop: " & result.toString() & "<br>");
matcher.appendTail(result);

writeOutput("End:        " & result.toString() & "<br>");
writeOutput("</pre>");


And the output of this is:

Before:     [fish] tahi, [fish] rua, [fish] whero, [fish] kikorangi, [fish] mangu, [fish] kikorangi, [fish] tawhito, [fish] hou
In loop:    ika
In loop:    ika tahi, ika
In loop:    ika tahi, ika rua, ika
In loop:    ika tahi, ika rua, ika whero, ika
In loop:    ika tahi, ika rua, ika whero, ika kikorangi, ika
In loop:    ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika
In loop:    ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika
In loop:    ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika tawhito, ika
After loop: ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika tawhito, ika
End:        ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika tawhito, ika hou
 
I stared at this for a while understanding the end result, but not really seeing what was going on to get there. But eventually I've kinda got it, and it's not exactly rocket science.

What happens is that the Matcher tracks an append position (which starts at the beginning of the string, and in this example always sits at the same place the next find() operation will start from). When appendReplacement() is called, everything from the current append position up to the start of the match just found by find() is copied across to the StringBuffer, and then the replacement string is also appended to said StringBuffer. At this point the append position is shifted to the end of the match, ready for the next find() / appendReplacement(). Or in short-form: whatever the find() operation skipped over on its way to a match is copied into the StringBuffer as-is, and the bit that was find()-ed (!) is swapped for the replacement.

Here's the second find() / appendReplacement() operation from above, with some colour-coding to help:

Before:     [fish] tahi, [fish] rua, [fish] whero, [fish] kikorangi, [fish] mangu, [fish] kikorangi, [fish] tawhito, [fish] hou
In loop:    ika
In loop:    ika tahi, ika

The first operation swapped out the first "[fish]" for "ika". So at the end of that, both the point at which the next find() operation will start, and the point from which the next appendReplacement() operation will take characters, sit immediately after that first "[fish]" in the source string.

The second find() operation will search forward from there, through " tahi, ", until it finds a match with [fish].

When appendReplacement() is called, it will append the " tahi, " to the StringBuffer, and then, instead of the matched [fish], it will append the replacement ika.

Then it's "rinse and repeat" whilst there are more things being find()-ed (sorry, you need to read that as "found" ;-).

However obviously (?) there's a bit of the string left in the original string which we've not copied across to the StringBuffer yet, and that's the sole purpose of appendTail(). All it does is append the bit from the end of the last find() through to the end of the string.

This all seems a bit crazy to me, but I guess it's the most "logical" way of doing an incremental replace operation.

One last note: yeah, it needs to be a StringBuffer. It won't work with a String (Java or CFML) nor a StringBuilder (this is even with Java 7, so it's odd it won't take a StringBuilder; overloads that accept a StringBuilder only arrived later, in Java 9).
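
Where this incremental approach seems to earn its keep is when the replacement differs per match, which a single replaceAll() call with a fixed replacement string can't do. Here's a plain Java sketch of that idea; it goes a step beyond my example above by numbering each replacement, and the IncrementalReplace class and numberMatches() method are names I've invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IncrementalReplace {
    // Replaces the n-th match of regex with prefix + n, copying everything else across unchanged.
    static String numberMatches(String source, String regex, String prefix) {
        Matcher matcher = Pattern.compile(regex).matcher(source);
        StringBuffer result = new StringBuffer();
        int count = 0;
        while (matcher.find()) {
            count++;
            // Copies the text between the previous match and this one, then the replacement.
            // quoteReplacement() keeps any $ or \ in the prefix literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(prefix + count));
        }
        matcher.appendTail(result); // whatever is left after the last match
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberMatches("[fish] tahi, [fish] rua, [fish] hou", "\\[fish]", "ika"));
    }
}
```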

replaceFirst() / replaceAll()

These two are analogous to the difference between using the scope argument on a CFML replace() operation with either "ONE" or "ALL":

source    = "[fish] tahi, [fish] rua, [fish] whero, [fish] kikorangi, [fish] mangu, [fish] kikorangi, [fish] tawhito, [fish] hou";
replace = "\[fish]";
with    = "ika";

matcher = createObject("java", "java.util.regex.Pattern" ).compile(replace).matcher(source);

writeOutput("replaceFirst(): #matcher.replaceFirst(with)#<br />");
writeOutput("replaceAll(): #matcher.replaceAll(with)#<br />");

Results in:

replaceFirst(): ika tahi, [fish] rua, [fish] whero, [fish] kikorangi, [fish] mangu, [fish] kikorangi, [fish] tawhito, [fish] hou
replaceAll(): ika tahi, ika rua, ika whero, ika kikorangi, ika mangu, ika kikorangi, ika tawhito, ika hou

All that's pretty self-explanatory, I reckon.

lookingAt()

This is a bit of a curious one, in my opinion. It'll return true if the pattern is matched at the beginning of the string (or at the beginning of the region, if one has been set):

Pattern = createObject("java", "java.util.regex.Pattern");
source    = "Takurua,Koanga,Raumati,Ngahuru";
writeOutput("Looking in '#source#'<br>");


lookFor = "Takurua";
matcher = Pattern.compile(lookFor).matcher(source);
writeOutput("Looking for '#lookFor#': " & matcher.lookingAt() & "<br>");


lookFor = "Koanga";
matcher.usePattern(Pattern.compile(lookFor));
writeOutput("Looking for '#lookFor#': " & matcher.lookingAt() & "<br>");

This outputs:

Looking in 'Takurua,Koanga,Raumati,Ngahuru'
Looking for 'Takurua': YES
Looking for 'Koanga': NO

So note the pattern doesn't have to match the entire string, but it does have to match from the beginning of the string. I dunno about the utility of this method. But there you go.
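
It might help to see lookingAt() side by side with matches() and find(). This plain Java sketch (LookingAtDemo and probe() are hypothetical names of mine) runs all three against the same input, resetting the Matcher in between so each one starts from scratch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtDemo {
    // Reports how the three match operations treat the same regex and input.
    static String probe(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        boolean lookingAt = matcher.lookingAt();     // must match from the start, need not reach the end
        boolean matches = matcher.reset().matches(); // must match the whole input
        boolean find = matcher.reset().find();       // may match anywhere
        return "lookingAt=" + lookingAt + " matches=" + matches + " find=" + find;
    }

    public static void main(String[] args) {
        String source = "Takurua,Koanga,Raumati,Ngahuru";
        System.out.println("Takurua: " + probe(source, "Takurua"));
        System.out.println("Koanga:  " + probe(source, "Koanga"));
    }
}
```

So lookingAt() sits between the other two: anchored at the start like matches(), but not at the end.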

quoteReplacement()

This one threw me for a bit, and I had to google for some examples to work out what it's on about. But it's fairly straightforward. You'll remember that one can escape a Pattern using the quote() method of the Pattern class? Well this is the equivalent for the replacement pattern. Remember that to reference a previously captured sub-expression, one uses the $, eg: $1. And to use a literal $ in the replacement pattern, one escapes it with a \. Well quoteReplacement() simply escapes those two characters (the dollar sign and the backslash) with a backslash. EG:

replacementPattern = "The first sub-expression was: $1, and here's a literal dollar amount: $42.";
escapedPattern = createObject("java", "java.util.regex.Matcher" ).quoteReplacement(replacementPattern);

writeOutput("
replacementPattern: #replacementPattern#<br>
escapedPattern: #escapedPattern#<br>
");

Results in:

replacementPattern: The first sub-expression was: $1, and here's a literal dollar amount: $42.
escapedPattern: The first sub-expression was: \$1, and here's a literal dollar amount: \$42.

That's it.
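
To see why one would bother, here's a plain Java sketch that uses the same replacement text with and without quoting. QuoteReplacementDemo and its two methods are names I've made up; the point is just that an unquoted "$42" is read as a reference to a group that doesn't exist.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Treats replacement as literal text, whatever $ or \ characters it contains.
    static String replaceLiteral(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Passes replacement through untouched, so $n is treated as a group reference.
    static String replaceRaw(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("cost: PRICE", "PRICE", "$42"));
        try {
            System.out.println(replaceRaw("cost: PRICE", "PRICE", "$42"));
        } catch (IndexOutOfBoundsException e) {
            // The pattern has no capturing groups, so the group reference cannot be resolved
            System.out.println("unquoted replacement failed: " + e.getMessage());
        }
    }
}
```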


requireEnd()

Another odd one, the need for which would never have occurred to me until I saw it. Basically it returns a boolean which is true if a match was found but more input could have caused that match to be lost (ie: the match relies on the end of the input being where it is); and false if the match would still have been found had there been more to the string after the bit that was matched. Err... an example might make that more clear:

Pattern = createObject("java", "java.util.regex.Pattern");

source    = "tahi";

match    = "^tahi$";
matcher = Pattern.compile(match).matcher(source);
matcher.find();
writeOutput("#match#: " & matcher.requireEnd() & "<br>");

match    = "^tahi";
matcher = Pattern.compile(match).matcher(source);
matcher.find();
writeOutput("#match#: " & matcher.requireEnd() & "<br>");

^tahi$: YES
^tahi: NO

The first one requires the source string to finish where it did (because the pattern anchors to the end of the string with the $). The latter does not.

This also works within a region (which makes slightly more sense, I guess):

Pattern = createObject("java", "java.util.regex.Pattern");

source    = "aotearoa";
writeOutput("source: #source#<br>");

match    = "\w{3}\b";
matcher = Pattern.compile(match).matcher(source);
matcher.region(3,6);
matcher.find();
writeOutput("
match: #match#<br>
group(): #matcher.group()#<br>
requireEnd(): #matcher.requireEnd()#<br>
<hr>
");

match    = "\w{3}";
matcher = Pattern.compile(match).matcher(source);
matcher.region(3,6);
matcher.find();
writeOutput("
match: #match#<br>
group(): #matcher.group()#<br>
requireEnd(): #matcher.requireEnd()#<br>
");

Output:
source: aotearoa
match: \w{3}\b
group(): ear
requireEnd(): YES

match: \w{3}
group(): ear
requireEnd(): NO

So here you can see the first match only works because the end of the region counts as a word boundary (\b), so if there was more in the region, then the matcher would not find() anything. On the other hand the second one just matches three word characters, so would still match just fine if there was more to the input string. I guess this could be quite handy.
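
Here's the first of those two examples as a plain Java sketch. RequireEndDemo and requireEndAfterFind() are hypothetical names of mine, and the sketch throws if there's no match at all, because requireEnd() only means something after a match operation has run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RequireEndDemo {
    // Runs one find() and reports whether the match depended on the input ending where it did.
    static boolean requireEndAfterFind(String input, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            throw new IllegalStateException("no match for " + regex);
        }
        return matcher.requireEnd();
    }

    public static void main(String[] args) {
        System.out.println("^tahi$: " + requireEndAfterFind("tahi", "^tahi$"));
        System.out.println("^tahi:  " + requireEndAfterFind("tahi", "^tahi"));
    }
}
```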

toMatchResult()

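This method returns a snapshot of the Matcher's current match state, as a MatchResult. In my testing on ColdFusion the object that comes back behaves much like a duplicate() (ie: the ColdFusion function duplicate()) of the Matcher object: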

firstMatcher = createObject("java", "java.util.regex.Pattern").compile("\b\w+\b").matcher("tahi,rua,toru,wha");

firstMatcher.find();
writeOutput("firstMatcher group() after first find(): #firstMatcher.group()#<br>");

secondMatcher = firstMatcher.toMatchResult();
writeOutput("secondMatcher created here<br>");
firstMatcher.find();
writeOutput("firstMatcher group() after second find(): #firstMatcher.group()#<br>");
firstMatcher.find();
writeOutput("firstMatcher group() after third find(): #firstMatcher.group()#<br>");

writeOutput("<hr>");

secondMatcher.find();
writeOutput("secondMatcher group() after first find(): #secondMatcher.group()#<br>");
secondMatcher.find();
writeOutput("secondMatcher group() after second find(): #secondMatcher.group()#<br>");

Output:

firstMatcher group() after first find(): tahi
secondMatcher created here
firstMatcher group() after second find(): rua
firstMatcher group() after third find(): toru

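secondMatcher group() after first find(): rua
secondMatcher group() after second find(): toru

See how the secondMatcher gets created with the same "state" that the firstMatcher currently has. The firstMatcher then continues to match stuff, but when we come to start using the secondMatcher, it's unaffected by the activity performed on the firstMatcher subsequent to secondMatcher being created. One caveat: the documented return type is MatchResult, an interface which only exposes the results of a match (group(), start(), end(), groupCount()) and has no find() method. That find() works on secondMatcher here relies on the object happening to be a Matcher under the hood, which I would not count on across Java versions.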
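
Sticking to what the MatchResult interface actually promises, the snapshot behaviour looks like this in plain Java. It's a sketch only, with SnapshotDemo and run() being names I've made up: the snapshot keeps reporting the match it was taken from, whatever the original Matcher goes on to do.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SnapshotDemo {
    // Takes a snapshot after the first match, moves the Matcher on, and reports both.
    static String run() {
        Matcher matcher = Pattern.compile("\\b\\w+\\b").matcher("tahi,rua,toru,wha");
        matcher.find();                                // first match
        MatchResult snapshot = matcher.toMatchResult();
        matcher.find();                                // the Matcher moves on to the second match
        return "snapshot=" + snapshot.group()
                + " (" + snapshot.start() + "-" + snapshot.end() + ")"
                + " matcher=" + matcher.group();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```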

pattern()

And last but not least, this one just returns the Pattern object that the Matcher is matching on:

originalPattern = createObject("java", "java.util.regex.Pattern").compile("\b\w+\b");
matcher = originalPattern.matcher("tahi,rua,toru,wha");

returnedPattern = matcher.pattern();

writeOutput("Regex pattern returned from originalPattern: " & originalPattern.pattern() & "<br>");
writeOutput("Regex pattern returned from returnedPattern: " & returnedPattern.pattern() & "<br>");

Output:
Regex pattern returned from originalPattern: \b\w+\b
Regex pattern returned from returnedPattern: \b\w+\b

I figured the easiest demonstration of the same pattern being returned as the one used to create the matcher in the first place was to output the regex.


And that - bloody hell - is that. That's all I have to say / investigate / document about Matchers. I have learned a lot about Java regular expressions whilst writing this article. Nice one! I hope you did too... and at the very least bravo for getting this far!

The third part of this investigation into Java regular expressions goes back to the Pattern class, and I look at the differences in the actual regex patterns that Java has, compared to ColdFusion. There's a bunch of good stuff!

Stay tuned...

Update 2014-09-22:

Oh dear. I never got around to doing part 3, and I doubt I ever will now, having mostly moved-on from CFML, I'm afraid. Sorry 'bout that.

--
Adam