Thursday 26 December 2013

Identifying when a regex pattern matches at position zero in a string

G'day:
Sean asked on Twitter today for input into how best to handle a shortfall in some TestBox functionality. The TestBox Jira ticket is this one (it also contains basically what I'm about to say here!): "toThrow() cannot match empty message & cannot match detail". The first of those two problems comes down to the way CFML's reFind() function works.

Consider this code:

// reFind.cfm
param name="URL.input" default="";
pattern = "^\d*$";
match = refind(pattern, URL.input);
writeDump(var=[{input=URL.input},{match=match}]);

The gist of this is that the reFind() call checks to see whether URL.input is a string which comprises (in its entirety) zero or more digits.

Here are the results with a few test inputs:

INPUT: 123
MATCH: 1

INPUT: abc
MATCH: 0

INPUT: 12c
MATCH: 0

INPUT: 1b3
MATCH: 0

So this shows that only solely-numeric values get through: anything else, such as 1b3, returns 0. OK, so far so good.

But hang on. The pattern was for zero or more digits. So zero digits is also legit here. Let's try giving an empty string (which is indeed zero digits) as the input:


INPUT: [empty string]
MATCH: 0

Strictly speaking, that 0 is correct: the match of zero digits occurs at position zero in that empty string, and that's a match. However... reFind() also returns 0 if no match was found. As it turns out, that was not a sensible design decision, because it leaves no way to distinguish between no match and a match at position zero. Oops.



Java is, unsurprisingly, a bit more clever about this than CFML is. Here I'll use the Java regex engine instead:

// matcher.cfm
param name="URL.input" default="";
pattern = createObject("java", "java.util.regex.Pattern").compile("^\d*$");
matcher = pattern.matcher(URL.input);
matches = matcher.matches();
try {
    start = matcher.start();
} catch (any e){
    start = "#e.message# #e.detail#";
}
writeDump(var=[{input=URL.input},{matches=matches},{start=start}]);    

And some test runs:

INPUT: abc
MATCHES: NO
START: No match available

INPUT: [empty string]
MATCHES: YES
START: 0

Java flat-out tells you if there was a match, with the boolean returned from matches(). Then one can use start() to correctly get where in the string the match started.

Writing all that code is typical Java long-windedness, but fortunately the String class short-circuits all this by providing its own matches() method:

// matches.cfm
param name="URL.input" default="";
pattern = "^\d*$";
match = URL.input.matches(pattern);
writeDump(var=[{input=URL.input},{match=match}]);

This just returns a boolean, which is sometimes all one needs for a "does it match?" requirement:

INPUT: abc
MATCH: NO

INPUT: [empty string]
MATCH: YES

Or, indeed, we could use a CFML-only equivalent using isValid(), which works fine here too:

// isValid.cfm
param name="URL.input" default="";
pattern = "^\d*$";
match = isValid("regex", URL.input, pattern);
writeDump(var=[{input=URL.input},{match=match}]);

I'll spare you the dumps. You get the picture.

reMatch() will also return a match of an empty string. So it's just a lack of forethought in reFind() that one needs to be aware of here.

I checked in PHP and Ruby to see how they handle the equivalent to reFind() on a zero-length match, and they both handle it fine:

<?php
// preg_grep.php
// isset() avoids an undefined-index notice when no input is passed
$input = isset($_GET["input"]) ? $_GET["input"] : "";
$pattern = "/^\\d*$/";
$match = preg_match($pattern, $input);
echo "input: " . $input . "<br>";
echo "match: " . $match . "<br>";
?>

Various results:

input: 123
match: 1

input:
match: 1

input: abc
match: 0


# match.rb
input = ARGV[0] || ""
pattern = "^\\d*$"
match = input.match(pattern)
puts "input " + input
# match[0] is the text that actually matched (match.string would just echo the input back)
puts "match " + (match ? match[0] : "match not defined")


C:\webroots\shared\git\blogExamples\misc\regex\zeroLength>ruby match.rb 123
input 123
match 123

C:\webroots\shared\git\blogExamples\misc\regex\zeroLength>ruby match.rb
input
match

C:\webroots\shared\git\blogExamples\misc\regex\zeroLength>ruby match.rb abc
input abc
match match not defined

C:\webroots\shared\git\blogExamples\misc\regex\zeroLength>
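
The Ruby example above shows whether there was a match, but not where it was, which is the job reFind() is trying to do. As a rough sketch of a closer equivalent (match_position is a made-up helper name for illustration, not anything built into Ruby), the =~ operator returns the zero-based index of the match, or nil when there is none, so a match at position zero and no match at all can't be confused with each other:

```ruby
# position.rb
# Sketch only: match_position is a hypothetical helper, not a Ruby built-in.
# It returns the zero-based index where the pattern matches, or nil if it doesn't.
def match_position(input, pattern)
  input =~ Regexp.new(pattern)
end

pattern = "^\\d*$"
["123", "", "abc"].each do |input|
  position = match_position(input, pattern)
  # inspect makes nil and the empty string visible in the output
  puts "input #{input.inspect} position #{position.inspect}"
end
```

Both "123" and the empty string should report position 0, whereas "abc" reports nil. Having a separate "nothing found" value is what sidesteps the ambiguity: 0 stays a perfectly legitimate answer for a position.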

So that's perhaps slightly handy to know about. I have to say I enjoyed reading up on how to write the same code in PHP (grim) and Ruby (quite nice) more than the CFML side of the research. But the grass is always greener, eh?

Anyway, there you go.

--
Adam