Friday 3 March 2023

TIL: something new about regex processing that made me feel dumb

G'day:

I like to think I'm reasonably confident with my regex usage, indeed have in the past written at length on regex implementation and usage in CFML (summarised here: "Regular expressions in CFML" link summary).

Today one of the denizens of the Working Code Podcast Discord channel - Sean Callahan - popped a question into the "Code Help" subchannel, and discussion ensued. The question was innocuous:

Why does this:

<cfscript>
    str = REReplaceNoCase("AZGRRBCZCIQITYD", ".*", "X", "ALL");
    WriteOutput(str);
</cfscript>

Return a single X? Testing on regexr.org gives me the matches that I would expect, which is any character except line breaks and matches all of them.

I came to the discussion a bit later as I was busy having lunch, drinking beer and reading "The Pragmatic Programmer" at the pub; but clarified a bit: the expectation was that it should return "XXXXXXXXXXXXXXX", not just "X". This is fine, he just needed to tweak his pattern a bit to be "." rather than ".*": one char at a time, not all the chars at once. No mystery there.

However before he clarified I saw he'd mentioned testing the pattern behaviour on https://regexr.com/, and that it behaved differently from CFML with the same pattern. I figured "yeah JS vs CFML, but still, should be the same…", so ran some code in my browser console to verify what he was seeing:

> "AZGRRBCZCIQITYD".replace(/.*/g,"X")
>- 'XX'

"Yeah see a single X… hang on WTF? Two Xs???"

I ran the equivalent code in CFML:

cf-cli>reReplace("AZGRRBCZCIQITYD", ".*", "X", "ALL")
X

Yeah that's what I expect. Now: my natural disposition is to assume CFML is doing something wrong when it differs from other systems, but I figured I should check elsewhere too.

php > echo preg_replace("/.*/", "X", "AZGRRBCZCIQITYD");
XX
Welcome to Node.js v18.14.0.
>  "AZGRRBCZCIQITYD".replace(/.*/g,"X")
'XX'

irb(main):001:0> "AZGRRBCZCIQITYD".gsub(/.*/, "X")
=> "XX"

(That's Ruby)

And back to the ColdFusion REPL to call Java's replaceAll method on that string:

cf-cli>s = "AZGRRBCZCIQITYD";
AZGRRBCZCIQITYD

cf-cli>s.replaceAll(".*", "X");
XX

Finally, thanks to Gavin's suggestion in the comments below, Perl:

Perl> my $s ="AZGRRBCZCIQITYD"
AZGRRBCZCIQITYD

Perl> $s =~ s/.*/X/g
2

Perl> print "$s\n"
XX
1

Perl is the same as the others.

OK so XX is clearly the correct answer, and ColdFusion (and Lucee, I hasten to add) are getting it wrong. But my expectations matches CFML's, so why am I wrong?

Note that if one took the global flag off, then JS worked as I'd expect:

>  "AZGRRBCZCIQITYD".replace(/.*/,"X")
'X'

So it's clearly doing a second iteration, and that's turning up another replacement. But: the whole string has already been replaced. So… erm?

The original regex matches zero-or-more characters. If I change the regex to match one-or-more (which is probably what Sean should have been using in the first place, had he wanted to replace everything with one X), then I get the result I'd expect:

>  "AZGRRBCZCIQITYD".replace(/.+/g,"X")
'X'

So it's not doing two iterations there.

Then I clocked what was going on. After the first iteration matches and replaces all of "AZGRRBCZCIQITYD" with "X", the second iteration in the initial example is… matching the residual empty string! This is why /.*/ matches a second time and, /.+/ doesn't.

This leaves me wondering how it's not still finding that empty string after the second and subsequent iterations though. I mean after matching the empty string the first time, there's still an empty string ready for the next time. And the time after that…

So I thought some more, and the way I've kinda explained it to myself is along these lines. A pseudocode algorithm:

  • For the original "AZGRRBCZCIQITYD":
  • Starts at 0;
  • Matches from 0-15;
  • Replaces with "X".
  • Next iteration:
  • We're resuming at 15, which is different from 0, so do it again;
  • matches from 15-15;
  • replaces;
  • 15 is the same as 15 so we're done here.
  • Exit.

I doubt it's that, but that's a reasonable layperson's read of the situation I think. And I'm kinda happy that I worked through this exercise. All whilst having had three pints, btw ;-)

Righto.

--
Adam