Adam Cameron's Dev Blog: Regular expressions

Friday, 3 March 2023

TIL: something new about regex processing that made me feel dumb

G'day:

I like to think I'm reasonably confident with my regex usage, and indeed I have in the past written at length on regex implementation and usage in CFML (summarised here: "Regular expressions in CFML" link summary).

Today one of the denizens of the Working Code Podcast Discord channel - Sean Callahan - popped a question into the "Code Help" subchannel, and discussion ensued. The question was innocuous:

Why does this:
<cfscript>
    str = REReplaceNoCase("AZGRRBCZCIQITYD", ".*", "X", "ALL");
    WriteOutput(str);
</cfscript>
Return a single X? Testing on regexr.org gives me the matches that I would expect, which is any character except line breaks and matches all of them.

I came to the discussion a bit later as I was busy having lunch, drinking beer and reading "The Pragmatic Programmer" at the pub; but clarified a bit: the expectation was that it should return "XXXXXXXXXXXXXXX", not just "X". This is fine, he just needed to tweak his pattern a bit to be "." rather than ".*": one char at a time, not all the chars at once. No mystery there.

However before he clarified I saw he'd mentioned testing the pattern behaviour on https://regexr.com/, and that it behaved differently from CFML with the same pattern. I figured "yeah JS vs CFML, but still, should be the same…", so ran some code in my browser console to verify what he was seeing:

> "AZGRRBCZCIQITYD".replace(/.*/g,"X")
>- 'XX'

"Yeah see a single X… hang on WTF? Two Xs???"

I ran the equivalent code in CFML:

cf-cli>reReplace("AZGRRBCZCIQITYD", ".*", "X", "ALL")
X

Yeah that's what I expect. Now: my natural disposition is to assume CFML is doing something wrong when it differs from other systems, but I figured I should check elsewhere too.

php > echo preg_replace("/.*/", "X", "AZGRRBCZCIQITYD");
XX

Welcome to Node.js v18.14.0.
>  "AZGRRBCZCIQITYD".replace(/.*/g,"X")
'XX'

irb(main):001:0> "AZGRRBCZCIQITYD".gsub(/.*/, "X")
=> "XX"

(That's Ruby)

And back to the ColdFusion REPL to call Java's replaceAll method on that string:

cf-cli>s = "AZGRRBCZCIQITYD";
AZGRRBCZCIQITYD

cf-cli>s.replaceAll(".*", "X");
XX

Finally, thanks to Gavin's suggestion in the comments below, Perl:

Perl> my $s ="AZGRRBCZCIQITYD"
AZGRRBCZCIQITYD

Perl> $s =~ s/.*/X/g
2

Perl> print "$s\n"
XX
1

Perl is the same as the others.

OK so XX is clearly the correct answer, and ColdFusion (and Lucee, I hasten to add) are getting it wrong. But my expectations match CFML's, so why am I wrong?

Note that if one took the global flag off, then JS worked as I'd expect:

>  "AZGRRBCZCIQITYD".replace(/.*/,"X")
'X'

So it's clearly doing a second iteration, and that's turning up another replacement. But: the whole string has already been replaced. So… erm?

The original regex matches zero-or-more characters. If I change the regex to match one-or-more (which is probably what Sean should have been using in the first place, had he wanted to replace everything with one X), then I get the result I'd expect:

>  "AZGRRBCZCIQITYD".replace(/.+/g,"X")
'X'

So it's not doing two iterations there.

Then I clocked what was going on. After the first iteration matches and replaces all of "AZGRRBCZCIQITYD" with "X", the second iteration in the initial example is… matching the residual empty string! This is why /.*/ matches a second time, and /.+/ doesn't.
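To see that second match rather than just infer it, here's a minimal JavaScript sketch (same language as the browser console test above) that lists every match each pattern finds. The describeMatches() name is made up for the illustration.

```javascript
// List the index and text of every match a global regex finds in the input.
function describeMatches(input, regex) {
    return [...input.matchAll(regex)].map((m) => ({ index: m.index, text: m[0] }));
}

const subject = "AZGRRBCZCIQITYD";

// zero-or-more: the whole string, then an empty match at the very end
console.log(describeMatches(subject, /.*/g));
// [ { index: 0, text: 'AZGRRBCZCIQITYD' }, { index: 15, text: '' } ]

// one-or-more: only the whole string; an empty string can't satisfy .+
console.log(describeMatches(subject, /.+/g));
// [ { index: 0, text: 'AZGRRBCZCIQITYD' } ]
```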

This leaves me wondering how it's not still finding that empty string after the second and subsequent iterations though. I mean after matching the empty string the first time, there's still an empty string ready for the next time. And the time after that…

So I thought some more, and the way I've kinda explained it to myself is along these lines. A pseudocode algorithm:

For the original "AZGRRBCZCIQITYD":
Starts at 0;
Matches from 0-15;
Replaces with "X".
Next iteration:
We're resuming at 15, which is different from 0, so do it again;
matches from 15-15;
replaces;
15 is the same as 15 so we're done here.
Exit.

I doubt it's that, but that's a reasonable layperson's read of the situation I think. And I'm kinda happy that I worked through this exercise. All whilst having had three pints, btw ;-)
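For what it's worth, that layperson's algorithm can be made runnable. This is a sketch of the idea only: replaceAllSketch() is a hypothetical helper, and I'm not claiming any real engine is implemented this way. It tries a match at the current position, and after a zero-width match it steps forward one character so it can't find the same empty string forever.

```javascript
// Hypothetical hand-rolled "replace all", to show why /.*/ yields two replacements.
function replaceAllSketch(input, pattern, replacement) {
    const re = new RegExp(pattern, "y"); // sticky: a match must start exactly at lastIndex
    let out = "";
    let pos = 0;
    // note <= : the position just past the last character is a legitimate place to match
    while (pos <= input.length) {
        re.lastIndex = pos;
        const m = re.exec(input);
        if (m !== null && m[0].length > 0) {
            out += replacement;
            pos += m[0].length; // resume after the matched text
            continue;
        }
        if (m !== null) {
            out += replacement; // zero-width match: replace "nothing"...
        }
        // ...then (or on no match) copy the current character through and step forward
        if (pos < input.length) {
            out += input[pos];
        }
        pos++;
    }
    return out;
}

console.log(replaceAllSketch("AZGRRBCZCIQITYD", ".*", "X")); // XX
console.log(replaceAllSketch("AZGRRBCZCIQITYD", ".+", "X")); // X
```

The key design choice is the forced step forward after an empty match: that's what stops the loop from matching the same empty string over and over.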

Righto.

--
Adam

Tuesday, 1 December 2015

ColdFusion: I learn something about query of query

G'day:
Just a quick one (I'm supposed to be doing Clojure this morning, not CFML). Here's something I did not know about QoQ in CFML. Well: in ColdFusion's implementation of QoQ. Its LIKE statement supports (very limited) regex patterns in its value.

Here's an example:

colours = queryNew("id,en,mi", "integer,varchar,varchar", [
    [1,"red","whero"],
    [2,"orange","karaka"],
    [3,"yellow","kowhai"],
    [4,"green","kakariki"],
    [5,"blue","kikorangi"],
    [6,"indigo","poropango"],
    [10,"violet","papura"]
]);

coloursWithOorU = queryExecute(
    "SELECT * FROM colours WHERE mi LIKE :pattern",
    {pattern={value="%[ou]%"}},
    {dbtype="query"}
);

writeDump(var=coloursWithOorU, format="text", metainfo=false);

And the result:

query

 
[Record # 1] 
en: red 
id: 1 
mi: whero
 
[Record # 2] 
en: yellow 
id: 3 
mi: kowhai
 
[Record # 3] 
en: blue 
id: 5 
mi: kikorangi
 
[Record # 4] 
en: indigo 
id: 6 
mi: poropango
 
[Record # 5] 
en: violet 
id: 10 
mi: papura

Cool. Note this does not work on Lucee.

I dunno what the grammar of the patterns is, but it's not simply standard CFML regex patterning. For example, initially I tried to have a pattern which would match words of six letters or more (ie: .{6}), but that didn't work. I was gonna say "it'd be really grand if Adobe actually documented this stuff", but actually they have! It's right there on the "Query of Queries user guide" page. OK, so the grammar is very limited. Just to what I've shown, basically: single character classes. It doesn't even support repetition modifiers. So it's a bit disappointing that the grammar is so limited, but it's handy nevertheless.
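One way to picture a grammar that small is as a straight translation into a regex. The JavaScript below is only a mental model under my own assumptions (% is any run of characters, _ is any one character, [...] is a character class passed through untouched); likeToRegExp() is a hypothetical helper and not how ColdFusion implements QoQ.

```javascript
// Hypothetical translation of a LIKE pattern (with % _ and [..] classes) into a RegExp.
function likeToRegExp(pattern) {
    let source = "";
    for (let i = 0; i < pattern.length; i++) {
        const ch = pattern[i];
        if (ch === "%") {
            source += ".*"; // any run of characters
        } else if (ch === "_") {
            source += "."; // exactly one character
        } else if (ch === "[" && pattern.indexOf("]", i) !== -1) {
            const close = pattern.indexOf("]", i);
            source += pattern.slice(i, close + 1); // pass the character class through as-is
            i = close;
        } else {
            source += ch.replace(/[.*+?^${}()|\\\]\[]/g, "\\$&"); // everything else is literal
        }
    }
    return new RegExp("^" + source + "$", "s");
}

const maoriColours = ["whero", "karaka", "kowhai", "kakariki", "kikorangi", "poropango", "papura"];
const withOorU = maoriColours.filter((mi) => likeToRegExp("%[ou]%").test(mi));
console.log(withOorU); // [ 'whero', 'kowhai', 'kikorangi', 'poropango', 'papura' ]
```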

Thanks to Tim Brenner on the CFML Slack Channel for bringing this to my attention!

--
Adam

Saturday, 24 January 2015

"Regular expressions in CFML" link summary

G'day:
This is not a very interesting article. I just need a list of links to other articles for my book.

I have just reminded myself I still need to write that last section! Oops.

So, yeah, nothing new here. Sorry.

--
Adam

Thursday, 10 July 2014

Regex help please

G'day:
I'm hoping Peter Boughton or Ben Nadel might see this. Or someone else who is good @ regular expression patterns that I'm unaware of.

Here's the challenge...

Given this string:

Lorem ipsum dolor sit

I want to extract the leading sub-string which is:

no more than n characters long;
breaks at the previous whole word, rather than in the middle of a word;
if no complete single word matches, then matches at least the first word, even if the length of the sub-string is greater than n.

I've come up with this:

// trimToWord.cfm
string function trimToWord(required string string, required numeric index){
    return reReplace(string, "^((?:.{1,#index#}(?=\s|$)\b)|(?:.+?\b)).*", "\1", "ONE");
}

It works, but that regex is a bit hoary.
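For anyone wanting to poke at the pattern outside CFML, here's a rough JavaScript port. It's a sketch under the assumption that \b and the look-ahead behave the same in JS as in CFML's engine for inputs like this one; I've only reasoned it through for the sample string.

```javascript
// Rough JS port of trimToWord(): same pattern, with $1 instead of \1 in the replacement.
function trimToWord(string, index) {
    const pattern = new RegExp(`^((?:.{1,${index}}(?=\\s|$)\\b)|(?:.+?\\b)).*`);
    return string.replace(pattern, "$1");
}

const lorem = "Lorem ipsum dolor sit";
console.log(trimToWord(lorem, 11)); // Lorem ipsum
console.log(trimToWord(lorem, 8));  // Lorem   (backs off to the previous whole word)
console.log(trimToWord(lorem, 3));  // Lorem   (first word, even though it's longer than n)
```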

Here's a visual representation of it (courtesy of regexper.com), by way of explanation:

Anyone fancy improving it for me?

Here's some unit tests to run your suggestions through:

CFML: Regex to the rescue again

G'day:
Once again (prev: "Regex for simplifying string manipulation logic") I found myself able to slough off a coupla dozen lines of code in a CFLib.org UDF, by using a regex (well: two) in place of a bunch of looping and branching logic.

There's nothing new in this article, but a good real-world demonstration of where regexes can replace logic.

This UDF (titleCaseList()) got on my radar because someone mentioned a bug it had today, and looking at the comments, there was an earlier bug still outstanding. So I decided to quickly fix it. Here's the code for the previous version:

Regex for simplifying string manipulation logic

G'day:
An interesting blog article fell in front of me this morning: "Capitalization for us Mc’s and Mac’s!", by Brian McGarvie. It mentions a UDF on CFLib.org which handles... well as per his blog title: capitalising his name as "McGarvie" rather than "Mcgarvie" like other capitalise() functions might do.

Identifying when a regex pattern matches at position zero in a string

G'day:
Sean asked on Twitter today for input into how to best handle a shortfall in some TestBox functionality. The TestBox Jira ticket is this one (it also contains basically what I'm about to say here!): "toThrow() cannot match empty message & cannot match detail". The reason for the first stated problem is because of the way CFML's reFind() function works.

Consider this code:

// reFind.cfm
param name="URL.input" default="";
pattern = "^\d*$";
match = refind(pattern, URL.input);
writeDump(var=[{input=URL.input},{match=match}]);

The gist of this is that the reFind() call checks to see whether URL.input is a string which comprises (in its entirety) zero or more digits.

Here are the results with a few test inputs:

array

struct
INPUT	123

struct
MATCH	1

array

struct
INPUT	abc

struct
MATCH	0

array

struct
INPUT	12c

struct
MATCH	0

array

struct
INPUT	1b3

struct
MATCH	0

So this shows that it only lets solely-numeric values past: values like 1b3 return a zero match. OK, so far so good.

But hang on. The pattern was for zero or more digits. So zero digits is also legit here. Let's try giving an empty string (which is indeed zero digits) as the input:

array

struct
INPUT	[empty string]

struct
MATCH	0

This is correct. The match of zero digits occurs at position zero in that empty string. That's a match. However... reFind() returns 0 if no match was found. As it turns out, this was not sensible behaviour, because it means that there's no way to distinguish between no match, and a match at position zero. Oops.
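For contrast, here's a small JavaScript sketch of how a different "no match" sentinel avoids the ambiguity. JS's search() returns -1 when nothing matches, so a match at the very start of an empty string stays distinguishable. This only illustrates the sentinel problem; it's not a suggestion for how reFind() or TestBox should be changed.

```javascript
// search() returns the zero-based index of the first match, or -1 for "no match".
const digitsOnly = /^\d*$/;

const results = ["123", "abc", ""].map((input) => ({ input, position: input.search(digitsOnly) }));
console.log(results);
// "123" -> 0 (match at the start), "abc" -> -1 (no match), "" -> 0 (zero-length match)
```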

CFCamp: Quick unicode regex code snippet

G'day:
A question just came up in Kai's "Regular Expression Clinic" at CFCamp about unicode support in regular expressions. This is important in countries like Germany where they have a lot of non-ASCII characters in regular use in everyday words, eg: letters with umlauts above them ("ö"), or double-S characters ("ß"), etc. I didn't quite know the answer, so I had a look and came up with some code...

Cool Regex Tool!

G'day:
A real quick one. One of my cronies - Simon Baynes - just pointed me to this cool regex visualising tool. Here's a screen shot of how it analyses a regex:

Regular expressions in CFML (part 10: Java support for regular expressions (2/3))

G'day:

Mon 18 March 2013
Once again - for reasons better described over a few pints at the pub rather than in a blog article - I find myself winging southwards from London to Auckland. This time not for a holiday stint of a coupla weeks, but for over a month whilst I finalise some paperwork which requires me to be in NZ. Or more to the point: requires me to be not in the UK. I left the safety of London today having read in the news this morning that Auckland had two earthquakes yesterday. Only baby ones, but still... I think NZers are a bit hesitant about their nation living up to its nickname of the Shaky Isles these days after Christchurch was flattened a coupla years ago. And Auckland doesn't traditionally have earthquakes, so to get any - even if small ones - is a bit concerning. In the back of my mind I am (melodramatically, and unnecessarily) concerned there's still something to land on when I get there. So, yeah... in the back of my mind I am hoping these earthquakes in Auckland stay around the 3 / 4 level on the MMS.

None of this has anything to do with regular expressions in Java, sorry.

Regular expressions in CFML (part 9: Java support for regular expressions (1/3))

G'day:
Jetlag and stupidity are a fine combination of things. I'm off to Ireland today to see my son, so am currently on the Tube heading to LHR. It's just after 6am, and not being a Saturday morning person, I generally pass the time on this 1.5h journey (which I do every 2-3wks) by dozing and listening to a podcast. Two flaws with this plan are that I am still not readjusted to being back on GMT as my brain thinks I am still in NZ, so I am wide awake, and have been since 4am; secondly I (this is the stupidity part) have left my headphones at home. So... what to do? Well: continue this series of articles on regular expressions (which has been on hiatus, as you might have noticed). I guess it's good cos I'll get about 1h of productive typing during the journey.

Regular expressions in CFML (part 8: the rest of CFML's support for regular expressions)

G'day:

[Copy-and-paste-from-the-previous-article alert!]

The previous ~~six~~seven entries in this series discuss what a regular expression is:

And the syntax of the regex engine ColdFusion uses:

Now I have moved on to discuss the specifics of how CFML implements regular expression technology in the language:

reFind()

This article will cover the rest of the functions / tags in CFML which utilise regular expressions.

Regular expressions in CFML (part 7: `reFind()`)

G'day:
The previous six entries in this series discuss what a regular expression is:

And the syntax of the regex engine ColdFusion uses:

So far, I've not really mentioned CFML in any of my examples: I've just dealt with the regex syntax. In this article (update: and the subsequent few) I'll look at the CFML tags and functions which use regular expressions.

The most basic operation one can perform with a regular expression in CFML is to simply check to see if the regex matches anything in a string. This is done with reFind() and its case-insensitive counterpart reFindNoCase(). These functions work exactly the same except for the case-sensitivity, so I'll just dwell on reFind(). Similarly for reMatch() and reReplace() later on: their ~NoCase() counterparts work the same, so other than identifying they exist, I'll comment no further on them.

Apologies for my idiosyncratic way of lower-camel-casing the "re" bit on the functions... I cannot bring myself to write a function with an initial capital letter. I know it's a bit odd.

Basics of `reFind()`

reFind() (see even at the beginning of a sentence... no cap... ;-) works in one of two ways. The most basic is just to work like find() does: return the index of where a match was found:

idx = reFind("\d", "ABC123");

This returns 4 (the position the first digit (\d) - "1" - is at). This operation is not much use beyond that, to be honest. One cannot reliably tell what was matched, because obviously it's a pattern being matched, but also because the length of a match cannot necessarily be inferred from the pattern, so one can't necessarily use mid() to extract the match like one might be able to with a straight string find (one knows the length of the substring being sought, after all). For example, there's no way to infer from a match of "\d*" whether the match was one digit or ten digits. Or indeed zero digits. So it's only good for a throwaway "was it there?" check.

Note that it returns 0 if there was no match (even a zero-width match before the first character of the string results in "1" being returned).
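To make the return values concrete, here's a JavaScript approximation of the basic form. reFindSketch() is a made-up name, and it only models what's described above (one-based position, 0 for no match); it isn't CFML's implementation, and the regex dialects differ in places.

```javascript
// Hypothetical model of basic reFind(): one-based position of the first match, or 0.
function reFindSketch(pattern, input, start = 1) {
    const re = new RegExp(pattern, "g");
    re.lastIndex = start - 1; // JS positions are zero-based
    const m = re.exec(input);
    return m === null ? 0 : m.index + 1;
}

console.log(reFindSketch("\\d", "ABC123"));  // 4
console.log(reFindSketch("\\d", "ABC"));     // 0: no match
console.log(reFindSketch("\\d*", "ABC123")); // 1: zero-width match before the first character
```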

Sub-expressions

Fortunately there's another argument reFind() can take to make it actually useful.

match = reFind("\d{2}", "ABC123", 1, true);

This time, we get back more info:

struct

LEN

array
1	2

POS

array
1	4

We get back a struct with keys pos and len, and each of those is an array. The values in the array enable us to extract what exactly was matched, eg:

s = "ABC123";
matches = reFind("\d{2}", s, 1, true);

match = mid(s, matches.pos[1], matches.len[1]);

This results in match having a value of "12": this was the two-digit substring matched by "\d{2}".

The third argument to reFind() is the position in which to start the match (same as the equivalent for find()), and it's not actually relevant here, but because CFML only supports named arguments on a few functions (I'm not sure why some do and some don't?), we need to specify it so as to be able to also specify the argument we're interested in: the true. The fourth argument is a flag as to whether to return subexpression info, and using true means one gets back that struct rather than just the integer index.

On initial take, one might wonder why the pos and len values are arrays, when there's only one element in the array. Well cast your mind back to the discussion on sub-expressions: a sub-expression's match can be "remembered"... remembered for both a back-reference later in the pattern or in a replacement pattern, but also it can be returned from a find operation too. Here's an example.

numbers = "one two three four five six seven eight nine ten";
regex = "\b\w([aeiou]{2})\w\b";

matches = reFind(regex, numbers, 1, true);
match = mid(numbers, matches.pos[1], matches.len[1]);
subexpression = mid(numbers, matches.pos[2], matches.len[2]);

writeDump(variables);

(That pattern matches a four letter word with two vowels in the middle)

struct

MATCH

four

MATCHES

struct

LEN

array
1	4
2	2

POS

array
1	15
2	16

NUMBERS

one two three four five six seven eight nine ten

REGEX

\b\w([aeiou]{2})\w\b

SUBEXPRESSION

ou

See that the arrays now have two elements:

the first element is the overall match;
the second element is the match of the sub-expression.

There will be as many elements in the array as there are sub-expressions (plus the first one which is the entire match).
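The same pos / len shape can be mimicked in JavaScript, which might help if the arrays seem a bit abstract. This sketch assumes an engine that supports the "d" (match indices) flag, such as a recent Node.js; reFindSubs() is a hypothetical name, and it's a model of the output described here rather than a faithful reimplementation of reFind().

```javascript
// Hypothetical model of reFind(..., true): one-based pos and len arrays,
// element 1 for the whole match, then one element per sub-expression.
function reFindSubs(pattern, input, start = 1) {
    const re = new RegExp(pattern, "dg"); // d: record start/end indices of each group
    re.lastIndex = start - 1;
    const m = re.exec(input);
    if (m === null) {
        return { pos: [0], len: [0] };
    }
    const pos = [];
    const len = [];
    for (const span of m.indices) {
        // a group that took no part in the match has no span: report 0 / 0
        pos.push(span === undefined ? 0 : span[0] + 1);
        len.push(span === undefined ? 0 : span[1] - span[0]);
    }
    return { pos, len };
}

const numbers = "one two three four five six seven eight nine ten";
console.log(reFindSubs("\\b\\w([aeiou]{2})\\w\\b", numbers));
// { pos: [ 15, 16 ], len: [ 4, 2 ] }
```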

One thing to bear in mind with the "there will be as many elements in the array as there are sub-expressions" is that there will be an element in the array even if the sub-expression could possibly not be matched. Here's an example:

skuPattern = "^(\d{5})([A-Z]{2})?$";

skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);
writeDump(matches);

writeOutput("<hr />");

skuVariant = "12345AB";
matches = reFind(skuPattern, skuVariant, 1, true);
writeDump(matches);

Here an SKU can take one of two patterns: nnnnn or nnnnnxx where n is a digit and x is a letter. The pattern matches both of those, and in the case of the nnnnnxx version, it also grabs the xx as a sub-expression. We can see this in the results:

struct

LEN

array
1	5
2	5
3	0

POS

array
1	1
2	1
3	0

struct

LEN

array
1	7
2	5
3	2

POS

array
1	1
2	1
3	6

Note how for the "basic" SKU, there's still an array element for the xx part of the pattern, even though it wasn't matched. So bear in mind to check that the pos value is >0 before trying to extract a substring for the sub-expression. Otherwise you'll get this (which will become a very familiar error for you):

The 2 parameter of the Mid function, which is now 0, must be a non-negative integer


The error occurred in D:\websites\www.scribble.local\junk\junk.cfm: line 8
6 : writeDump(matches);
7 :
8 : variant = mid(skuBasic, matches.pos[3], matches.len[3]);

(nice pidgin English from Adobe there with the error message btw: "the 2 parameter"??)

A variation on this is that a match can be made, but it can be zero length. One needs to be mindful of this too. Consider this example, which shows all four of the possibilities I mentioned:

struct function reFindMatches(required string regex, required string string){
    var result = reFind(regex, string, 1, true);

    result._match = [];
    for (var i=1; i <= arrayLen(result.pos); i++){
        if (result.pos[i] == 0){
            arrayAppend(result._match, "NO MATCH");
        }else if (result.len[i] == 0){
            arrayAppend(result._match, "ZERO-LENGTH MATCH");
        }else{
            arrayAppend(result._match, mid(string, result.pos[i], result.len[i]));
        }
    }
    return result;
}


skuPattern = "^([A-Z]{3})(?:-(?:DEFAULT|(\d*)))?$";

skuVariant = "ABC";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-DEFAULT";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-123";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

Here I've added a function reFindMatches() that improves reFind() in that as well as returning the pos and len of what was matched, it also returns what actually was matched. Which it ought to do off the bat, as all one would ever want pos / len for was to extract the matched substring. Indeed I have raised an E/R to augment reFind() and deprecate the mostly useless reMatch(): 3321666.

I've also used a pattern that matches four variations of SKU:

Just the basic "three letters" version;
Three letters with the "default" variant;
Three letters with a custom numeric variant;
Three letters with a zero-length numeric variant (this one stretches credulity, I know, sorry).

Here's the results:

123 - struct

LEN

123 - array
1	0

POS

123 - array
1	0

_MATCH

123 - array
1	NO MATCH

ABC - struct

LEN

ABC - array
1	3
2	3
3	0

POS

ABC - array
1	1
2	1
3	0

_MATCH

ABC - array
1	ABC
2	ABC
3	NO MATCH

ABC-DEFAULT - struct

LEN

ABC-DEFAULT - array
1	11
2	3
3	0

POS

ABC-DEFAULT - array
1	1
2	1
3	0

_MATCH

ABC-DEFAULT - array
1	ABC-DEFAULT
2	ABC
3	NO MATCH

ABC-123 - struct

LEN

ABC-123 - array
1	7
2	3
3	3

POS

ABC-123 - array
1	1
2	1
3	5

_MATCH

ABC-123 - array
1	ABC-123
2	ABC
3	123

ABC- - struct

LEN

ABC- - array
1	4
2	3
3	0

POS

ABC- - array
1	1
2	1
3	5

_MATCH

ABC- - array
1	ABC-
2	ABC
3	ZERO-LENGTH MATCH

This demonstrates various permutations of matches:

Zero pos and zero len in the first element: no match at all. Equiv to a 0 result from a standard find() operation.
Zero pos and len in one of the other elements: no match for that specific sub-expression, but overall there was still a match. This implies the subexpression was optional.
A positive value for both pos and len: there was a substring match.
A positive value for pos, but a zero length. There was a match, but it was zero-length.

Hopefully that range of unduly repetitious examples demonstrates the types of matches (or lack thereof) one can expect from reFind().
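For comparison, here's roughly how the same classification falls out in JavaScript, where an unmatched sub-expression comes back as undefined and a zero-length one as an empty string. describeGroups() is a hypothetical helper that borrows the labels from reFindMatches() above.

```javascript
// Label each element of a match: the text, "ZERO-LENGTH MATCH", or "NO MATCH".
function describeGroups(pattern, input) {
    const m = input.match(new RegExp(pattern));
    if (m === null) {
        return ["NO MATCH"]; // nothing matched at all
    }
    return Array.from(m, (group) => {
        if (group === undefined) return "NO MATCH"; // optional group not used
        if (group === "") return "ZERO-LENGTH MATCH"; // group used, but matched nothing
        return group;
    });
}

const skuRegex = "^([A-Z]{3})(?:-(?:DEFAULT|(\\d*)))?$";
for (const sku of ["123", "ABC", "ABC-DEFAULT", "ABC-123", "ABC-"]) {
    console.log(sku, describeGroups(skuRegex, sku));
}
```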

Starting position

The other argument I glossed over was the "starting position" one - the third argument - there's not much to say about this other than to be aware of it. It works the same as with find(). One thing to note, which kind of demonstrates why not using the fourth argument as true with reFind() isn't very useful, is that if one wants to cycle through all the matches in a string, the normal approach is to do this sort of thing:

// pseudo code
set findStartPos = 1
while (findOperation matches something){
    process the match
    set findStartPos = after last char of this match
}

This step is not really possible with a lot of patterns (ones that don't necessarily match a fixed length substring), unless one returns sub-expressions. Without the sub-expression data being returned, one has only the start position of the match, but one doesn't know how long it is, so one can't skip past it.
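Here's that loop as a runnable JavaScript sketch, to show why both the position and the length of each match are needed to work out where to resume. findAll() is a made-up helper; the Math.max() is my own addition so that a zero-length match can't stall the loop.

```javascript
// Cycle through all matches by restarting the search just past the previous match.
function findAll(pattern, input) {
    const re = new RegExp(pattern, "g");
    const found = [];
    let start = 0;
    while (start <= input.length) {
        re.lastIndex = start;
        const m = re.exec(input);
        if (m === null) {
            break; // nothing more to find
        }
        found.push(m[0]);
        // position + length = first char after this match (always advance at least 1)
        start = m.index + Math.max(m[0].length, 1);
    }
    return found;
}

console.log(findAll("\\d+", "a1bb22ccc333")); // [ '1', '22', '333' ]
```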

`reFindNoCase()`

This function is kinda redundant, given one can simply specify (?i) in the regex to make it case-insensitive anyhow.

Blimey. I was hoping to cover all the CFML regex functionality in one article, but I'm up to 1800 words on just reFind(). So I'm gonna stop here, before you nod off (if you haven't already), and continue with reReplace() in the next article.

Until the next time...

--
Scheherazade

Thursday, 3 January 2013

Regular expressions in CFML (part 6: syntax - flags and the odds 'n' sods that are left)

G'day:
This is part six of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with:
"Regular expressions in ColdFusion (part 3: syntax - single characters)";
"Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)";
"Regular expressions in ColdFusion (part 5: syntax - look-arounds, and how the engine parses the string it's matching)".

Flags

There are a few different "modes" the regex engine can use when processing a string. These are summarised as follows:

Flag	Meaning
(?x)	This flags to ignore whitespace within the pattern, so one can split it across multiple lines and add comments (for clarity).
(?m)	This specifies the string being matched within should be treated as multi-line, so the ^ and $ anchors can be used to denote the beginning and end of a line. The \A and \Z characters still denote the beginning and end of the entire string.
(?i)	This specifies the pattern should be considered case-insensitive. This is somewhat redundant in CFML as there are separate functions for case-sensitive and case-insensitive operations. This still works though.
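As a quick illustration of what the multi-line and case-insensitive modes change, here's a JavaScript sketch. Bear in mind the notation differs: JS takes these as trailing flags on the regex (m, i) rather than the inline (?m) / (?i) syntax in the table, and as far as I know JS has no equivalent of (?x), \A or \Z, so those aren't shown.

```javascript
const lines = "one\ntwo\nthree";

// Without the m flag, ^ and $ only anchor to the start and end of the whole string.
const singleLine = /^two$/.test(lines);
// With the m flag, ^ and $ also anchor to the start and end of each line.
const multiLine = /^two$/m.test(lines);
// The i flag makes the pattern case-insensitive.
const caseInsensitive = /^TWO$/mi.test(lines);

console.log(singleLine, multiLine, caseInsensitive); // false true true
```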

Regular expressions in CFML (part 5: syntax - look-arounds, and how the engine parses the string it's matching)

G'day:
This is part five of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)" and "Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)".

Please note that this article - more so than the other ones - does not stand-alone. It's more a "part 2" of the preceding article. I advise reading at least that one before reading this one. But, really, one should read the whole lot in order. But bring food and water with you before starting.

Look-arounds

Another type of sub-expression is a look-around. Look-arounds do what the name suggests: they look around to see if there's a match (or there specifically isn't a match). They can look ahead from the current position in the string being processed, or they can look behind. So we have four different sorts of look-around:

positive look-ahead;
negative look-ahead;
positive look-behind;
negative look-behind.

ColdFusion's regex engine only supports look-aheads (both positive and negative ones).
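To give a feel for the four sorts before getting into the mechanics, here's a JavaScript sketch. JS's engine happens to support look-behinds as well, so all four can be shown; per the above, only the two look-ahead forms apply to ColdFusion's engine. The sample strings are just made up for the illustration.

```javascript
// positive look-ahead: a q that IS followed by a u
const posAhead = [/q(?=u)/.test("quit"), /q(?=u)/.test("Iraq")];

// negative look-ahead: a q that is NOT followed by a u
const negAhead = [/q(?!u)/.test("quit"), /q(?!u)/.test("Iraq")];

const sentence = "costs $42 for 7 items";

// positive look-behind: digits that ARE preceded by a dollar sign
const posBehind = sentence.match(/(?<=\$)\d+/)[0];

// negative look-behind: a whole number that is NOT preceded by a dollar sign
const negBehind = sentence.match(/(?<!\$)\b\d+\b/)[0];

console.log(posAhead, negAhead, posBehind, negBehind);
// [ true, false ] [ false, true ] 42 7
```

Note that in each case the look-around itself is not part of what gets matched: it's only a condition on what surrounds the match.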

Before I explain how these work, let's back up a bit and examine what goes on when a regular expression pattern matching exercise takes place on a given string.

Regular expressions in CFML (part 4: syntax - repetition, sub-expressions and back-references)

G'day:
This is part four of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)".

Repetition (again)

In a previous article I listed the types of repetition one can express in a regex:

Zero times
Zero or one times
Zero or more times
One time
One or more times
An exact number of times
Between a specified minimum and maximum number of times
At least a minimum number of times
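As a refresher, most of those map onto the usual quantifier syntax. A JavaScript sketch follows; the quantifiers shown (?, *, +, {n}, {n,m}, {n,}) are common to most regex dialects, but the a/b/c test strings are just made up for the illustration.

```javascript
// Each entry: [description, regex, input, whether it should match]
const repetitionExamples = [
    ["zero or one times: b?", /^ab?c$/, "ac", true],
    ["zero or more times: b*", /^ab*c$/, "abbbc", true],
    ["one time: b", /^abc$/, "abbc", false],
    ["one or more times: b+", /^ab+c$/, "ac", false],
    ["exactly twice: b{2}", /^ab{2}c$/, "abbc", true],
    ["two to three times: b{2,3}", /^ab{2,3}c$/, "abbbbc", false],
    ["at least twice: b{2,}", /^ab{2,}c$/, "abbbbc", true],
];

for (const [description, regex, input, expected] of repetitionExamples) {
    console.log(description, JSON.stringify(input), regex.test(input) === expected ? "as expected" : "SURPRISE");
}
```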

Regular expressions in CFML (part 3: syntax - single characters)

G'day:
This is part three of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)".

Syntax

OK, so I'm now going to try to describe how all those concepts are actually reflected in regex syntax. This is where I am going to dispense with most of the keys on the keyboard, and just use the punctuation keys ;-)

(not really)

I'll go through each of those concepts in turn.

Regular expressions in CFML (part 2: concepts)

G'day:
This is part two of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)". Initially I set out to only write one article, but by the time it was over 8000 words long, I figured I should split it up and serialise it otherwise no-one would be brave enough to read the whole thing. If you see references in the text to "see above" or "see below", it might refer to something in one of the other articles. I'll try to find them all and replace with links, but no-doubt I'll miss some.

Concepts / components of a regular expression

A regular expression is built out of all those seemingly random sequences of punctuation characters. Taken en masse, that's quite impenetrable, and it's good to take a step back to consider the various notions / components / building blocks that are important to regexes. This is just a narrative / conceptual discussion, rather than delving into syntax, just for the moment. Knowing about the concepts is more important than the minutiae of the syntax, and once we have a handle on the concepts, the syntactical vagaries will make more sense, and the expressions themselves will seem less daunting. In theory ;-)

Regular expressions in CFML (part 1: overview)

G'day:

Before I start
This started off being a single article, but it ended up way too long. And by the time I've come to be divvying it up, I'm thoroughly fed-up with it, so I'm not going to do the usual book-ending of each "sub-article", I'm just gonna fairly arbitrarily cut the thing into sections, and post each section as a new article. So it's gonna be best to start at the beginning and work through to the end, as the subsequent articles might not be fully contextualised in and of themselves. I'm also gonna serialise the thing over a few days, rather than releasing them all at once. Anyway... hold on to yer hat... regular expressions...

I've been mulling-over writing this article since I first started this blog. On one hand I'm fairly good with regular expressions, and a lot of ColdFusion developers (most of the ones I have encountered, anyhow) are not. So there's potential for a teaching exercise there. On the other hand Ben Nadel has banged-on so much about regexes in the past that one might think there's little else left to say on the matter.

About

I've been a web developer since 2001: 13yrs as a CFML developer; 6yrs as a PHP dev; and now back adjacent to the CFML community but focusing on code quality, design and testing. As of 2023, I am migrating my dev focus back to PHP again.

The code I write and discuss here is pretty much just looking at random conundrums I encounter in my day job.

I tend to be a bit "forthright" in my opinions, I am indelicate, and I tend to swear too much. This will come out occasionally here: I make no apology for it.

Everything said here is my own opinion. Feel free to disagree with me :-)
