Tuesday 18 December 2012

Regular expressions in CFML (part 1: overview)

G'day:

Before I start
This started off being a single article, but it ended up way too long. And by the time I've come to be divvying it up, I'm thoroughly fed-up with it, so I'm not going to do the usual book-ending of each "sub-article", I'm just gonna fairly arbitrarily cut the thing into sections, and post each section as a new article. So it's gonna be best to start at the beginning and work through to the end, as the subsequent articles might not be fully contextualised in and of themselves. I'm also gonna serialise the thing over a few days, rather than releasing them all at once. Anyway... hold on to yer hat... regular expressions...

I've been mulling-over writing this article since I first started this blog. On one hand I'm fairly good with regular expressions, and a lot of ColdFusion developers (most of the ones I have encountered, anyhow) are not. So there's potential for a teaching exercise there. On the other hand Ben Nadel has banged-on about regexes so much in the past one might think there's little else left to say on the matter.


The reason I made the decision to write the article anyway is that I just saw (ed: at time of writing. This was a coupla weeks ago now) a throw-away comment on Twitter, which got me thinking about it again. Carol Hamilton shared this popular regex quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

- Jamie Zawinski

It's a cool quote, and neatly summarises a lot of people's position on regexes - those who aren't comfortable with them - or gets a knowing chuckle from those who are comfortable with them.

Given this quote is still being trotted-out, there's clearly still some work to be done in getting people up to speed with them. Because they're seriously not as complicated as all that.

(I hasten to add I am drawing no conclusions as to Carol's comfort-levels with regular expressions: I do not know Carol and have never worked with her. She's just a name on Twitter: I follow her because she mentions she's a ColdFusion developer in her profile. I just mean that the quote does have relevance still).

First, how did I get comfortable with regexes? Back in my first ColdFusion job (about 2001), one of our lead developers used the plain replace() functions interchangeably with their regex counterparts, ie: sometimes they'd use replace(), sometimes they'd use reReplace(), seemingly at whim. Based on looking at their code, I could not tell the difference, and didn't know when one was supposed to use one, or when one was supposed to use the other. I was a ColdFusion beginner at the time and wanted to understand this, so I made the conscious decision to find out what the difference between the two sets of functions was, and when to use the "regular" function, and when to use the other one. Like I said, I was new to ColdFusion, and had never heard of regular expressions before.

Aside: it turns out the lead dev had no idea either, and there was no good reason for them to be using them: they were equally unaware of regexes, and only ever used either function for standard string operations. Oops.

The first hurdle I had in learning about regexes was being confronted by something like this:

^/(?:[^/]+)/([^/]+)/([^/]+)/(.*)$

The only sensible reaction I can think of when seeing something like that for the first time is "what the... heck?" When I look at source code in a language I'm not too au fait with, I can quite often kinda get a vague idea of what's going on; but that regular expression above just looks like a mostly random sequence of punctuation marks, assembled in a way that will be meaningless to anyone who doesn't already know what it says. So how the hell is someone supposed to learn how all that works? I have a feeling that this is what happens to most people, and they think that learning regexes is gonna be akin to learning assembler. So they just go "nuh-uh" and look at their other screen where they can make a nice <cfpod> appear in the browser.

But... really... it's not actually that hard, and whilst it does take a while before having that <keanu>Whoa: I know regex</keanu> moment, one can get fairly proficient fairly quickly. All it takes is to not expect to know everything at once, and take small steps.

What are Regular Expressions all about?

First step. What's the whole idea behind regular expressions? This is a valid first question, because the name doesn't really mean much: we all know what an expression is, but what's "regular" about a regular expression? ("f***-all" in my opinion). If anything my reaction was precisely that: if that's a regular expression, I sure as hell don't want to see an irregular one.

Regular expressions are all about pattern matching. That's all they do. And the punctuation shenanigans is just a way of reflecting how to define a pattern to match.

What constitutes a pattern?

OK, but when does one want to match a pattern? Well: more often than one might think, and it all depends on what one perceives as a pattern: lots of textual things follow a pattern, or a scheme or some sort of generalised "approach". Take CFML tags as an example: all of them have a specific pattern. They all start with <cf and all end with >. And have stuff in between. So the pattern there is "starts with <cf, has some stuff, then ends with a >". That's a pattern. Obviously we can't expect a human-language description like that to be usable to signify that pattern to a computer, so regular expressions have a formalised syntax and grammar. A simple (but not thorough) solution to expressing a CFML tag with a regex might be:

<cf.*>

Regexes work kinda like CFML does in a CFM file: there can be a mix of mark-up or other text, as well as CFML, and ColdFusion only "pays attention" to the CFML. A regex is much the same (after a fashion): in the above regular expression "<cf" means, literally, "<cf", and the ">" likewise just means ">". The only regex-y bit is the ".*", which means any single character of any description (".") zero or more times ("*"). Don't worry about the dots and the asterisks just yet, I'll get on to that in more detail later on.
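To see that pattern doing its thing, here's a rough sketch. A couple of assumptions on my part: I'm using Python's re module rather than CFML, purely because it's easy to run standalone (the pattern itself is unchanged), and the sample string is made up for the purposes of illustration.

```python
import re

# Hypothetical contents of a CFM file: mark-up with some CFML mixed in
source = '<p>Hello</p>\n<cfset name = "Zachary">\n<cfoutput>#name#</cfoutput>'

# "<cf" and ">" are literal text; ".*" is "any character, zero or more times".
# NB: in Python, "." skips newline characters unless told otherwise,
# so each match here stays on a single line.
pattern = re.compile(r"<cf.*>")

matches = pattern.findall(source)
for match in matches:
    print(match)
```

This finds the <cfset> tag OK, but on the last line it grabs everything from <cfoutput all the way to the final >, closing tag and all. The ".*" takes as much as it can get, which is why I described this regex as simple but not thorough.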

Back to what constitutes "a pattern". There's some less obvious things that do.
  • A sentence has a pattern: it starts with a capital letter, is followed by some characters, and ultimately ends with a full-stop. That's a pattern.
  • My favourite part of English vocabulary - four letter words - follow a pattern: a word that has four letters. Yes, I know that not all "four letter words" have four letters, and not all four-letter words are four letter words. The point being, "a word with four letters" is a pattern.
  • A file path is a pattern: using Windows as an example, it starts with a drive letter, has a colon, then is a series of directory names delimited by back-slashes, followed by a file name. That's a pattern.
  • Sticking with files... images. Web-browser-friendly image file names could be a pattern: any sequence of characters followed by a dot then one of PNG / JPG (or maybe jpeg?) / GIF / SVG. That's a pattern.

The point I'm trying to make here is that the notion of a pattern is broader than just something like a UK post code, or a social-security number, or a GUID or something that obviously follows a prescribed format. So don't think regular expressions are basically only for string-format validation. They're a very powerful tool for general string processing / extraction / etc.
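
Just to show that those four human-language descriptions can each be written down as a regex, here's a quick sketch. These patterns are my own deliberately naive guesses for the purposes of illustration, not thorough solutions, and again I'm assuming Python's re module rather than CFML so the code can be run as-is. The sample strings are made up. Don't sweat the syntax yet.

```python
import re

# Starts with a capital letter, has some characters, ends with a full-stop
sentence = re.compile(r"[A-Z][^.]*\.")

# A word that has exactly four letters
four_letter_word = re.compile(r"\b[A-Za-z]{4}\b")

# Drive letter, colon, back-slash-delimited directory names, then a file name
windows_path = re.compile(r"[A-Za-z]:\\(?:[^\\]+\\)*[^\\]+")

# Any sequence of characters, a dot, then one of the image extensions
image_file = re.compile(r".+\.(?:png|jpe?g|gif|svg)", re.IGNORECASE)

is_sentence = bool(sentence.fullmatch("That's a pattern."))
words = four_letter_word.findall("Some four letter words are rude")
is_path = bool(windows_path.fullmatch(r"C:\webroot\images\logo.png"))
is_image = bool(image_file.fullmatch("holiday.JPEG"))
is_not_image = bool(image_file.fullmatch("notes.txt"))

print(is_sentence, words, is_path, is_image, is_not_image)
```

Each of these will fall over on real-world edge cases (a sentence containing "e.g.", say), but the point stands: if one can describe the pattern, one can generally express it as a regex.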

[to be continued...]

--
Adam