Showing posts with label Regular expressions. Show all posts

Tuesday, 1 December 2015

ColdFusion: I learn something about query of query

G'day:
Just a quick one (I'm supposed to be doing Clojure this morning, not CFML). Here's something I did not know about QoQ in CFML. Well: in ColdFusion's implementation of QoQ. Its LIKE operator supports (very limited) regex patterns in its value.

Here's an example:

colours = queryNew("id,en,mi", "integer,varchar,varchar", [
    [1,"red","whero"],
    [2,"orange","karaka"],
    [3,"yellow","kowhai"],
    [4,"green","kakariki"],
    [5,"blue","kikorangi"],
    [6,"indigo","poropango"],
    [10,"violet","papura"]
]);

coloursWithOorU = queryExecute(
    "SELECT * FROM colours WHERE mi LIKE :pattern",
    {pattern={value="%[ou]%"}},
    {dbtype="query"}
);

writeDump(var=coloursWithOorU, format="text", metainfo=false);

And the result:

query

 
[Record # 1] 
en: red 
id: 1 
mi: whero
 
[Record # 2] 
en: yellow 
id: 3 
mi: kowhai
 
[Record # 3] 
en: blue 
id: 5 
mi: kikorangi
 
[Record # 4] 
en: indigo 
id: 6 
mi: poropango
 
[Record # 5] 
en: violet 
id: 10 
mi: papura

Cool. Note this does not work on Lucee.

I dunno what the grammar of the patterns is, but it's not simply standard CFML regex patterning. For example, initially I tried to have a pattern which would match words of six letters or more (ie: .{6}), but that didn't work. I was gonna say "it'd be really grand if Adobe actually documented this stuff", but actually they have! It's right there on the "Query of Queries user guide" page. OK, so the grammar is very limited. Just to what I've shown, basically: single character classes. It doesn't even support repetition modifiers. So it's a bit disappointing that the grammar is so limited, but it's handy nevertheless.

Thanks to Tim Brenner on the CFML Slack Channel for bringing this to my attention!

--
Adam

Thursday, 10 July 2014

Regex help please

G'day:
I'm hoping Peter Boughton or Ben Nadel might see this. Or someone else who is good @ regular expression patterns that I'm unaware of.

Here's the challenge...



Given this string:

Lorem ipsum dolor sit

I want to extract the leading sub-string which is:
  • no more than n characters long;
  • breaks at the previous whole word, rather than in the middle of a word;
  • if no complete single word matches, then matches at least the first word, even if the length of the sub-string is greater than n.

I've come up with this:

// trimToWord.cfm
string function trimToWord(required string string, required numeric index){
    return reReplace(string, "^((?:.{1,#index#}(?=\s|$)\b)|(?:.+?\b)).*", "\1", "ONE");
}

It works, but that regex is a bit hoary.

Here's a visual representation of it (courtesy of regexper.com), by way of explanation:



Anyone fancy improving it for me?

Here's some unit tests to run your suggestions through:

Thursday, 22 May 2014

CFML: Regex to the rescue again

G'day:
Once again (prev: "Regex for simplifying string manipulation logic") I found myself able to slough off a coupla dozen lines of code in a CFLib.org UDF, by using a regex (well: two) in place of a bunch of looping and branching logic.

There's nothing new in this article, but a good real-world demonstration of where regexes can replace logic.


This UDF (titleCaseList()) got on my radar because someone mentioned a bug it had today, and looking at the comments, there was an earlier bug still outstanding. So I decided to quickly fix it. Here's the code for the previous version:

Monday, 21 April 2014

Regex for simplifying string manipulation logic

G'day:
An interesting blog article fell in front of me this morning: "Capitalization for us Mc’s and Mac’s!", by Brian McGarvie. It mentions a UDF on CFLib.org which handles... well as per his blog title: capitalising his name as "McGarvie" rather than "Mcgarvie" like other capitalise() functions might do.

Thursday, 26 December 2013

Identifying when a regex pattern matches at position zero in a string

G'day:
Sean asked on Twitter today for input into how to best handle a shortfall in some TestBox functionality. The TestBox Jira ticket is this one (it also contains basically what I'm about to say here!): "toThrow() cannot match empty message & cannot match detail". The reason for the first stated problem is because of the way CFML's reFind() function works.

Consider this code:

// reFind.cfm
param name="URL.input" default="";
pattern = "^\d*$";
match = refind(pattern, URL.input);
writeDump(var=[{input=URL.input},{match=match}]);

The gist of this is that the reFind() call checks to see whether URL.input is a string which comprises (in its entirety) zero or more digits.

Here are the results with a few test inputs:

array
  1: struct
       INPUT: 123
  2: struct
       MATCH: 1

array
  1: struct
       INPUT: abc
  2: struct
       MATCH: 0

array
  1: struct
       INPUT: 12c
  2: struct
       MATCH: 0

array
  1: struct
       INPUT: 1b3
  2: struct
       MATCH: 0

So this shows that it only lets solely-numeric values past: values like 1b3 return a zero match. OK, so far so good.

But hang on. The pattern was for zero or more digits. So zero digits is also legit here. Let's try giving an empty string (which is indeed zero digits) as the input:


array
  1: struct
       INPUT: [empty string]
  2: struct
       MATCH: 0

This is correct. The match of zero digits occurs at position zero in that empty string. That's a match. However... reFind() returns 0 if no match was found. As it turns out, this was not sensible behaviour, because it means that there's no way to distinguish between no match, and a match at position zero. Oops.
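One possible way around that ambiguity (a sketch only, and not necessarily what TestBox ended up doing) is to bypass reFind() and ask Java's regex engine directly, as it returns a boolean rather than a position. The helper name regexMatches() is made up for this example, and bear in mind Java's regex dialect is not identical to the one CFML's own functions use:

```cfml
// hypothetical helper: true if the regex matches anywhere in the input,
// including a zero-length match (uses java.util.regex, not CFML's engine)
boolean function regexMatches(required string regex, required string input){
    var pattern = createObject("java", "java.util.regex.Pattern").compile(regex);
    return pattern.matcher(input).find();
}

writeDump(regexMatches("^\d*$", ""));    // true: zero digits is still a match
writeDump(regexMatches("^\d*$", "abc")); // false: genuinely no match
writeDump(reFind("^\d*$", ""));          // 0, which on its own looks like "no match"
```

The point is just that a boolean result cannot be confused with a position, which is the information toThrow() needs.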

Tuesday, 15 October 2013

CFCamp: Quick unicode regex code snippet

G'day:
A question just came up in Kai's "Regular Expression Clinic" at CFCamp about unicode support in regular expressions. This is important in countries like Germany where they have a lot of non-ASCII characters in regular use in everyday words, eg: letters with umlauts above them ("ö"), or double-S characters ("ß"), etc. I didn't quite know the answer, so I had a look and came up with some code...

Saturday, 6 April 2013

Cool Regex Tool!

G'day:
A real quick one. One of my cronies - Simon Baynes - just pointed me to this cool regex visualising tool. Here's a screen shot of how it analyses a regex:

Monday, 1 April 2013

Regular expressions in CFML (part 10: Java support for regular expressions (2/3))

G'day:

Mon 18 March 2013
Once again - for reasons better described over a few pints at the pub rather than in a blog article - I find myself winging southwards from London to Auckland. This time not for a holiday stint of a coupla weeks, but for over a month whilst I finalise some paperwork which requires me to be in NZ. Or more to the point: requires me to be not in the UK. I left the safety of London today having read in the news this morning that Auckland had two earthquakes yesterday. Only baby ones, but still... I think NZers are a bit hesitant about their nation living up to its nickname of the Shaky Isles these days after Christchurch was flattened a coupla years ago. And Auckland doesn't traditionally have earthquakes, so to get any - even if small ones - is a bit concerning. In the back of my mind I am (melodramatically, and unnecessarily) concerned there's still something to land on when I get there. So, yeah... in the back of my mind I am hoping these earthquakes in Auckland stay around the 3 / 4 level on the MMS.

None of this has anything to do with regular expressions in Java, sorry.

Tuesday, 5 March 2013

Regular expressions in CFML (part 9: Java support for regular expressions (1/3))

G'day:
Jetlag and stupidity are a fine combination of things. I'm off to Ireland today to see my son, so am currently on the Tube heading to LHR. It's just after 6am, and not being a Saturday morning person, I generally pass the time on this 1.5h journey (which I do every 2-3wks) by dozing and listening to a podcast. Two flaws with this plan are that I am still not readjusted to being back on GMT as my brain thinks I am still in NZ, so I am wide awake, and have been since 4am; secondly I (this is the stupidity part) have left my headphones at home. So... what to do? Well: continue this series of articles on regular expressions (which has been on hiatus, as you might have noticed). I guess it's good cos I'll get about 1h of productive typing during the journey.

Thursday, 17 January 2013

Regular expressions in CFML (part 8: the rest of CFML's support for regular expressions)

G'day:

[Copy-and-paste-from-the-previous-article alert!]

The previous seven entries in this series discuss what a regular expression is:
And the syntax of the regex engine ColdFusion uses:
Now I have moved on to discuss the specifics of how CFML implements regular expression technology in the language:
This article will cover the rest of the functions / tags in CFML which utilise regular expressions.

Thursday, 10 January 2013

Regular expressions in CFML (part 7: reFind())

G'day:
The previous six entries in this series discuss what a regular expression is:
And the syntax of the regex engine ColdFusion uses:
So far, I've not really mentioned CFML in any of my examples: I've just dealt with the regex syntax. In this article (update: and the subsequent few) I'll look at the CFML tags and functions which use regular expressions.

The most basic operation one can perform with a regular expression in CFML is to simply check to see if the regex matches anything in a string. This is done with reFind() and its case-insensitive counterpart reFindNoCase(). These functions work exactly the same except for the case-sensitivity, so I'll just dwell on reFind(). Similarly for reMatch() and reReplace() later on: their ~NoCase() counterparts work the same, so other than identifying they exist, I'll comment no further on them.

Apologies for my idiosyncratic way of lower-camel-casing the "re" bit on the functions... I cannot bring myself to write a function with an initial capital letter. I know it's a bit odd.

Basics of reFind()

reFind() (see even at the beginning of a sentence... no cap... ;-) works in one of two ways. The most basic is just to work like find() does: return the index of where a match was found:

idx = reFind("\d", "ABC123");

This returns 4 (the position the first digit (\d) - "1" - is at). This operation is not much use beyond that, to be honest. One cannot reliably tell what was matched, because obviously it's a pattern being matched, but also because the length of a match cannot necessarily be inferred from the pattern, so one can't necessarily use mid() to extract the match like one might be able to with a straight string find (one knows the length of the substring being sought, after all). For example, there's no way to infer from a match of "\d*" whether the match was one digit or ten digits. Or indeed zero digits. So it's only good for a throwaway "was it there?" check.

Note that it returns 0 if there was no match (even a zero-width match before the first character of the string results in "1" being returned).
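To make those three outcomes concrete, here's a small sketch (the input strings are just made up for the example; the return values are the ones described above):

```cfml
writeOutput(reFind("\d", "ABC123"));  // 4: position of the first digit
writeOutput(reFind("\d", "ABC"));     // 0: no digit, so no match
writeOutput(reFind("\d*", "ABC"));    // 1: a zero-width match before the "A"
```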

Sub-expressions

Fortunately there's another argument reFind() can take to make it actually useful.

match = reFind("\d{2}", "ABC123", 1, true);

This time, we get back more info:

struct
LEN: array
  1: 2
POS: array
  1: 4

We get back a struct with keys pos and len, and each of those is an array. The values in the array enable us to extract what exactly was matched, eg:

s = "ABC123";
matches = reFind("\d{2}", s, 1, true);

match = mid(s, matches.pos[1], matches.len[1]);


This results in match having a value of "12": this was the two-digit substring matched by "\d{2}".

The third argument to reFind() is the position in which to start the match (same as the equivalent for find()), and it's not actually relevant here, but because CFML only supports named arguments on a few functions (I'm not sure why some do and some don't?), we need to specify it so as to be able to also specify the argument we're interested in: the true. The fourth argument is a flag as to whether to return subexpression info, and using true means one gets back that struct rather than just the integer index.

On initial take, one might wonder why the pos and len values are arrays, when there's only one element in the array. Well cast your mind back to the discussion on sub-expressions: a sub-expression's match can be "remembered"... remembered for both a back-reference later in the pattern or in a replacement pattern, but also it can be returned from a find operation too. Here's an example.

numbers = "one two three four five six seven eight nine ten";
regex = "\b\w([aeiou]{2})\w\b";

matches = reFind(regex, numbers, 1, true);
match = mid(numbers, matches.pos[1], matches.len[1]);
subexpression = mid(numbers, matches.pos[2], matches.len[2]);

writeDump(variables);

(That pattern matches a four letter word with two vowels in the middle)

struct
MATCH: four
MATCHES: struct
  LEN: array
    1: 4
    2: 2
  POS: array
    1: 15
    2: 16
NUMBERS: one two three four five six seven eight nine ten
REGEX: \b\w([aeiou]{2})\w\b
SUBEXPRESSION: ou

See that the arrays now have two elements:
  • the first element is the overall match;
  • the second element is the match of the sub-expression.
There will be as many elements in the array as there are sub-expressions (plus the first one which is the entire match).

One thing to bear in mind with the "there will be as many elements in the array as there are sub-expressions" is that there will be an element in the array even if the sub-expression could possibly not be matched. Here's an example:

skuPattern = "^(\d{5})([A-Z]{2})?$";

skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);
writeDump(matches);

writeOutput("<hr />");

skuVariant = "12345AB";
matches = reFind(skuPattern, skuVariant, 1, true);
writeDump(matches);

Here an SKU can take one of two patterns: nnnnn or nnnnnxx where n is a digit and x is a letter. The pattern matches both of those, and in the case of the nnnnnxx version, it also grabs the xx as a sub-expression. We can see this in the results:

struct
LEN: array
  1: 5
  2: 5
  3: 0
POS: array
  1: 1
  2: 1
  3: 0

struct
LEN: array
  1: 7
  2: 5
  3: 2
POS: array
  1: 1
  2: 1
  3: 6

Note how for the "basic" SKU, there's still an array element for the xx part of the pattern, even though it wasn't matched. So bear in mind to check that the pos value is >0 before trying to extract a substring for the sub-expression. Otherwise you'll get this (which will become a very familiar error for you):

The 2 parameter of the Mid function, which is now 0, must be a non-negative integer

The error occurred in D:\websites\www.scribble.local\junk\junk.cfm: line 8
6 : writeDump(matches);
7 : 
8 : variant = mid(skuBasic, matches.pos[3], matches.len[3]);

(nice pidgin English from Adobe there with the error message btw: "the 2 parameter"??)
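A minimal sketch of that guard, reusing the SKU pattern from above (the variant variable and its empty-string default are just my choice for the example):

```cfml
skuPattern = "^(\d{5})([A-Z]{2})?$";
skuBasic = "12345";
matches = reFind(skuPattern, skuBasic, 1, true);

// element 3 is the optional ([A-Z]{2}) sub-expression: only extract it if it matched
variant = "";
if (matches.pos[3] > 0){
    variant = mid(skuBasic, matches.pos[3], matches.len[3]);
}
writeOutput("Variant: [#variant#]");
```

With skuBasic the pos value is zero, so mid() is never called and the error is avoided.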

A variation on this is that a match can be made, but it can be zero length. One needs to be mindful of this too. Consider this example, which shows all four of the possibilities I mentioned:

struct function reFindMatches(required string regex, required string string){
    var result = reFind(regex, string, 1, true);

    result._match = [];
    for (var i=1; i <= arrayLen(result.pos); i++){
        if (result.pos[i] == 0){
            arrayAppend(result._match, "NO MATCH");
        }else if (result.len[i] == 0){
            arrayAppend(result._match, "ZERO-LENGTH MATCH");
        }else{
            arrayAppend(result._match, mid(string, result.pos[i], result.len[i]));
        }
    }
    return result;
}


skuPattern = "^([A-Z]{3})(?:-(?:DEFAULT|(\d*)))?$";

skuVariant = "ABC";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-DEFAULT";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-123";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

writeOutput("<hr />");

skuVariant = "ABC-";
matches = reFindMatches(skuPattern, skuVariant);
writeDump(var=matches, label=skuVariant);

Here I've added a function reFindMatches() that improves reFind() in that as well as returning the pos and len of what was matched, it also returns what actually was matched. Which it ought to do off the bat, as all one would ever want pos / len for was to extract the matched substring. Indeed I have raised an E/R to augment reFind() and deprecate the mostly useless reMatch(): 3321666.

I've also used a pattern that matches four variations of SKU:
  • Just the basic "three letters" version;
  • Three letters with the "default" variant;
  • Three letters with a custom numeric variant;
  • Three letters with a zero-length numeric variant (this one stretches credulity, I know, sorry).
Here's the results:

123 - struct
LEN: array
  1: 0
POS: array
  1: 0
_MATCH: array
  1: NO MATCH

ABC - struct
LEN: array
  1: 3
  2: 3
  3: 0
POS: array
  1: 1
  2: 1
  3: 0
_MATCH: array
  1: ABC
  2: ABC
  3: NO MATCH

ABC-DEFAULT - struct
LEN: array
  1: 11
  2: 3
  3: 0
POS: array
  1: 1
  2: 1
  3: 0
_MATCH: array
  1: ABC-DEFAULT
  2: ABC
  3: NO MATCH

ABC-123 - struct
LEN: array
  1: 7
  2: 3
  3: 3
POS: array
  1: 1
  2: 1
  3: 5
_MATCH: array
  1: ABC-123
  2: ABC
  3: 123

ABC- - struct
LEN: array
  1: 4
  2: 3
  3: 0
POS: array
  1: 1
  2: 1
  3: 5
_MATCH: array
  1: ABC-
  2: ABC
  3: ZERO-LENGTH MATCH

This demonstrates various permutations of matches:
  • Zero pos and zero len in the first element: no match at all. Equivalent to a 0 result from a standard find() operation.
  • Zero pos and len in one of the other elements: no match for that specific sub-expression, but overall there was still a match. This implies the sub-expression was optional.
  • A positive value for both pos and len: there was a substring match.
  • A positive value for pos, but a zero len: there was a match, but it was zero-length.
Hopefully that range of unduly repetitious examples demonstrates the types of matches (or lack thereof) one can expect from reFind().

Starting position

The other argument I glossed over was the "starting position" one - the third argument - there's not much to say about this other than to be aware of it. It works the same as with find(). One thing to note, which also kind of demonstrates why not using the fourth argument as true with reFind() isn't very useful, is that if one wants to cycle through all the matches in a string, the normal approach is to do this sort of thing:

// pseudo code
set findStartPos = 1
while (findOperation matches something){
    process the match
    set findStartPos = after last char of this match
}

This step is not really possible with a lot of patterns (ones that don't necessarily match a fixed length substring), unless one returns sub-expressions. Without the sub-expression data being returned, one has only the start position of the match, but one doesn't know how long it is, so one can't skip past it.
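Here's roughly how that pseudo code might look as real CFML, using the sub-expression data to skip past each match. This is a sketch only: findAllMatches() is a hypothetical helper name, and the way it steps over zero-length matches is my own assumption about what's sensible.

```cfml
// hypothetical helper: returns every non-empty match of regex in text
array function findAllMatches(required string regex, required string text){
    var matches = [];
    var startPos = 1;

    while (startPos <= len(text)){
        var found = reFind(regex, text, startPos, true);
        if (found.pos[1] == 0){
            break; // no further matches
        }
        if (found.len[1] > 0){
            arrayAppend(matches, mid(text, found.pos[1], found.len[1]));
        }
        // move to just after this match; max() stops a zero-length match looping forever
        startPos = found.pos[1] + max(found.len[1], 1);
    }
    return matches;
}

writeDump(findAllMatches("\d+", "a1bb22ccc333")); // should contain 1, 22 and 333
```

Without the len values there'd be no way to compute the next startPos for a pattern like \d+.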

reFindNoCase()

This function is kinda redundant, given one can simply specify (?i) in the regex to make it case-insensitive anyhow.
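A quick illustration of that equivalence (the sample string is made up for the example):

```cfml
s = "xxABCxx";
writeOutput(reFind("abc", s));        // 0: case-sensitive, so no match
writeOutput(reFindNoCase("abc", s));  // 3
writeOutput(reFind("(?i)abc", s));    // 3: the flag makes reFind() case-insensitive
```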



Blimey. I was hoping to cover all the CFML regex functionality in one article, but I'm up to 1800 words on just reFind(). So I'm gonna stop here, before you nod off (if you haven't already), and continue with reReplace() in the next article.

Until the next time...

--
Scheherazade

Thursday, 3 January 2013

Regular expressions in CFML (part 6: syntax - flags and the odds 'n' sods that are left )

G'day:
This is part six of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with:
"Regular expressions in ColdFusion (part 3: syntax - single characters)";
"Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)";
"Regular expressions in ColdFusion (part 5: syntax - look-arounds, and how the engine parses the string it's matching)".

Flags

There are a few different "modes" the regex engine can use when processing a string. These are summarised as follows:

Flag    Meaning
(?x)    This tells the engine to ignore whitespace within the pattern, so one can split it across multiple lines and add comments (for clarity).
(?m)    This specifies the string being matched within should be treated as multi-line, so the ^ and $ anchors can be used to denote the beginning and end of a line. The \A and \Z characters still denote the beginning and end of the entire string.
(?i)    This specifies the pattern should be considered case-insensitive. This is somewhat redundant in CFML as there are separate functions for case-sensitive and case-insensitive operations. This still works though.
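By way of a hedged illustration of each flag in use (the sample strings are invented for the example, and I've only stated a result where I'm sure of it):

```cfml
// (?i): case-insensitive matching without needing reFindNoCase()
writeOutput(reFind("(?i)^hello", "HELLO world")); // 1

// (?m): ^ anchors at the start of each line, not just the start of the string,
// so this should pick up the first word of both lines
lines = "one two" & chr(10) & "three four";
firstWords = reMatch("(?m)^\w+", lines);
writeDump(firstWords);

// (?x): whitespace in the pattern is ignored, so this is intended to behave
// the same as the more cramped \d{3}-\d{4}
writeDump(reFind("(?x) \d{3} - \d{4}", "call 555-1234"));
```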

Wednesday, 26 December 2012

Regular expressions in CFML (part 5: syntax - look-arounds, and how the engine parses the string it's matching)

G'day:
This is part five of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)" and "Regular expressions in ColdFusion (part 4: syntax - repetition, sub-expressions and back-references)".

Please note that this article - more so than the other ones - does not stand-alone. It's more a "part 2" of the preceding article. I advise reading at least that one before reading this one. But, really, one should read the whole lot in order. But bring food and water with you before starting.

Look-arounds

Another type of sub-expression is a look-around. Look-arounds do what the name suggests: they look around to see if there's a match (or there specifically isn't a match). They can look ahead from the current position in the string being processed, or they can look behind. So we have four different sorts of look-around:
  • positive look-ahead;
  • negative look-ahead;
  • positive look-behind;
  • negative look-behind.
ColdFusion's regex engine only supports look-aheads (both positive and negative ones).

Before I explain how these work, let's back up a bit and examine what goes on when a regular expression pattern matching exercise takes place on a given string.

Monday, 24 December 2012

Regular expressions in CFML (part 4: syntax - repetition, sub-expressions and back-references)

G'day:
This is part four of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)". Then I moved onto syntax with "Regular expressions in ColdFusion (part 3: syntax - single characters)".

Repetition (again)

In a previous article I listed the types of repetition one can express in a regex:
  • Zero times
  • Zero or one times
  • Zero or more times
  • One time
  • One or more times
  • An exact number of times
  • Between a specified minimum and maximum number of times
  • At least a minimum number of times
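As a rough sketch of how most of those map onto the usual quantifier syntax (the "ab...c" strings are just invented test data; each call returns 1 for a match at the start of the string, or 0 for no match):

```cfml
writeOutput(reFind("^ab?c$", "ac"));         // 1: ? is zero or one times
writeOutput(reFind("^ab*c$", "abbbc"));      // 1: * is zero or more times
writeOutput(reFind("^ab+c$", "ac"));         // 0: + is one or more times, and there's no b
writeOutput(reFind("^ab{2}c$", "abbc"));     // 1: {2} is exactly twice
writeOutput(reFind("^ab{2,3}c$", "abbbbc")); // 0: {2,3} is between two and three times; this has four
writeOutput(reFind("^ab{2,}c$", "abbbbc"));  // 1: {2,} is at least twice
```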

Saturday, 22 December 2012

Regular expressions in CFML (part 3: syntax - single characters)

G'day:
This is part three of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)", and followed with a discussion entitled "Regular expressions in ColdFusion (part 2: concepts)".

Syntax

OK, so I'm now going to try to describe how all those concepts are actually reflected in regex syntax. This is where I am going to dispense with most of the keys on the keyboard, and just use the punctuation keys ;-)

(not really)

I'll go through each of those concepts in turn.

Thursday, 20 December 2012

Regular expressions in CFML (part 2: concepts)

G'day:
This is part two of the series I started with the introduction article: "Regular Expressions in ColdFusion (part 1: overview)". Initially I set out to only write one article, but by the time it was over 8000 words long, I figured I should split it up and serialise it otherwise no-one would be brave enough to read the whole thing.  If you see references in the text to "see above" or "see below", it might refer to something in one of the other articles. I'll try to find them all and replace with links, but no-doubt I'll miss some.

Concepts / components of a regular expression

A regular expression is built out of all those seemingly random sequences of punctuation characters. Taken en masse, that's quite impenetrable, and it's good to take a step back to consider the various notions / components / building blocks that are important to regexes. This is just a narrative / conceptual discussion, rather than delving into syntax, just for the moment. Knowing about the concepts is more important than the minutiae of the syntax, and once we have a handle on the concepts, the syntactical vagaries will make more sense, and the expressions themselves will seem less daunting. In theory ;-)

Tuesday, 18 December 2012

Regular expressions in CFML (part 1: overview)

G'day:

Before I start
This started off being a single article, but it ended up way too long. And by the time I've come to be divvying it up, I'm thoroughly fed-up with it, so I'm not going to do the usual book-ending of each "sub-article", I'm just gonna fairly arbitrarily cut the thing into sections, and post each section as a new article. So it's gonna be best to start at the beginning and work through to the end, as the subsequent articles might not be fully contextualised in and of themselves. I'm also gonna serialise the thing over a few days, rather than releasing them all at once. Anyway... hold on to yer hat... regular expressions...

I've been mulling-over writing this article since I first started this blog. On one hand I'm fairly good with regular expressions, and a lot of ColdFusion developers (most of the ones I have encountered, anyhow) are not. So there's potential for a teaching exercise there. On the other hand Ben Nadel has banged-on so much about regexes in the past one might think there's little else left to say on the matter.