Monday 1 December 2014

Hammers and screws (and when not to use a list in CFML)

G'day:
I hope Salted (/Ross) and Ryan don't mind me pinching our IRC conversation for a blog article, but... well... too bad if they do. It's done now. This is it.

Ross was asking "whats the most efficient way to remove the first two items in a list and get back the remainder", which made my eyes narrow and decide there's probably something to get to the bottom of here. Because he mentioned "lists". And these should - in general - be avoided in CFML.



After some back and forth with Ryan, Ross settled on:

listtoarray(foo," "), arrayslice(bar,3), arraytolist(bar," ")


(pseudo code, obviously).

Now... I think lists are overused in CFML, and - indeed - they are a bit rubbish. Having a string of data and then using one character (OK, or "some characters ~") from the string as a delimiter is seldom going to work, especially with the default delimiter of comma. Because in really a lot of situations, commas are actually part of the data. And same with any other char-based delimiter (except say characters specifically intended to be delimiters like characters 28-31 (file, group record and unit separators, respectively). I really don't think lists should generally be part of a solution to anything.

With this in mind, I decided to check what was going on. This is copy-and-pasted from the IRC log (spelling and formatting addressed, otherwise it's unabridged):

16:36 <adam_cameron> Salted: first question that should be asked... why is it a list in the first place? Do you have any control over it?
16:36 <Salted> no, it's returned from a postcode lookup service
16:36 <Salted> I'm fairly certain that the address string will always have the postcode first so [p1] [pt2] [line1]
16:37 <Salted> it's not a list its a string but I'm treating it as a space delimited list for these purposes...
16:37 <adam_cameron> so it's a string? Not a list per se?
16:37 <adam_cameron> ok
16:37 <Salted> indeed
16:37 <adam_cameron> So what you actually want is everything after the second space?
16:37 <Salted> essentially yes
16:37 <adam_cameron> Not a list with the first two elements dropped off
16:37 <Salted> in this instance what you've mentioned is one and the same, if I treat it as a list
16:38 <adam_cameron> well literally, yes. However once you start treating it as a list, you're thinking about it wrong
16:39 <Salted> I don't see how thats different from thinking of strings as char arrays, for example
16:40 <Salted> "my name is ross" is both a char array of 14 elements and a list of 4 elements
16:40 <adam_cameron> because you start trying to use list functions on it
16:40 <adam_cameron> Which makes your code shit


My point here - one that Ross doesn't entirely agree with - is that just because one can use a string as a list, doesn't mean - semantically - it is a list. And one should try to keep one's code as semantically clear as possible.

What Ross has here is not a space-separated list. Later in the conversation I extracted form him what the actual data was like, and this was a sample:

X1 2YZ Apartment 1903, 10 Random Street Suburbia, OURTOWN


What we have here is... a mess. It can be assessed in a number of ways:

  • a post code part followed by an address part
  • a post code followed by a street address followed by a city
  • a post code followed by the street part of the address, the suburb part of the address, and the city part of the address
But what it isn't is a space- (or comma-, or anything-) separated list. So one should not solve this problem using lists.

One should look at the actual requirement, and solve that.



The requirement here is that we have a string which has some address stuff in it, including a leading post code, and we want to get rid of the post code. The post code is the first two bits (where the bits are separated by a space). So this would work:

whatYouWant = reReplace(whatYouHave, "^(?:\S+\s){2}", "", "ONE");


Ryan contended that is not as good an answer:

16:42 <RyanGuill> why do you think regex is better for this case?
16:43 <adam_cameron> because it's a string operation
16:43 <adam_cameron> Use list functions if the thing you have is actually intended to be a list
16:43 <RyanGuill> so for speed? for readability? maintainability?
16:43 <adam_cameron> code clarity
16:44 <RyanGuill> I have to disagree that that regex is clearer, but to each their own


I think Ryan is suggesting that list and array operations are easier to understand than regexes, which might be the case. However one doesn't need to understand the actual regex to see what's going on there. And it can be made more clear again:

textToRemove = "^(?:\S+\s){2}";
whatYouWant = reReplace(whatYouHave, textToRemove, "", "ONE");

Or we can think some more, and this is not an exercise of getting rid of the post code, it's an exercise in just getting the address:

bothBits = "^(?:\S+\s){2}(.*)";
textToKeep = "\1";
whatYouWant = reReplace(whatYouHave, bothBits, textToKeep, "ONE");

(this is a variation of what I mentioned in the conversation, but made more obtuseclear).

I'm actually getting ahead of myself here. To this point we'd not seen an example of the data, and Ross now offered this:



Looking at that, I think the solution is actually this:

justTheAddress = trim(replaceNoCase(list, postcode, "", "ONE"));

I can't help but think going back to the original vendor and going "WTF are you doing? You've duplicated the post code there. Get rid!". I can't see how that "List" value would be useful to anyone. But anyway.

So there's plenty of more appropriate solutions which don't involve driving-in screws with a hammer.

The conversation with Ross and Ryan trailed off there, but I resumed it with my colleague Brian at the pub later on. Yes, this is the sort of thing we talk about at the pub on a Friday night. He made perhaps the most relevant observation about all this (paraphrase):

"the obscurity of regex syntax would be fine, because it would be in its own function anyhow! And the function name would describe the process, whilst hiding the implementation"

eg:

function extractAddressFromListField(listField){
    var postCodePattern = "^(?:\S+\s){2}";
    var justTheAddress = reReplace(listField, postCodePattern, "", "ONE");
    return justTheAddress;
}

And in Ross's calling code, he'd just have:

address = extractAddressFromListField(listField);

That is much clearer than the list/array operations and the replacement operation. And is good clean code. The factored-out code here does a distinct (and testable! You did test your planned code, right, Ross?) task here, so it should be removed from where it's currently implemented, and put in its own function. It also means that if one wants to change the implementation, it's a matter of simply swapping out the innards of the function, the tests would still pass, and the calling code doesn't need to be touched.

I'm pleased this conversation came up because it was a good thought exercise, and indeed Brian again managed to do a better job of cleaning code than I managed. So there's something for everyone to take away from this.

Cool.

And that's the end of lunchtime, so I shall now warn Ross I'm making an example of him, and press "send" on this.

--
Adam