Sunday 24 February 2013

Email address validation (#1 in a series subtitled "Fool's Errands")

G'day (from sunny New Zealand):
The sunshine has been too hard to resist, so I've only been looking at my computer superficially over the last coupla weeks. That and typing anything of any length on this netbook is a bit of a pain in the butt.  However I have an hour or so downtime before descending on my sister (who I've not seen in three years) and reacquainting myself with her brood - including a new niece I have not yet met - so I'll rattle something off quickly about one of my pet annoyances.

Email address validation.

For goodness sake: don't do it! You'll probably get it wrong. Most websites I encounter do, so I don't see why you'd be any different.
OK: blanket advice to not validate email addresses was more intended as attention-grabbing than solid advice. I guess what I mean is make sure you're actually doing the job properly, rather than just using some random half-arsed regex validation which kinda works for obvious cases, but completely fails on "edge" cases (some not so edgy). I see this an awful lot. I think it's better to simply not bother to do it if you're not going to do it properly.

This is a bold statement to make, I know. However, from experience of being denied access to websites due to my idiosyncratic email address (I use plus-addressing), I can say that perhaps 75% of websites don't do it properly. The upshot when I'm involved is that the website loses a sale: if they tell me my email address is invalid when the real problem is their shonky validation, I'm not going to use them.

The motivation for writing this comes from reading a thread on CFTALK in which people are advocating half-arsed regex solutions.

People's approaches to email validation seem to be rather random. A lot of people seem to crease their brow and guess - based on their own experience - what constitutes a valid email address, and write some sort of regex to validate it. This seems to be the approach of both Adobe in ColdFusion and Railo too, because they both make a pig's ear of it (I'd say Railo was simply copying CF's errors, but they get different things wrong, so it seems to be a case of neither doing the job properly).

Here's some code that demonstrates some edge-case email addresses, all of which are completely legit:

// basic "special" chars
chars = ["!", "##", "$", "%", "&", "'", "*", "+", "-", "/", "=", "?", "^", "_", "`", "{", "|", "}", "~"];

for (char in chars){
    address = "";
    writeOutput("#address#: #isValid('email', address)#<br />");
}

// quoteable special chars
chars = ["(", ")", ":", ";", "<", ">", "@", "[", "]", ","];

for (char in chars){
    address = '"adam#char#cameron"';
    writeOutput("#htmlEditFormat(address)#: #isValid('email', address)#<br />");
}

// escapable special chars
chars = ['"', '\'];

for (char in chars){
    address = '"adam\#char#cameron"';
    writeOutput("#address#: #isValid('email', address)#<br />");
}

On ColdFusion (CF10, fully patched), I get this:

adam! NO NO
adam$ NO NO
adam& NO
adam' YES
adam* NO YES YES
adam/ NO NO
adam? NO
adam^ NO YES
adam` NO
adam{ NO
adam| NO
adam} NO YES
"adam(cameron" NO
"adam)cameron" NO
"adam:cameron" NO
"adam;cameron" NO
"adam<cameron" NO
"adam>cameron" NO
"adam@cameron" NO
"adam[cameron" NO
"adam]cameron" NO
"adam,cameron" NO
"adam\"cameron" NO
"adam\\cameron" NO

All of those should be "YES". I've indicated where Railo differs slightly from CF. So that's not great. I've raised a bug for this (some time ago): 3231157. It's marked as "to fix", which is something.

Anyway, that was a digression / example of where this sort of validation is done wrong. On the whole, hand-rolled regex solutions fare slightly worse than CF does, in my experience.

The point here is that how this is supposed to work isn't something to simply guess at, and one doesn't even need to guess, because it's all documented. For starters there's a good description on Wikipedia in the Email Address article, but there's also a full RFC (indeed more than one) to reference (links in the Wikipedia article).
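
To give a taste of what the documentation actually says, here's a minimal sketch of just one rule: the unquoted local part (the bit before the @), which RFC 5322 calls a "dot-atom". It's in Python rather than CFML purely for illustration. The function name is_dot_atom_local_part is my own invention, and it deliberately ignores quoted local parts and the domain, so it's nowhere near a complete validator.

```python
import re

# "atext" from RFC 5322 section 3.2.3: the characters allowed, unquoted,
# in the local part of an address.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"

# A dot-atom is one or more runs of atext separated by single dots,
# so no leading, trailing or doubled dots.
DOT_ATOM = re.compile(rf"{ATEXT}(?:\.{ATEXT})*")


def is_dot_atom_local_part(local_part: str) -> bool:
    """Return True if local_part is a valid *unquoted* local part.

    Hypothetical helper: quoted local parts (such as "adam(cameron")
    and the domain are deliberately not handled here.
    """
    return DOT_ATOM.fullmatch(local_part) is not None


if __name__ == "__main__":
    # The same "basic special chars" as in the CFML test above.
    for char in "!#$%&'*+-/=?^_`{|}~":
        local_part = f"adam{char}cameron"
        print(local_part, is_dot_atom_local_part(local_part))
```

Every one of the basic "special" chars from the test code above passes this check, whereas a leading or doubled dot does not. Even this one small rule allows more than most guessed-at regexes do.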

Two things have occurred to me as a result of looking into email address validation, and how often people mess it up:
  1. The validation rules are reasonably complex, so to do them correctly is not just a matter of knocking a quick regex together.
  2. All the regexes in the world still won't prove that I've got my email address correct. I could still give you this: That's syntactically correct, but I've got a typo in it.
To deal with the second point, forms often ask people to enter their address twice. This is fine for people who don't know about copying and pasting, but most (?) people are just going to copy and paste what they initially typed in, typos 'n' all.

About the only safe way of validating that the person has given you a valid email address is to send them an email as part of the sign-up process, and rely on them responding to it to verify they've given you a legit & correct email address.
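
That round trip might look something like the following. It's a rough Python sketch, not production code: the names start_verification, confirm and pending are made up for the example, the dict stands in for a real database table, the one-day expiry is an arbitrary assumption, and actually sending the email is left out entirely.

```python
import secrets
import time

# Assumption for the example: confirmation links expire after a day.
TOKEN_LIFETIME_SECONDS = 24 * 60 * 60

# Stand-in for a database table: token -> (address, expiry timestamp).
pending = {}


def start_verification(address, now=None):
    """Record a pending address and return the token to put in the emailed link."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)  # unguessable, URL-safe random string
    pending[token] = (address, now + TOKEN_LIFETIME_SECONDS)
    return token


def confirm(token, now=None):
    """Return the verified address, or None if the token is unknown or expired.

    A token is removed on first use, so it cannot be replayed.
    """
    now = time.time() if now is None else now
    entry = pending.pop(token, None)
    if entry is None:
        return None
    address, expires_at = entry
    return address if now <= expires_at else None
```

The sign-up handler would call start_verification() and email a link containing the token. Only when that link is followed, and confirm() hands back the address, do you actually know the address works and belongs to the person who typed it in.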

Whatever you do, don't bar a user from progressing on the form based on a validation failure of their email address. It's as likely a problem with your validation as it is with them getting it wrong. What I recommend is that if they fail client-side validation, flag it up as a warning on the form, bringing it to their attention.  But if they're sure they've got it right, then let them proceed.
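
In code terms, "warn, don't block" could be as simple as the sketch below (again Python, again with hypothetical names: EmailCheck, check_email_field and the user_confirmed flag are mine, and looks_valid is whatever syntax check you happen to be using). A failed check produces a warning to display, and the user ticking an "I'm sure this is right" box overrides it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EmailCheck:
    can_proceed: bool
    warning: Optional[str] = None


def check_email_field(
    address: str,
    looks_valid: Callable[[str], bool],
    user_confirmed: bool = False,
) -> EmailCheck:
    """Treat a failed syntax check as a warning the user can override."""
    if looks_valid(address):
        return EmailCheck(can_proceed=True)
    if user_confirmed:
        # Our validation may well be the thing that's wrong, so trust the user.
        return EmailCheck(can_proceed=True)
    return EmailCheck(
        can_proceed=False,
        warning="That email address looks unusual. "
                "If you're sure it's right, confirm it and carry on.",
    )
```

The important design choice is that there's no path on which an address you don't like is rejected outright: the worst case is one extra click for the user.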

I didn't quite get this finished yesterday before heading over to see my sister and tribe. That went down well, but my nephews (8, almost 5, and almost 3) are exhausting. And my niece - 5 months - has a very good set of lungs on her that she was exercising comprehensively (not crying, just clucking and squawking away to herself). Plus a crazy dog to throw tennis balls to, and wrestle with whilst trying to chat and drink my wine. Today - indeed in a few min - I am off up Rangitoto, which is the volcanic island in the middle of Auckland Harbour. I've got one more day left of NZ summer, then back to the miserable London winter on Tuesday evening :-(