August 28, 2014 at 8:19 am
TomThomson (8/28/2014)
Gazareth (8/28/2014)
Alumni of OUCL are you Tom? Good stuff 🙂Tony's PRG used to have what were called "research officers" which essentially were representatives of industrial concerns that they were doing research collaborations with, and I was one of them for a few years in teh 80s/90s. Of course I was a "senior member of the University" as well, having been given my MA decades before, which may have made it easier to fit in; and I had done a couple of one week doses of bring-up-to-date training at PRG a while before, which meant I spoke the local language too. But I was never a genuine member of PRG.
I'll bet that was an interesting time.
August 28, 2014 at 8:43 am
GilaMonster (8/27/2014)
sqldriver (8/27/2014)
GilaMonster (8/27/2014)
sqldriver (8/27/2014)
I used to use this. Not perfect by any stretch but got most of the junk out.Does it flag this as invalid?
http://gmailblog.blogspot.com/2008/03/2-hidden-ways-to-get-more-from-your.html
It does.
Which is why I point people at that linked blog post on this topic, as what is allowed is much broader than most regex statements allow. That, and the fact that I get tired of having valid emails rejected. I still sometimes get told my .za.net email is invalid (since za is usually a top-level domain)
Wouldn't it be nice if people who invent email address validators would RTFM - or rather, RTRFC because the "manual" in this case is the relevant sections of RFC 5322 as amended by RFC 6854 and it contains very few restrictions on the format of the part before the @ sign. It was RFC 822 last time I wrote any emailer code so I wouldn't trust myself to write a validator now.
Tom
August 28, 2014 at 10:02 am
It's tough, as Troy Hunt found out.
http://www.troyhunt.com/2013/11/dont-trust-net-web-forms-email-regex.html
Honestly, it's good to try to validate, but don't discard data that doesn't meet the standard you set. It's highly likely your validation is broken. Even if you say "it's all North America", people move all the time and you might end up with foreign items over time. Or your company expands.
Throw suspected items into a queue and have someone review them. Contact the person. If you do this in a front end form, give the user a way to bypass validation and give their email with a captcha or some other type of "human" validation that their email is correct.
August 28, 2014 at 10:16 am
Steve Jones - SSC Editor (8/28/2014)
It's tough, as Troy Hunt found out.http://www.troyhunt.com/2013/11/dont-trust-net-web-forms-email-regex.html
Honestly, it's good to try to validate, but don't discard data that doesn't meet the standard you set. It's highly likely your validation is broken. Even if you say "it's all North America", people move all the time and you might end up with foreign items over time. Or your company expands.
Throw suspected items into a queue and have someone review them. Contact the person. If you do this in a front end form, give the user a way to bypass validation and give their email with a captcha or some other type of "human" validation that their email is correct.
This seems sort of directed at me.
The email addresses I was processing were not from a web form, they were provided to us either by clients, or by data marts. Nothing was deleted, it was just given a 'status flag' that meant the email inspection found something wonky. A certain percentage of the contact information was expected to be inaccurate or incorrect. So if the project was running a long time without a high response rate, and the PM didn't want to buy more sample, the remainder email addresses could be released.
I had to do a lot of similar stuff with phone number cleaning/checking and formatting so the dialer could make the calls correctly.
Market research is a bummer.
*shrug*
September 2, 2014 at 9:56 am
GilaMonster (8/27/2014)
sqldriver (8/27/2014)
GilaMonster (8/27/2014)
sqldriver (8/27/2014)
I used to use this. Not perfect by any stretch but got most of the junk out.Does it flag this as invalid?
http://gmailblog.blogspot.com/2008/03/2-hidden-ways-to-get-more-from-your.html
It does.
Which is why I point people at that linked blog post on this topic, as what is allowed is much broader than most regex statements allow.
Yes, my favorite reference is Mail::RFC822::Address: regexp-based address validation
Which has a sample Perl RegEx for RFC822 validation, NOT including the arbitrarily nested comments allowed by the RFC.
(?:(?:\r)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:
\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[
\t]))*"(?:(?:\r)?[ \t])*))*@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)
?[ \t])*)*\<(?:(?:\r)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[
\t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t]
)*))*(?:,@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[
\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*
)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*)
*:(?:(?:\r)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r
)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t
]))*"(?:(?:\r)?[ \t])*))*@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r)?[ \t])*))*\>(?:(?:\r)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?
[ \t]))*"(?:(?:\r)?[ \t])*)*:(?:(?:\r)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*))*@(?:(?:\r)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)*\<(?:(?:\r)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*(?:,@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\]
\000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r)?[ \t])*))*)*:(?:(?:\r)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*))*@(?:(?:\r)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*\>(?:(?:\r)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)(?:\.(?:(
?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t
])*))*@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?
:\.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)*\<(?:(?:\r)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*(?:,@(?:(?:\r)?[
\t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:\.(?:(?:\r)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*)*:(?:(?:\r)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])*)(?:\.(?:(?:
\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r)?[ \t]))*"(?:(?:\r)?[ \t])
*))*@(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*)(?:.(?:(?:\r)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r)?[ \t])*))*\>(?:(
?:\r)?[ \t])*))*)?;\s*)
P.S. I tried code="plain" but the sad faces, well, they're not stopped by that.
September 2, 2014 at 10:32 am
Nadrek (9/2/2014)
but the sad faces, well, they're not stopped by that.
They're not stopped by that. They cannot be stopped by anything. They're the imprisoned souls needed to power this most unholy of regexes, staring back at you from the abyss. They are coming.
September 2, 2014 at 11:04 am
sqldriver (8/28/2014)
The email addresses I was processing were not from a web form, they were provided to us either by clients, or by data marts. Nothing was deleted, it was just given a 'status flag' that meant the email inspection found something wonky. A certain percentage of the contact information was expected to be inaccurate or incorrect. So if the project was running a long time without a high response rate, and the PM didn't want to buy more sample, the remainder email addresses could be released.I had to do a lot of similar stuff with phone number cleaning/checking and formatting so the dialer could make the calls correctly.
Market research is a bummer.
Hey, you know it's your fault that the invalid email addresses and phone numbers don't work. Because they're invalid, that shouldn't impact the response rate, should it? Your situation is certainly not unique. Guess who else works in market research? 😀
September 2, 2014 at 11:07 am
Nadrek (9/2/2014)
Which has a sample Perl RegEx for RFC822 validation, NOT including the arbitrarily nested comments allowed by the RFC.
Heaven help you if you have to debug that thing. That to me does not look like a viable piece of code I would put anywhere near my system. That whole being able to support my code thing comes to mind. 😛
_______________________________________________________________
Need help? Help us help you.
Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.
Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.
Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/
September 2, 2014 at 11:10 am
Sean Lange (9/2/2014)
Nadrek (9/2/2014)
Which has a sample Perl RegEx for RFC822 validation, NOT including the arbitrarily nested comments allowed by the RFC.
That to me does not look like a viable piece of code I would put anywhere near my system.
+1000
September 2, 2014 at 11:27 am
Ed Wagner (9/2/2014)
sqldriver (8/28/2014)
The email addresses I was processing were not from a web form, they were provided to us either by clients, or by data marts. Nothing was deleted, it was just given a 'status flag' that meant the email inspection found something wonky. A certain percentage of the contact information was expected to be inaccurate or incorrect. So if the project was running a long time without a high response rate, and the PM didn't want to buy more sample, the remainder email addresses could be released.I had to do a lot of similar stuff with phone number cleaning/checking and formatting so the dialer could make the calls correctly.
Market research is a bummer.
Hey, you know it's your fault that the invalid email addresses and phone numbers don't work. Because they're invalid, that shouldn't impact the response rate, should it? Your situation is certainly not unique. Guess who else works in market research? 😀
What do you mean you don't know if these Thai phone numbers are cell phones? We can't have cell phones on the predictive dialer! CASRO will exile us!
September 2, 2014 at 11:37 am
sqldriver (9/2/2014)
Ed Wagner (9/2/2014)
sqldriver (8/28/2014)
The email addresses I was processing were not from a web form, they were provided to us either by clients, or by data marts. Nothing was deleted, it was just given a 'status flag' that meant the email inspection found something wonky. A certain percentage of the contact information was expected to be inaccurate or incorrect. So if the project was running a long time without a high response rate, and the PM didn't want to buy more sample, the remainder email addresses could be released.I had to do a lot of similar stuff with phone number cleaning/checking and formatting so the dialer could make the calls correctly.
Market research is a bummer.
Hey, you know it's your fault that the invalid email addresses and phone numbers don't work. Because they're invalid, that shouldn't impact the response rate, should it? Your situation is certainly not unique. Guess who else works in market research? 😀
What do you mean you don't know if these Thai phone numbers are cell phones? We can't have cell phones on the predictive dialer! CASRO will exile us!
Yes they will, unless you have permission to use cell phone numbers. I think that predictive dialers are falling out of favor anyway industry-wide, given all the restrictions. There are many little quirks with market research data. And watch where (geographically) that data is stored if you any business with European respondents.
I'm in the process of writing a decent email validation, but I'm a ways away from having anything yet. I will not be using the regular expression solution, as we already have one in place and it's a nightmare. I see phase 1 in my head, so it's just a matter of getting the time to write it, get it working and making it fast. Ideally for phase 2, I'd like to "call up" the email server, ask if the account lives there and store the server's response. I don't see that part yet, but one thing at a time.
September 2, 2014 at 11:41 am
Sean Lange (9/2/2014)
Nadrek (9/2/2014)
Which has a sample Perl RegEx for RFC822 validation, NOT including the arbitrarily nested comments allowed by the RFC.
Heaven help you if you have to debug that thing. That to me does not look like a viable piece of code I would put anywhere near my system. That whole being able to support my code thing comes to mind. 😛
Judging by RFC822, no accurate 'validate email' code is going to be viable or supportable except by a domain expert :w00t:.
September 2, 2014 at 5:41 pm
Nadrek (9/2/2014)
Judging by RFC822, no accurate 'validate email' code is going to be viable or supportable except by a domain expert :w00t:.
And if you go by RFC 822 you are going to have problems anyway, because it's out of date and no longer a valid description of the internet email address rules. The new RFC (RFC 2822) which superceded RFC 822 has itself been obsoleted by an even newer rfc (RFC 5322) and even RFC 5322 is partly obsoleted by RFC 6854 which replaces part of RFC 5322.
But being a domain expert is irrelevant anyway, because "valid" means "conforming to the internet rules for email addresses" rather than "being an actual email address" or "being an email address potentially allowed by the xyz.com receiving domain".
Tom
September 24, 2014 at 11:28 am
Noticed this today:
http://code.iamcal.com/php/rfc822/full_regexp.txt
If you let your vision go blurry and scroll up and down the page it kind of looks like relaxing ASCII waves on on a happy little ASCII beach.
Viewing 14 posts - 16 through 28 (of 28 total)
You must be logged in to reply to this topic. Login to reply