September 23, 2015 at 9:24 pm
Comments posted to this topic are about the item A Pattern for Email Address storage and retrieval
September 24, 2015 at 1:54 am
Nice exercise but you still have an unique index on EmailAddress so why not instead of the two indexes, make one unique over EmailCheckSum and EmailAddress and miss all the burden with the temp table - selecting directly rows matching them both...
September 24, 2015 at 2:45 am
I think because it would need to compare both - the Checksum and the email if you would add both to an index / where condition, even if the checksum is unique
And comparing a nvarchar(255) to another is slow (compared to comparing the checksum-INT).
On the other hand it depends on the real szenario if the overhead (creating a table var, inserting into it, querying it) is faster.
If the example would not be about email adresses but about files (e.g. attachments as varbinary(max)). In this case the compare of the CHECKSUM (or better HASHBYTES('SH1', file) + the filesize (for reducing the amount wrong positives)) would be faster than comparing some 20 MB files direct
September 24, 2015 at 5:58 am
One observation... Email addresses are case insensitive in practice and some email systems even allow the dot to be ignored in matching the local part! A user may use different versions of the same email address. ABCname@Gmail.com is the same as abcname@gmail.com or AB.Cname@GMAIL.com. If your example uses the upper function and removing the dot in the local part (to the left of the @ sign) for both storing/comparing the email address and the calculation of the checksum it would be improved. Reference https://en.wikipedia.org/wiki/Email_address
September 24, 2015 at 6:07 am
Is there any particular reason for storing the email address as NVARCHAR?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 24, 2015 at 6:14 am
I'd be curious to see how this would work to search and return a range of similar addresses, say from a given domain such as all the email addresses from @mydomain.com.
____________
Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.
September 24, 2015 at 6:32 am
Did you try this single-step process?
SELECT EmailAddressID , EmailAddress
FROM dbo.EmailAddress
WHERE EmailCheckSum = CHECKSUM('TheMostInterestingMan@dosequis.com')
AND EmailAddress = 'TheMostInterestingMan@dosequis.com'
Although there are no guarantees of the order of evaluation of conditionals, the fact that EmailCheckSum is the index key and EmailAddress is included in the index should lead the optimizer to an index seek, and allow an efficient exact search without employing a table variable.
September 24, 2015 at 6:35 am
First off our data is fairly well normalized so we don't have the same data strung out over multiple tables, with the exception of keys of course. Our EmailAddress table has an unique alternate index on the EmailAddress field. It seems efficient and not caused us any grief time-wise in managing the data. Your proposal may be faster but I would think for us it is just an interesting exercise. I certainly wouldn't propose our staff spend any time messing with this when we have so many other pressing issues. Thank you anyway. It was an interesting read nonetheless.
September 24, 2015 at 6:41 am
I apologize for asking ahead of time, but why the CHECKSUM()? yes I understand the unique number, most of the time, but why not a unique index on the actual email address? maybe a GUID if you need something unique? thanks.
September 24, 2015 at 6:49 am
Good point re UPPER()
Also, why stop at one CHECKSUM() when you could use two -- one for the local part and one for the domain -- and a bigint. This would dramatically lower the collisions. And lose the '@' because all email addresses contain it so it adds nothing.
select convert(bigint, checksum(upper(substring('John.Smith@example.com', 1, charindex('@', 'John.Smith@example.com')-1))))
* power(convert(bigint, 2),32) | checksum(upper(substring('John.Smith@example.com', charindex('@', 'John.Smith@example.com')+1, 255)))
September 24, 2015 at 6:50 am
It is not hard to create email addresses where Checksums overlap.
if checksum('ks@gmail.com') = checksum('nc@gmail.com')
print 'Checksum function is weak'
If you really want to do something like this, I'd recommend hashbytes:
if convert(bigint,hashbytes('md5','ks@gmail.com')) <> convert(bigint,hashbytes('md5','nc@gmail.com'))
print 'Hashbytes creates a better number space'
But really, I think the overhead that is introduced in this process is counter productive. Why not just index the column? SQL server will do the hashing for you in a way that matches your collation. SQL automatically computes, stores & maintains the hashes for you in a b-tree when you make an index on a varchar column.
September 24, 2015 at 6:59 am
Are we actually sure NVarchar(255) searches are inefficient? Since string comparisons are left to right the first 5 (certainly the first 10) characters would almost certainly eliminate the vast majority of the search space.
Given SQL Server is all about efficiency I'd be interested in seeing a performance breakdown between the Checksum() and index seek versions.
Although I absolutely agree email addresses should be their own entity.
September 24, 2015 at 7:21 am
Guys,
Just a quick note about CHECKSUM. If you are planning on using CHECKSUM in just saving the data and retrieving it, then I think you should be good.
I’ve burnt my hands in using CHECKSUM for the purpose of comparison and expecting it to return a unique value every time for every string and ran into two strings generating the exact same value for CHECKSUM. This was back in 2010 and we were using SQL Server 2008, so I thought this might've been fixed in SQL Server 2012. Looks like apparently not..
I ran the CHECKSUM function on SQL Server 2012 database for those two string values and sure enough, it still returns the same results:
Select CHECKSUM('SQL CAL 6.5 WinNT All Lng MVL ISV Royalty Embed')
Select CHECKSUM('SQL CAL Runtime 6.5 WinNT All Lng MVL ISV Royalty Embed Runtime')
September 24, 2015 at 7:26 am
roger.plowman (9/24/2015)
Given SQL Server is all about efficiency I'd be interested in seeing a performance breakdown between the Checksum() and index seek versions.
I managed a database with > 10 Million email addresses for a while. We did have the unique emails pulled out into a separate table which is probably a good practice in most relational data use cases.
The original developer had built a hashing process with an indexed computed column on the table, thinking that the numeric output had to perform better. In our usage case, it did not. It just used up extra resources, since we had to index the varchar representation of the email address anyway and I felt that it made the database relationships overly complicated.
Researching how the performance of the bigint hash could be worse than the varchar column, I recall discovering that:
1) the varchar index has some order, that can help some "like" queries
2) the varchar indexes contain collation specific hashes that are more efficient than a checksum or binary hash.
September 24, 2015 at 7:46 am
triman165910 (9/24/2015)
I apologize for asking ahead of time, but why the CHECKSUM()? yes I understand the unique number, most of the time, but why not a unique index on the actual email address? maybe a GUID if you need something unique? thanks.
Your two options give up the advantages of the CHECKSUM()- faster seek time on searches. The INT comparison is much faster than comparing text. Using the e-mail address as the unique index forces text comparison, while the GUID provides no value as a search parameter.
Edit - I should add that my statement above is regarding the intent of the original article. Doug's experience seems to demonstrate that in practice, this might not be the optimal solution.
Viewing 15 posts - 1 through 15 (of 37 total)
You must be logged in to reply to this topic. Login to reply