Joining on MD5 hash column

  • Has anyone done any large-scale joins based on an MD5 hash? I'm inserting into a table based on a join to several (8) shallow tables on nchar values with an average length of 10. A colleague suggested speeding up my query by replacing the many joins with a single join on an MD5 hash of the concatenated nchar values. This would, of course, also require me to create a corresponding MD5 hash of those values in a tertiary table earlier in the process, but I'm less concerned with that.
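
    To make that concrete, here's roughly what I have in mind (all table and column names below are made up for illustration; CONCAT assumes SQL 2012+, otherwise the string could be built with + and ISNULL):

        -- Compute the hash once in the staging/tertiary table.
        -- The N'|' separator keeps ('AB','C') from hashing the same as ('A','BC').
        UPDATE s
        SET    s.KeyHash = HASHBYTES('MD5',
                   CONCAT(s.Col1, N'|', s.Col2, N'|', s.Col3, N'|', s.Col4,
                          N'|', s.Col5, N'|', s.Col6, N'|', s.Col7, N'|', s.Col8))
        FROM   dbo.Staging AS s;

        -- The big insert then joins on one 16-byte value instead of eight nchar columns.
        INSERT INTO dbo.Target (StagingID, SomeValue)
        SELECT s.StagingID, d.SomeValue
        FROM   dbo.Staging AS s
        JOIN   dbo.Lookup  AS d
               ON d.KeyHash = s.KeyHash;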

    The biggest flaw in my mind is that I'm dealing with 700MM rows and, if I'm not mistaken, MD5 would start running into collision problems every 10MM or so. I could work around this by joining on one more of my original nchar columns as a quasi-parity check, but it would have to be a larger, more unique column like "StreetName" instead of, say, "State".

    So my big question is this: would joining on an MD5 column (about 700MM deep) + 1 shallow nchar(~20) column be intrinsically faster than joining on 8 shallow nchar(~10) columns, over 700MM starting rows?

    If you need more details let me know...just wondering if anyone knows of an immediate, obvious answer.

  • Are you saying that each of your tables has a composite key of 8 nchar fields? I have never tried doing something like your proposed MD5 idea, but it seems like it will be slower in the long run. Each record will now carry an extra 16 bytes of data (which is significant with 700 million rows). Plus, this new hash is not the whole of what you want to join on. With proper indexing I don't think you will find any performance gains; in fact, I think it will ultimately hinder your performance. You will have to compute the hash for every single insert on every single table. YUCK!!! Just my 2¢. There are others on here with more experience at this scale than me, so hopefully some of them will jump in.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean, that's basically what I'd figured but just wanted to throw it out for the wolves to chew on 🙂

  • I note that you never stated WHY you are even considering a refactor. Is your existing query slow? And if so, have you done the following:

    1) wait stats analysis

    2) file IO stall analysis

    3) examined the query plan for opportunities to refactor or improve it (especially using intermediate results to avoid joining all 8 tables at once; a rough sketch of that idea follows this list)

    4) done an indexing strategy session to evaluate options along those lines
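
    That sketch for point 3 (purely illustrative; every object and column name below is made up): materialize part of the join into an indexed temp table first, so the optimizer only has to deal with a few tables at a time instead of all eight at once.

        -- Stage 1: resolve half of the lookups into an indexed temp table.
        SELECT t.RowID, t.Col5, t.Col6, t.Col7, t.Col8,
               a.ValueA, b.ValueB, c.ValueC, d.ValueD
        INTO   #Stage1
        FROM   dbo.BigTable AS t
        JOIN   dbo.LookupA  AS a ON a.Col1 = t.Col1
        JOIN   dbo.LookupB  AS b ON b.Col2 = t.Col2
        JOIN   dbo.LookupC  AS c ON c.Col3 = t.Col3
        JOIN   dbo.LookupD  AS d ON d.Col4 = t.Col4;

        CREATE CLUSTERED INDEX cx_Stage1 ON #Stage1 (Col5, Col6, Col7, Col8);

        -- Stage 2: the intermediate result against the remaining lookups.
        INSERT INTO dbo.Target (RowID, ValueA, ValueB, ValueC, ValueD, ValueE, ValueF, ValueG, ValueH)
        SELECT s.RowID, s.ValueA, s.ValueB, s.ValueC, s.ValueD,
               e.ValueE, f.ValueF, g.ValueG, h.ValueH
        FROM   #Stage1      AS s
        JOIN   dbo.LookupE  AS e ON e.Col5 = s.Col5
        JOIN   dbo.LookupF  AS f ON f.Col6 = s.Col6
        JOIN   dbo.LookupG  AS g ON g.Col7 = s.Col7
        JOIN   dbo.LookupH  AS h ON h.Col8 = s.Col8;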

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Yes, yes, yes, and yes 🙂 The reason I'm considering a refactor is that the step is mind-numbingly slow.

    I've been able to establish that our biggest limitation right now is hardware, and more specifically disk performance. We're still in a POC phase that, unfortunately, was forced into a full-scale loadout without us being provisioned with the proper physical resources.

    What type of plan are you getting?

    Are you filtering any rows, or doing stuff (aggregations?) with all 700M of them?

    Sounds like an interesting problem, although I will say that sometimes you simply do need better hardware. 😀

    Got a table script/query/plans you can share? Any chance it is doing nested loop joins on some of the large rowcounts (perhaps due to poor estimated rowcounts)? That is a disaster that often befalls complex queries. The opposite side of the coin is a scan/hash when a filter could restrict things down to just a few rows.
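
    One quick diagnostic you could try while gathering those (made-up object names, and not something to leave in place): force hash joins and compare the runtime. If the hinted version is dramatically faster, bad row estimates are probably pushing the optimizer into loop joins over huge rowcounts.

        SELECT t.RowID, a.ValueA, b.ValueB
        FROM   dbo.BigTable AS t
        JOIN   dbo.LookupA  AS a ON a.Col1 = t.Col1
        JOIN   dbo.LookupB  AS b ON b.Col2 = t.Col2
        OPTION (HASH JOIN);  -- query-level hint: every join in this query becomes a hash join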

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • So you have an 8-column composite key? Yikes...

    First, I probably wouldn't use MD5; the collision domain is too small for this purpose. How about SHA1? MD5 is basically a varbinary(16) column and SHA1 is varbinary(20). You could take yourself out of the hash-it-yourself game by adding a persisted computed column: when a record is inserted OR changed (if those key fields can be changed) the hash is recalculated automatically, but when it is queried it already has a value, and you could even index it.

    Also, I wouldn't name the field after a specific hash, like KeySHA1Hash; I'd use something more generic like KeyHash.
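
    Something along these lines (table and column names are made up; the CONVERT trims the varbinary(8000) that HASHBYTES returns down to SHA1's 20 bytes so the column indexes comfortably, and CONCAT assumes SQL 2012+):

        ALTER TABLE dbo.BigTable
        ADD KeyHash AS CONVERT(varbinary(20), HASHBYTES('SHA1',
                CONCAT(Col1, N'|', Col2, N'|', Col3, N'|', Col4,
                       N'|', Col5, N'|', Col6, N'|', Col7, N'|', Col8))) PERSISTED;

        -- Because it is persisted it can be indexed and used directly as a join key.
        CREATE NONCLUSTERED INDEX ix_BigTable_KeyHash ON dbo.BigTable (KeyHash);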

    However, I would seriously consider a major redesign; an 8-column composite key would die on the drawing board with me, for all the reasons you are experiencing.

    CEWII
