October 10, 2013 at 4:36 pm
Every month I receive a large file from a vendor, and the first step in the ETL process is to determine if any record in the file is identical to a record that we've been sent in the past. Solution is to import it into a working table in the database and then perform an exact duplicate check (all columns identical in both tables) to the final target table.
Although I can script out a huge column-to-column test for a WHERE clause or a CASE expression, it would seem more performant to generate a hash value to either store in a separate table or to index on. However, the target table has dozens of columns and the row size exceeds 8000, so using hashbytes() is right out. I've thought about hashing subsets of the columns and then hashing the resulting hash values, but i fear producing false positives.
Can anyone tell me how they handle this kind of problem? Any constructive suggestions or alternatives would be welcome.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
October 10, 2013 at 4:52 pm
Unless I'm missing something, I think using EXISTS would be a very approachable solution.
October 10, 2013 at 4:53 pm
SELECT CHECKSUM(*) FROM StagingTable
Collisions are possible, so you'll have to do a column-by-column check over any matches, but should be faster. The main table can have a computed column with the checksum, with an index on it for faster matching.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
October 10, 2013 at 5:07 pm
Thanks 🙂
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply