May 4, 2011 at 12:17 pm
Hi All,
I wanted to know what peoples thoughts were on how best to identify SCDs.
I prefer to use the MERGE statement, and this is the context of this post.
I try to avoid using the checksum method across several columns in order to identify row changes as it's not 100% accurate, instead I prefer to compare all the fields in a statement like so:
SourceField1 <> DestField1
OR SourceField2 <> DestField2
My main problem with this is that if you have quite a wide table the code ends up being quite lengthy.
Interesting to here your opinions or if anyone using a better method?
May 4, 2011 at 12:29 pm
Here's what Microsoft says on it:
CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
I've never had any trouble with using an MD5 hash to find row changes.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 4, 2011 at 12:35 pm
Thanks G-Squared,
I remember reading that text in the past.
I have used for binary_checksum and MD5 in the past - really interested to see what other ppl are using.
May 4, 2011 at 2:06 pm
checksum can produce duplicates !
So, even when using a checksum column to identify a row, best is to still add e.g. primary key columns to be sure you are modifying the intended row
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help[/url]
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
May 5, 2011 at 7:26 am
I was initially happy to use it but CHECKSUM; but it does not guarantee uniqueness. Since it is returns an interger value the chances of collisions are a lot higher when u deal with large sets of data.
My advice would be to stay away from it.
May 5, 2011 at 7:48 am
Examples of Checksum vs. Hashbytes
Checksum:
DECLARE @guid1 UNIQUEIDENTIFIER ,@guid2 UNIQUEIDENTIFIER
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2)
chksum_guid1 = chksum_guid2 =2016662652
This evaluates to the the same Checksum. This is because checksum has only 2^32 (4 bytes) values.
In the case of MD5:
MD5 = 2^128 (16 bytes)
DECLARE @guid1 varchar ,@guid2 varchar
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
SELECT hashMD5_guid1 = HASHBYTES('MD5',@guid1), hashMD5_guid2 = HASHBYTES('MD5',@guid2)
hashMD5_guid1 = 0xECCBC87E4B5CE2FE28308FD9F2A7BAF3
hashMD5_guid2 = 0x3A3EA00CFC35332CEDF6E5E9A32E94DA
Using SHA1:
SHA1 = 2^160 (20 bytes)
DECLARE @guid1 varchar ,@guid2 varchar
SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'
SELECT hashSHA1_guid1 = HASHBYTES('SHA1',@guid1), hashSHA1_guid2 = HASHBYTES('SHA1',@guid2)
hashSHA1_guid1 = 0x77DE68DAECD823BABBB58EDB1C8E14D7106E83BB
hashSHA1_guid2 = 0xE0184ADEDF913B076626646D3F52C3B49C39AD6D
May 5, 2011 at 10:35 am
vishal.gamji (5/5/2011)
I was initially happy to use it but CHECKSUM; but it does not guarantee uniqueness. Since it is returns an interger value the chances of collisions are a lot higher when u deal with large sets of data.My advice would be to stay away from it.
You should not use CHECKSUM for unquieness (I wouldn't use a HASH either). It should only be used to identify changes. Meaning, you should have a PK and you would use a checksum or hash to see if any of the values for that row have changed.
The advantage of the CHECKSUM is that it generates an INT and not a VARBINARY (up to 8000 bytes).
May 5, 2011 at 11:09 am
I was trying perform a differential load from a DB2 to SQL Server. The DB2 tables have absolutely no way (no PK, no timestamp) of identifying uniquely a row. So initially i tried to checksum all the columns of a table to form the PK but did not work due to collisions (even with tables that had 100+ columns).
So, i was using the CHECKSUM not just for uniqueness but also for comparing, which obviously does not work well. I should have put this explanation the first time. My bad.
May 5, 2011 at 11:09 am
Sorry dupe post.
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply