Checksum vs Comparing each row

Question

Checksum vs Comparing each row

aaa121

SSCertifiable

Points: 5902
More actions
May 4, 2011 at 12:17 pm

#236917

Hi All,
I wanted to know what peoples thoughts were on how best to identify SCDs.
I prefer to use the MERGE statement, and this is the context of this post.
I try to avoid using the checksum method across several columns in order to identify row changes as it's not 100% accurate, instead I prefer to compare all the fields in a statement like so:
SourceField1 <> DestField1
OR SourceField2 <> DestField2
My main problem with this is that if you have quite a wide table the code ends up being quite lengthy.
Interesting to here your opinions or if anyone using a better method?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

Here's what Microsoft says on it:

CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

I've never had any trouble with using an MD5 hash to find row changes.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

aaa121 SSCertifiable Points: 5902 More actions · Answer 2

Thanks G-Squared,

I remember reading that text in the past.

I have used for binary_checksum and MD5 in the past - really interested to see what other ppl are using.

Johan Bijnens SSC Guru Points: 135369 More actions · Answer 3

checksum can produce duplicates !

So, even when using a checksum column to identify a row, best is to still add e.g. primary key columns to be sure you are modifying the intended row

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

vishal.gamji SSCertifiable Points: 6126 More actions · Answer 4

I was initially happy to use it but CHECKSUM; but it does not guarantee uniqueness. Since it is returns an interger value the chances of collisions are a lot higher when u deal with large sets of data.

My advice would be to stay away from it.

vishal.gamji SSCertifiable Points: 6126 More actions · Answer 5

Examples of Checksum vs. Hashbytes

Checksum:

DECLARE @guid1 UNIQUEIDENTIFIER ,@guid2 UNIQUEIDENTIFIER

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT chksum_guid1 = CHECKSUM(@guid1), chksum_guid2 = CHECKSUM(@guid2)

chksum_guid1 = chksum_guid2 =2016662652

This evaluates to the the same Checksum. This is because checksum has only 2^32 (4 bytes) values.

In the case of MD5:

MD5 = 2^128 (16 bytes)

DECLARE @guid1 varchar ,@guid2 varchar

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT hashMD5_guid1 = HASHBYTES('MD5',@guid1), hashMD5_guid2 = HASHBYTES('MD5',@guid2)

hashMD5_guid1 = 0xECCBC87E4B5CE2FE28308FD9F2A7BAF3

hashMD5_guid2 = 0x3A3EA00CFC35332CEDF6E5E9A32E94DA

Using SHA1:

SHA1 = 2^160 (20 bytes)

DECLARE @guid1 varchar ,@guid2 varchar

SELECT @guid1 = '3DB7D309-A8F4-47C4-BA90-0CB458B44CB0' , @guid2 = 'EFE7F2C5-19F9-42B9-9C16-21BED41E882B'

SELECT hashSHA1_guid1 = HASHBYTES('SHA1',@guid1), hashSHA1_guid2 = HASHBYTES('SHA1',@guid2)

hashSHA1_guid1 = 0x77DE68DAECD823BABBB58EDB1C8E14D7106E83BB

hashSHA1_guid2 = 0xE0184ADEDF913B076626646D3F52C3B49C39AD6D

Lamprey13 SSCertifiable Points: 6210 More actions · Answer 6

vishal.gamji (5/5/2011)
I was initially happy to use it but CHECKSUM; but it does not guarantee uniqueness. Since it is returns an interger value the chances of collisions are a lot higher when u deal with large sets of data.
My advice would be to stay away from it.

You should not use CHECKSUM for unquieness (I wouldn't use a HASH either). It should only be used to identify changes. Meaning, you should have a PK and you would use a checksum or hash to see if any of the values for that row have changed.

The advantage of the CHECKSUM is that it generates an INT and not a VARBINARY (up to 8000 bytes).

vishal.gamji SSCertifiable Points: 6126 More actions · Answer 7

I was trying perform a differential load from a DB2 to SQL Server. The DB2 tables have absolutely no way (no PK, no timestamp) of identifying uniquely a row. So initially i tried to checksum all the columns of a table to form the PK but did not work due to collisions (even with tables that had 100+ columns).

So, i was using the CHECKSUM not just for uniqueness but also for comparing, which obviously does not work well. I should have put this explanation the first time. My bad.

vishal.gamji SSCertifiable Points: 6126 More actions · Answer 8

vishal.gamji

SSCertifiable

Points: 6126

May 5, 2011 at 11:09 am

#1320569

Sorry dupe post.