HASHBYTES can help quickly load a Data Warehouse

Question

HASHBYTES can help quickly load a Data Warehouse

Viewing 4 posts - 106 through 109 (of 109 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 1

odeddror - Friday, December 8, 2017 6:11 PM
TomThomson - Friday, December 8, 2017 4:03 PM
Jeff Moden - Wednesday, December 6, 2017 8:52 PM
I'll add that if HASHBYTES is different between any two rows, then the rows are guaranteed to be different but, if they are the same,
there's not necessarily a guarantee that the rows are the same. You need to verify that rows with the same hashing are, in fact, the
same. I agree that the chances are very slim that they're not the same but you've already found out that "slim" doesn't mean "never". 😉
Depending on how many different rows are possible and how many different rows you actually have, "slim" may be quite close to "never" or a long way away from it. Roughly speaking, the chances of two different rows picked at random have the same hash are about 1 in 1 000000 000000 000000 000000 000000 000000 000000 000000 (1 in 10**48), but if you've got 3 million different rows picked at random the chance that there's a pair that have the same hash is 9 trillion times that (so about 1 in 10**35) which is a lot further from zero.
Jeff,
Thank you
Oded Dror

Yep... agreed BUT the OP had collisions. And, even if collisions are near "never", it still only "near" never. If you have two or more hashes that are identical and no matter how slim the possibility, the right thing to do is still check using another method to ensure that they actually are identical.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

odeddror SSC Eights! Points: 885 More actions · Answer 2

Jeff Moden - Friday, December 8, 2017 8:05 PM
odeddror - Friday, December 8, 2017 6:11 PM
TomThomson - Friday, December 8, 2017 4:03 PM
Jeff Moden - Wednesday, December 6, 2017 8:52 PM
I'll add that if HASHBYTES is different between any two rows, then the rows are guaranteed to be different but, if they are the same,
there's not necessarily a guarantee that the rows are the same. You need to verify that rows with the same hashing are, in fact, the
same. I agree that the chances are very slim that they're not the same but you've already found out that "slim" doesn't mean "never". 😉
Depending on how many different rows are possible and how many different rows you actually have, "slim" may be quite close to "never" or a long way away from it. Roughly speaking, the chances of two different rows picked at random have the same hash are about 1 in 1 000000 000000 000000 000000 000000 000000 000000 000000 (1 in 10**48), but if you've got 3 million different rows picked at random the chance that there's a pair that have the same hash is 9 trillion times that (so about 1 in 10**35) which is a lot further from zero.
Jeff,
Thank you
Oded Dror
Yep... agreed BUT the OP had collisions. And, even if collisions are near "never", it still only "near" never. If you have two or more hashes that are identical and no matter how slim the possibility, the right thing to do is still check using another method to ensure that they actually are identical.

Jeff,
I will thank you
Oded Dror

TomThomson SSC Guru Points: 104768 More actions · Answer 3

Jeff Moden - Friday, December 8, 2017 8:05 PM
Yep... agreed BUT the OP had collisions. And, even if collisions are near "never", it still only "near" never. If you have two or more hashes that are identical and no matter how slim the possibility, the right thing to do is still check using another method to ensure that they actually are identical.

Yes, he has enough collisions to make me believe there's something strange going on. Assuming he's running 3 million or fewer new rows against his 3 million existing rows the chances of getting 7 or 8 collisions are so small that if he's gets that many even once in a thousand runs we can be rather certain that something very odd is going on - possibly something like what Nadrek suggested a couple of days ago.

And the only way to be sure that two things are identical is to compare them directly. Hash comparisons are nothing other than a way to reduce the number of direct full comparisons needed because hash comparisons discover most of the non-matches cheaply. They certainly are not something that can prove a match, so when the hashes match you do a full compare.

Tom

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 4

TomThomson - Saturday, December 9, 2017 4:31 PM
Jeff Moden - Friday, December 8, 2017 8:05 PM
Yep... agreed BUT the OP had collisions. And, even if collisions are near "never", it still only "near" never. If you have two or more hashes that are identical and no matter how slim the possibility, the right thing to do is still check using another method to ensure that they actually are identical.
Yes, he has enough collisions to make me believe there's something strange going on. Assuming he's running 3 million or fewer new rows against his 3 million existing rows the chances of getting 7 or 8 collisions are so small that if he's gets that many even once in a thousand runs we can be rather certain that something very odd is going on - possibly something like what Nadrek suggested a couple of days ago.
And the only way to be sure that two things are identical is to compare them directly. Hash comparisons are nothing other than a way to reduce the number of direct full comparisons needed because hash comparisons discover most of the non-matches cheaply. They certainly are not something that can prove a match, so when the hashes match you do a full compare.

I totally agree that the number of collisions the OP has are a bit insane. I've not seen so many even with the old MDF algorithm. Thanks for the feedback, Tom.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)