April 19, 2010 at 6:54 am
tarquin (4/19/2010)
Here I was believing in Kimball! To prove or disprove my guru's theory I went and looked at a client database that has been in operation for about a year. I found a table with just under 20 million records - the logical key is a date and two ints (EffectiveDate, FundId and DataTypeId). The date is the most selective, the fund a bit less so, and DataTypeId basically useless. I have created a checksum of the three keys roughly as follows: checksum(hashbytes(date as string), fundid, datatypeid), which is stored along with the values at load time. The checksum calculation has almost no overhead at load. The density values shown in the statistics details are:
1) Checksum = 0.00005228
2) EffectiveDate = 0.0001738
3) FundId = 0.00057142
4) DataTypeId = 1
So my hash index is more selective than any of the logical key columns alone. This represents a moderate amount of data on a daily basis - more data would show a bigger difference.
Am I being dumb here - is this a wasted exercise?
CHECKSUM has little overhead because it performs similarly to the following:
1. Grab 32 bits of input and put them in the current result
2. Shift the current result 4 bits
3. XOR the next 32 bits of input with the current result and store that as the current result
4. Repeat from step 2 until the end of the data
If your logical key is (EffectiveDate, FundID, DataTypeID), then the cardinality of the combination of (EffectiveDate, FundID, DataTypeID) is what's important, not the cardinality of 1/3 of your key.
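As a rough illustration of that point (hypothetical table name, columns as in the post above), you can compare the density (1 / number of distinct values) of each column on its own with the density of the full composite key; the last figure is the one a hash should be judged against:

-- Hypothetical table name; density = 1 / (number of distinct values),
-- so lower means more selective.
SELECT
    1.0 / COUNT(DISTINCT EffectiveDate) AS density_EffectiveDate,
    1.0 / COUNT(DISTINCT FundId)        AS density_FundId,
    1.0 / COUNT(DISTINCT DataTypeId)    AS density_DataTypeId,
    1.0 / (SELECT COUNT(*)
           FROM (SELECT DISTINCT EffectiveDate, FundId, DataTypeId
                 FROM dbo.FactFundValue) AS k) AS density_CompositeKey
FROM dbo.FactFundValue;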
April 19, 2010 at 8:35 am
Mike C (4/17/2010)
Not sure how you calculated the probability of collision at 2^40 off the top of your head... You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work. In fact I'll post a very simple MD5 hash collision example shortly. This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5. It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.
I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:
April 19, 2010 at 8:41 am
I think there's a significant element missing from the "which way is faster to process" debate (hash then compare, or just compare): time to market. As in, how quickly can we code the solution. We're in the probably unusual, but far from unique, situation where we have lots of data sources but not a tremendous amount of data in each one, and in our case this is combined with small development resources. Let's call it a "wide and shallow" data warehouse.

The result? We implemented something akin to Magarity's solution, with a couple of meta-data based SPs and TVFs that we can use over and over again (to hashbyte all the columns in a table, given a list of "ignore" columns, for example). While this might not be THE most efficient way to process all of the source systems, for us it was a no-brainer. Our DW server is fast enough that we can use this solution over and over, and we don't have to dig very deep into each source system to see whether or not we can trust the update date columns or whatever. We just use the same approach for each data source, and reuse the same code, or at least the same approach.

Could we be loading some of our Fact tables faster if we didn't hash the rows and did column-by-column comparisons, and some of the other ones faster by finding that we actually can trust the update date column (maintained by a vendor, mind you, not our own code)? Undoubtedly. Would we still be working on one of the data sources I completed the ETL, Cubes, and Reports for 4 months ago? Undoubtedly.
So, while I love seeing the logic and the efficiency comparisons discussed above, I think time to market is an efficiency that needs to be considered in at least some cases.
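For what it's worth, here's a minimal sketch of the kind of meta-data driven helper described above (all object names are hypothetical, and this is T-SQL rather than our exact code): it builds a HASHBYTES concatenation expression over every column of a table except a supplied "ignore" list, driven by sys.columns.

-- Hypothetical table name and ignore list; generates a concatenation expression
-- you can paste into: SELECT HASHBYTES('SHA1', <expr>) AS RowHash FROM <table>
-- Note: prior to SQL Server 2016, HASHBYTES input is limited to 8000 bytes.
DECLARE @table  sysname       = N'dbo.SourceVendor';
DECLARE @ignore nvarchar(max) = N',LoadDate,LoadBatchId,';  -- comma-delimited columns to skip
DECLARE @expr   nvarchar(max) = N'';

SELECT @expr = @expr
             + CASE WHEN @expr = N'' THEN N'' ELSE N' + ''|'' + ' END
             + N'ISNULL(CONVERT(varchar(max), ' + QUOTENAME(c.name) + N'), '''')'
FROM sys.columns AS c
WHERE c.[object_id] = OBJECT_ID(@table)
  AND @ignore NOT LIKE N'%,' + c.name + N',%'
ORDER BY c.column_id;   -- ordering via variable concatenation is a convenience, not a guarantee

PRINT @expr;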
Cheers,
Rick Todd
April 19, 2010 at 10:10 am
Nadrek (4/19/2010)
Mike C (4/17/2010)
Not sure how you calculated the probability of collision at 2^40 off the top of your head... You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work. In fact I'll post a very simple MD5 hash collision example shortly. This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5. It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.
I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:
Yes, actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on. In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages). There are no restrictions on the resources you can use. If you can turn McDonald, Hawkes and Pieprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.
April 19, 2010 at 10:15 am
Rick Todd (4/19/2010)
I think there's a significant element missing from the "which way is faster to process" debate (hash then compare, or just compare): time to market. As in, how quickly can we code the solution. We're in the probably unusual, but far from unique situation where we have lots of data sources, but not a tremendous amount of data in each one, and in our case this is combined with small development resources. Let's call it a "wide and shallow" data warehouse. The result? We implemented something akin to Magarity's solution, with a couple of meta-data based SPs and TVFs that we can use over and over again (to hashbyte all the columns in a table, given a list of "ignore" columns, for example). While this might not be THE most efficient way to process all of the source systems, for us it was a no-brainer. Our DW server is fast enough that we can use this solution over and over, and we don't have to dig very deep into each source system to see whether or not we can trust the update date columns or whatever. We just use the same approach for each data source, and reuse the same code, or at least approach. Could we be loading some of our Fact tables faster if we didn't hash the rows and did column-by-column comparisons, and some of the other ones faster by finding that we actually can trust the update date column (maintained by a vendor, mind you, not our own code)? Undoubtedly. Would we still be working on one of the data sources I completed the ETL, Cubes, and Reports for 4 months ago? Undoubtedly. So, while I love seeing the logic and discussion of the efficiencies discussed above, I think this is an efficiency that needs to be considered in at least some cases.
Cheers,
I agree. In fact, I've implemented solutions that use .NET in SSIS to generate hashes for both primary keys in the source (natural keys) and to detect changes across non-key columns at the same time, with similar capabilities to eliminate certain administrative columns as needed. Hashing the PKs was done for the reasons you mention above -- one source system may have a one-column PK while another system may have a five-column PK, yet we needed to be able to grab data from new source systems while minimizing customization. So this should be a big part of the conversation, and could be the determining factor in any given situation. The other benefits of generating a narrow single-column key (hash) from a multi-column source key, for instance, may well outweigh the overhead involved in generating the hash.
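Purely as an illustration of that idea (the implementation described above used .NET in SSIS; this is an equivalent expressed in T-SQL with hypothetical staging table and column names): one hash collapses the multi-column natural key into a single narrow key, while a second hash covers the non-key columns for change detection.

-- Hypothetical staging table and columns.
SELECT
    HASHBYTES('SHA1',
        CONVERT(varchar(max), SourceSystemId) + '|'
      + CONVERT(varchar(max), AccountCode)    + '|'
      + CONVERT(varchar(max), BranchCode))                 AS NaturalKeyHash,  -- narrow key from the multi-column natural key
    HASHBYTES('SHA1',
        ISNULL(CONVERT(varchar(max), AccountName),   '') + '|'
      + ISNULL(CONVERT(varchar(max), AccountStatus), ''))  AS RowChangeHash    -- change detection over non-key columns
FROM dbo.StagingAccount;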
April 19, 2010 at 11:23 am
Mike C (4/19/2010)
Nadrek (4/19/2010)
Mike C (4/17/2010)
Not sure how you calculated the probability of collision at 2^40 off the top of your head... You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work. In fact I'll post a very simple MD5 hash collision example shortly. This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5. It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.
I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:
Yes, actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on. In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages). There are no restrictions on the resources you can use. If you can turn McDonald, Hawkes and Pieprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.
I suspect I misinterpreted what you meant by the current crop of theoretical attacks increasing the probability of forcing a SHA-1 collision to 2^69, then.
April 19, 2010 at 12:21 pm
Nadrek (4/19/2010)
Mike C (4/19/2010)
Nadrek (4/19/2010)
Mike C (4/17/2010)
Not sure how you calculated the probability of collision at 2^40 off the top of your head... You can engineer an MD5 collision in 2^39 or less, but it's a pretty hard calculation to do without showing your work. In fact I'll post a very simple MD5 hash collision example shortly. This is why I recommend SHA-1 or better (probability of collision for 160-bit hash is 2^80 per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the probability of forcibly generating an SHA-1 hash collision to 2^69 which is higher than the standard probability of collision with MD5. It's really hard to prove a negative and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.
I take it you haven't seen the McDonald, Hawkes and Pieprzyk paper from Eurocrypt 2009's rump session:
Yes, actually I did see that theoretical paper and I'm familiar with Joux and Peyrin's Amplified Boomerang attack that it's based on. In fact I'm running a contest over on the blog at http://www.sqlblog.com (full link in previous messages). There are no restrictions on the resources you can use. If you can turn McDonald, Hawkes and Pieprzyk's theoretical slideshow into a practical implementation that actually generates a hash collision using SHA-1 on SQL Server, send it in and you can win $100.
I suspect I misinterpreted what you meant by the current crop of theoretical attacks increasing the probability of forcing a SHA-1 collision to 2^69, then.
Nope, you didn't misinterpret. 2^69 was published by a Chinese research team, I believe about 5 years ago. McDonald and company withdrew their published paper that accompanied this slideshow ("Differential Path for SHA-1 with complexity O(2^52)") after they figured out their estimate was incorrect!
Link to withdrawal statement added: http://eprint.iacr.org/2009/259
April 19, 2010 at 1:14 pm
Look, the question of exactly which algorithm to use is a far-off tangent. The main idea of my article is to use narrow xref tables to store hashes (however you get them) instead of tacking them onto the wide base table like a lot of people do, a point no one seems to have noticed in the great debate over which formula to use. I personally don't care if you use SHA1 or CHECKSUM or anything else. It has nothing to do with the model.
As far as I'm concerned, whose collisions are theoretically less likely than whose is an implementation detail that you can work out case by case.
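For anyone who hasn't read the article, a minimal sketch of the xref idea (table and column names here are hypothetical, not the article's exact DDL): the hash lives in a narrow cross-reference table keyed the same way as the wide base table, and change detection joins staging to the narrow table instead of scanning the wide one.

-- Hypothetical names; RowHash could equally be an int CHECKSUM column.
CREATE TABLE dbo.DimVendor_Xref
(
    VendorKey int           NOT NULL PRIMARY KEY,  -- same surrogate key as the wide dimension
    RowHash   varbinary(20) NOT NULL               -- hash of the compared columns
);

-- Detect changed rows against the narrow xref table:
-- SELECT s.*
-- FROM dbo.StagingVendor AS s
-- JOIN dbo.DimVendor_Xref AS x ON x.VendorKey = s.VendorKey
-- WHERE x.RowHash <> s.RowHash;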
April 19, 2010 at 3:46 pm
magarity kerns (4/19/2010)
Look, the question of exactly which algorithm to use is a far-off tangent. The main idea of my article is to use narrow xref tables to store hashes (however you get them) instead of tacking them onto the wide base table like a lot of people do, a point no one seems to have noticed in the great debate over which formula to use. I personally don't care if you use SHA1 or CHECKSUM or anything else. It has nothing to do with the model.
Actually, it's not as far off-topic as you think. Consider a simple hash function that simply grabs the first 4 characters of your inbound data. I don't believe that hash function would achieve a core desired result of detecting changes, regardless of which table you stored your hashes in. I may be wrong, however; some may find that a simple/classic hash function like this is sufficient for their needs.
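For example (hypothetical values), a naive "first four characters" hash misses most changes entirely:

-- Both expressions return 'Vend', so the change from 'Ltd' to 'LLC' goes undetected.
SELECT LEFT('VendorName Ltd', 4) AS hash_before,
       LEFT('VendorName LLC', 4) AS hash_after;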
As far as I'm concerned, whose collisions are theoretically less likely than whose is an implementation detail that you can work out case by case.
On occasion a little attention to detail can help avoid a ton of problems down the road.
BTW, the fact that people are still engaging in discussions of topics from your article for days after it was published is a good thing. FWIW, having written an article or two myself, I've discovered that people often aren't interested in discussing the minutiae that I want them to on the talk pages of my articles.
Since this is the talk page for your article, however, I'll go ahead and un-follow this thread.
To anyone interested in continuing the conversation about the technical minutiae of collision-free hash functions, you can reach me over at the blog: http://www2.sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx.
May 11, 2010 at 2:46 am
I actually use the CHECKSUM function for ETL jobs. Can anybody shed some light on which one is better? CHECKSUM or HASHBYTES...
May 11, 2010 at 7:10 am
ziangij, I would definitely recommend reading through some of the comments, but here's my summary:
If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.
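To make the difference concrete (hypothetical columns): CHECKSUM returns a 4-byte int, so collisions become likely across large row counts, while HASHBYTES with SHA-1 returns 20 bytes.

-- CHECKSUM yields an int; HASHBYTES('SHA1', ...) yields varbinary(20).
SELECT CHECKSUM(VendorID, VendorName)                                        AS ChecksumValue,
       HASHBYTES('SHA1', CONVERT(varchar(max), VendorID) + '|' + VendorName) AS Sha1Value
FROM dbo.SourceVendor;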
Rick Todd
May 13, 2010 at 12:49 am
Rick Todd (5/11/2010)
ziangij, I would definitely recommend reading through some of the comments, but here's my summary: If you have other logic to prevent/handle collisions, checksum might be fine. If you want to rely on the checksum/hashbytes method to do all of your new/update/existing logic, you definitely need hashbytes over checksum.
Apologies, I should have read all the comments, which I have done now.
Many options have been suggested, with the pros and cons of each... 🙂
May 17, 2010 at 2:09 am
Hi ...
I am currently using a similar method in order to compare data. The hash value is calculated as follows:
HashBytes('SHA1',
UPPER(Isnull(convert(varchar(max), Branch), ''))
+ UPPER(Isnull(convert(varchar(max), VendorID), ''))
+ UPPER(Isnull(convert(varchar(max), VendorName), ''))
+ UPPER(Isnull(convert(varchar(max), VendorTradingName), ''))
+ UPPER(Isnull(convert(varchar(max), RegisteredCompany), ''))
+ UPPER(Isnull(convert(varchar(max), TypeOfEntity), ''))
+ UPPER(Isnull(convert(varchar(max), RegistrationNumber), ''))
+ ........ <other columns>
)
I convert each field to the varchar data type and concatenate them together in order to generate a value.
Do you see any problems with this approach?
Also, I checked the length (using LEN()) of the hash value and the max length I have so far is 20. The column that holds the hash value has the data type varbinary(max). Should I change the size of this column?
Thanks in advance.
May 17, 2010 at 8:25 am
Assume you have vendorid and vendorname:
AB Abner
A Babner
Because you're upper-casing and concatenating without a delimiter, these both collapse to
ABABNER, so they look the same even though they clearly are not.
You have to convert to a fixed width, or put some delimiter in there (and be very certain that delimiter doesn't actually show up in your data).
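A quick way to see it (using the values above, with the upper-casing from the original post):

-- Without a delimiter both rows collapse to 'ABABNER' and hash identically;
-- with a delimiter they no longer do.
SELECT HASHBYTES('SHA1', UPPER('AB' + 'Abner'))        AS no_delim_row1,
       HASHBYTES('SHA1', UPPER('A'  + 'Babner'))       AS no_delim_row2,  -- same as row1
       HASHBYTES('SHA1', UPPER('AB' + '|' + 'Abner'))  AS delim_row1,
       HASHBYTES('SHA1', UPPER('A'  + '|' + 'Babner')) AS delim_row2;     -- now different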
May 17, 2010 at 11:18 am
Cyclone (5/17/2010)
Do you see any problems with this approach? Also, I checked the length (using LEN()) of the hash value and the max length I have so far is 20. The column that holds the hash value has the data type varbinary(max). Should I change the size of this column?
Two problems: you're converting to upper case, and you don't have a delimiter. 'AB' and 'A' will give the same result as 'A' and 'BA' with your method. This has been hashed out (pun intended!) in prior conversations; please read the previous comments. It doesn't matter whether the delimiter appears in your data or not, just that there is something there to break the values up; you can just use a space.
Converting to upper case is a little strange - why do it? Check with your SME or data steward to confirm that your target system should ignore changes in case. No one here can give the definitive answer, although I advise against it unless you have a requirement to do so.
The output is always 20 bytes for SHA-1 (and 16 for MD5), but I don't know if the system needs extra processing or space to handle varbinary(max) instead of just varbinary(20). It probably doesn't matter.
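If you do want to tighten it up (hypothetical table and column names), a fixed-width column is enough:

-- SHA-1 output is always 20 bytes (MD5 is 16), so varbinary(20) is sufficient.
ALTER TABLE dbo.VendorStaging ALTER COLUMN HashValue varbinary(20) NULL;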
The point of my article was the xref tables to hold checksums and keys; did anyone notice that at all in the debate over whose hash to use???