January 28, 2014 at 1:50 am
Hi everybody,
Background: we receive customer data every day, including new customers and changes to existing customers. The new customers are inserted into our CRM system using the CozyRoc component. The changes to existing customer data must be analysed and only applied to the CRM if the changes are relevant.
For us, a relevant change is a change in a relevant column. The customer table where we store the incoming data has 10 columns, and only 5 are relevant to the CRM system. So if we receive updates only in the non-relevant columns, we should discard those rows in our incremental load.
Initial solution: add a column to store a hash value calculated over the relevant columns, then calculate the same hash over the incoming data and compare the two values:
hashvalue =
CONVERT(nvarchar(35), HASHBYTES('MD5', ISNULL(Col1, '') + ISNULL(Col2, '2000-01-01') + ... + ISNULL(Col5, '')))
The data type of the hashvalue column is nvarchar(35).
The strange thing is that the stored values are weird, something like " ೢムό䜨غ䟊",
and when new data comes in, the newly calculated hash value never matches the previous one.
The question:
I tried with a hashvalue column of type binary(16) and it seems to work, but my question is: why can't I use the nvarchar solution?
It is also painful because I cannot compare binary values using a conditional split, so I have to convert the binary values to DT_WSTR again for the comparison.
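For reference, this is roughly what the binary(16) variant looks like (the table and column names here are only examples, not our real schema):
-- Example only: MD5 over the 5 relevant columns, stored as binary(16) since MD5 returns 16 bytes
ALTER TABLE dbo.CustomerStaging ADD HashValue BINARY(16)
GO
UPDATE dbo.CustomerStaging
SET HashValue = HASHBYTES('MD5',
        ISNULL(Col1, N'') + ISNULL(Col2, N'') + ISNULL(Col3, N'')
      + ISNULL(Col4, N'') + ISNULL(Col5, N''))
GO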
Any comment would be appreciated.
Kind Regards,
January 28, 2014 at 7:57 am
I'm a bit confused as to why you need the hash value. Why not select those 5 columns in your lookup, and test for changes to any of them in your conditional split?
January 28, 2014 at 9:37 am
Paul,
Isn't it the convert that's throwing it off? When I've done this sort of thing before, I've always had my hash as a computed column (obviously persisted and indexed appropriately), such as the simplified example below.
I'm confused as to why you'd do a conditional split; would it not be more performant to do an update based on the key value where the hash doesn't match?
Apologies if I'm missing something.
IF OBJECT_ID('dbo.Emp') IS NOT NULL
    DROP TABLE dbo.Emp
GO
CREATE TABLE dbo.Emp
(
    Id INT PRIMARY KEY
    ,Forename VARCHAR(50)
    ,Surname VARCHAR(50)
    ,JobTitle VARCHAR(50)
    -- Hash computed over the columns we care about
    ,HashCol AS HASHBYTES('MD5', Forename + Surname + JobTitle)
)
GO
INSERT INTO dbo.Emp
VALUES (1, 'Bob', 'Smith', 'IT Manager')
      ,(2, 'Harry', 'Hill', 'DBA')
      ,(3, 'Chris', 'Brown', 'Programmer')
GO
-- Update the row whose stored hash matches the hash of the incoming values
UPDATE dbo.Emp
SET Forename = 'Robert'
WHERE HASHBYTES('MD5', 'BobSmithIT Manager') = HashCol
GO
SELECT * FROM dbo.Emp
GO
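And for the update-on-mismatch idea, a rough sketch of what I mean, assuming a hypothetical dbo.EmpStaging table with the same columns and the same computed HashCol:
-- Hypothetical: update only the rows whose staged hash differs from the stored one
-- (ignoring the NULL-concatenation case for brevity)
UPDATE e
SET Forename = s.Forename
   ,Surname  = s.Surname
   ,JobTitle = s.JobTitle
FROM dbo.Emp AS e
JOIN dbo.EmpStaging AS s
    ON s.Id = e.Id
WHERE s.HashCol <> e.HashCol
GO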
January 28, 2014 at 10:09 pm
Perhaps you could try this.
1) Send all of your potential updates to a staging table.
2) Do a MERGE from the staging table to perform the update, using a source something like:
(Select * from staging)
Except
(Select relevant cols from target join staging on col = col)
The except condition should remove the identical rows without fuss.
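A rough sketch of that, assuming a staging table dbo.CustomerStaging and a target dbo.CrmCustomer keyed on CustomerId, with Col1-Col5 standing in for the relevant columns (all names are placeholders):
MERGE dbo.CrmCustomer AS tgt
USING
(
    -- Only the staged rows whose relevant columns differ from the target survive the EXCEPT
    SELECT CustomerId, Col1, Col2, Col3, Col4, Col5
    FROM dbo.CustomerStaging
    EXCEPT
    SELECT t.CustomerId, t.Col1, t.Col2, t.Col3, t.Col4, t.Col5
    FROM dbo.CrmCustomer AS t
    JOIN dbo.CustomerStaging AS s
        ON s.CustomerId = t.CustomerId
) AS src
    ON src.CustomerId = tgt.CustomerId
WHEN MATCHED THEN
    UPDATE SET Col1 = src.Col1
              ,Col2 = src.Col2
              ,Col3 = src.Col3
              ,Col4 = src.Col4
              ,Col5 = src.Col5;
A nice side effect is that EXCEPT treats NULLs as equal, so a NULL-to-NULL comparison doesn't register as a change.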
The absence of evidence is not evidence of absence
- Martin Rees
The absence of consumable DDL, sample data and desired results is, however, evidence of the absence of my response
- Phil Parkin
January 29, 2014 at 12:50 am
Hi Folks,
Thanks for your answers.
@Nevyn: my case is actually more complex. As an example I said 10 columns, 5 relevant and 5 non-relevant, but I really have a table with more than 40 columns, of which at least 25 are relevant.
@Farlzy: that's an elegant solution, and it lets me solve almost everything in the database; I mean, more in SQL Server and less in SSIS.
@Phil: you always give useful answers 🙂 I have chosen your solution; in fact, I already have a staging DB to control the information exchange between our Data Mart and the CRM. I think your solution is not only simple to implement but also performs better. I'm a fan of the MERGE command.
Kind Regards,
January 29, 2014 at 10:30 am
I agree with Phil's approach.
SELECT
EXCEPT
SELECT
is a tried and true way of doing a column-specific "table diff". I use hashing as a last resort, for extremely large and wide dimensions with many Type 2 columns. Personally, I love Todd McDermid's SCD Transform (free on CodePlex), but in this situation simplicity rules, especially if there are not many people on your team who know and can support SSIS.
Unfortunately, SQL Server does NOT include the ANSI-standard CORRESPONDING BY functionality yet... maybe someday.
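So, lacking CORRESPONDING BY, you just list the relevant columns explicitly on both sides; a minimal sketch with placeholder names:
SELECT CustomerId, Col1, Col2, Col3, Col4, Col5 FROM dbo.CustomerStaging
EXCEPT
SELECT CustomerId, Col1, Col2, Col3, Col4, Col5 FROM dbo.CrmCustomer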
January 29, 2014 at 12:47 pm
I've done this before too, also with a persisted computed column over only the relevant fields, though I usually use a key/checksum combination.
What I typically do is load the key/checksum combination into a lookup component, then feed in the changes from the source determined by their delta detection (usually a datetime field and a 'last run' value in my configurations) and pick out those that don't match on the combination. In a lookup, the unmatched rows can easily be redirected, and those are what I feed to my staging table. The matching rows are allowed to rot in /dev/null. 🙂
After that, once they're in the staging structure, I do what Phil recommends and MERGE the rows I know have changed. It eases up on the processing, and since I have to move the data between source and target anyway, a bit of transformation along the way is far cheaper computationally than processing 2-3x the rows on the target.
Just keep the lookup tight. If you're past 30 bytes or so across, it's far too large for easy caching on umpteen-million-row targets, at least on my equipment.
EDIT: I should mention that yes, I know a checksum over concatenated varchars has overlap concerns, where 'abc','def' gives the same value as 'ab','cdef'. If I can't separate my varchars with integers, I'll usually append a pipe in the calculation, so:
CheckSumVal AS CHECKSUM(VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4)
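In table form, a rough sketch (all names are placeholders):
-- Persisted checksum over only the relevant columns, with pipes between adjacent varchars
-- to avoid the overlap issue described above
CREATE TABLE dbo.SourceCustomer
(
    CustomerId INT PRIMARY KEY
    ,VCol1 VARCHAR(50)
    ,IntCol1 INT
    ,VCol2 VARCHAR(50)
    ,VCol3 VARCHAR(50)
    ,VCol4 VARCHAR(50)
    ,CheckSumVal AS CHECKSUM(VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4) PERSISTED
)
The lookup then only has to cache CustomerId plus the INT checksum, which keeps it nice and narrow.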
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
January 29, 2014 at 1:05 pm
There's one issue with using the EXCEPT approach as the data source for a subsequent MERGE: it becomes impossible to tell from within the MERGE if rows have been deleted from the data source (as they are combined with rows that are identical across both the source and target). If that's okay (you may not care about deletes or you may know that the source never has deletes) then fine, but if not, you are left with the problem of dealing with deletes separately.
An alternative is to use MERGE itself to do the work for you. You end up doing something like the following:
MERGE INTO B
USING A
    ON A.[key] = B.[key]
WHEN MATCHED AND (
       (A.x IS NULL AND B.x IS NOT NULL)
    OR (A.x IS NOT NULL AND B.x IS NULL)
    OR (A.x <> B.x)
    OR (A.y IS NULL AND B.y IS NOT NULL)
    OR (A.y IS NOT NULL AND B.y IS NULL)
    OR (A.y <> B.y)
)
    THEN UPDATE SET x = A.x, y = A.y
WHEN NOT MATCHED BY TARGET
    THEN INSERT ([key], x, y) VALUES (A.[key], A.x, A.y)
WHEN NOT MATCHED BY SOURCE
    THEN DELETE;
Okay, so that's more code you have to write, unless you take into account the additional code you have to write to handle deletes using the other approach.
I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column-by-column comparison, it can "short circuit" the comparison as soon as it finds a mismatch. If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach of using MERGE to determine the changes will work either way.
January 29, 2014 at 1:19 pm
Bruce W Cassidy (1/29/2014)
I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column-by-column comparison, it can "short circuit" the comparison as soon as it finds a mismatch.
A persisted calculated column is relatively lightweight, and doesn't require continuous computation. The key being it's persisted. Also INTs are small.
If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach of using MERGE to determine the changes will work either way.
True, it depends on the expected change volume of the columns required vs. the columns that aren't transferred. If it's low, I probably wouldn't pursue removing them. If it's high, removing them entirely is relatively inexpensive.
January 29, 2014 at 1:32 pm
Checksum by itself is risky.
Checksum in combination with the business key is less risky.
HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.
Many approaches exist for SCDs depending on data volumes, load windows, and a few other factors.
January 29, 2014 at 4:09 pm
sneumersky (1/29/2014)
HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.
From MSDN:
Return Value: varbinary (maximum 8000 bytes)
If I'm potentially going to compare 8000 bytes, I'm usually better off just doing the column compare, which is why I rarely use HASHBYTES.
January 30, 2014 at 9:38 am
I've never had HASHBYTES return more than 14 bytes.