Incremental load using Hashvalue

  • Hi everybody,

    Background: we receive every day customer data, including new customer and changes to previous customer. The new customer are inserted in our CRM system using the cozyroc component. The changes in previous customer data must be analysed and only updated in CRM if the changes are relevant.

    A relevant change is for us a change in a relevant column. The customer table, where we store the incoming data has 10 columns, and only 5 are relevant to the CRM system. So if we receive updates only in the non relevant columns we should discard these rows in our incremental load.

    Initial solution: add a column to store a hash value with the sum of the relevant columns and then calculate the hash value of the incoming data in order to compare those values

    hashvalue =

    Convert(nvarchar(35),hashbytes('MD5', ISNULL(Col1,'') + ISNULL(Col2, '2000-01-01')+...ISNULL(Col5, '')))

    The data type of the hashvalue column is nvarchar(35).

    The strange thing is that the stored value are weird, something like " ೢムό䜨غ䟊"

    and when new data come the new calculated hashvalue never matches the previous one

    The question:

    I try with a hashvalue column of type binary(16) and seems to work, but my question is why I cannot used the nvarchar solution?

    It is also painful because I cannot compare binary values using a conditional split, and I have to convert again the binary values to DT_WSTR for the comparison.

    Any comment would be appreciated.

    Kind Regards,

    Paul Hernández
  • I'm a bit confused as to why you need the hash value. Why not select those 5 records in your lookup, and test for changes to any of them in your conditional split?

  • Paul,

    It's isn't it the convert that throwing it off? When I've done this sort of thing before I've always had by hash as a computed column (obviously persisted and indexed appropriately). Such as the simplified example below.

    I'm confused as to why you'd do a conditional split, would it not be more performant to do an update based on the key value where the hash doesn't match?

    Applologies if I'm missing something.

    IF OBJECT_ID('dbo.Emp') IS NOT NULL

    DROP TABLE dbo.Emp

    GO

    CREATE TABLE dbo.Emp

    (

    Id INT PRIMARY KEY

    ,Forename VARCHAR(50)

    ,Surname VARCHAR(50)

    ,JobTitle VARCHAR(50)

    ,HashCol AS HASHBYTES('MD5',Forename + Surname + JobTitle)

    )

    GO

    INSERT INTO dbo.Emp

    VALUES(1,'Bob','Smith','IT Manager')

    ,(2,'Harry','Hill','DBa')

    ,(3,'Chris','Brown','Programmer')

    GO

    UPDATE dbo.Emp

    SET Forename = 'Robert'

    WHERE HASHBYTES('MD5', 'BobSmithIT Manager') = HashCol

    GO

    SELECT * FROM dbo.Emp

    GO

  • Perhaps you could try this.

    1) Send all of your potential updates to a staging table.

    2) Do a Merge from the staging table to perform the update using a source something like

    (Select * from staging)

    Except

    (Select relevant cols from target join staging on col = col)

    The except condition should remove the identical rows without fuss.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • Hi Folks,

    Thanks for your answers.

    @Nevyn: my case is actually more complex, as an example I said 10 columns, 5 relevant and 5 non relevant, but I have a table with more than 40 columns, with at least 25 relevant columns

    @Farlzy: that´s an elegant solution and I can solve almost everything in the database, I mean, more in the SQL Server and less in SSIS.

    @Phil: you always give useful answers 🙂 I have chosen your solution, in fact I already have a Staging DB to control the information exchange between our Data Mart and the CRM. I think your solution is not only simple to implement but also has a better performance. I´m a fan of the merge command.

    Kind Regards,

    Paul Hernández
  • I agree with Phil's approach.

    SELECT

    EXCEPT

    SELECT

    is a tried and true way of doing a column specific "table diff". I use hashing as a last resort for extremely large and wide dimensions with many type-2 columns. Personally, I love Todd McDermid's SCD Transform (free on codeplex), but in this situation, simplicity rules...especially if there are not many on your team that know and can support SSIS.

    Unfortunately, SQL Server does NOT include the ANSI standard "CORRESPONDING BY" functionality yet....maybe someday....

  • I've done this before, also with the Persisted computed column for only relevant fields, though I usually use a key-checksum combination.

    What I typically do is load the key/checksum combination into a lookup component, then feed in changes from the source determined by their delta detection (usually a datetime field and a 'last run' field in my configurations) and determine those that don't match for the combination. In a lookup, the unmatched error rows can be easily re-directed, and those are what I feed to my staging table. The matching rows are allowed to rot in dev/null. 🙂

    After that, once they're in the staging structure, I do what Phil recommends and MERGE those I know are changed. It just eases up on the processing and since I have to move the data between source and target anyway, a bit of transformation along the way is far more lightweight to the computations than processing 2-3x the rows on the target.

    Just keep the lookup tight. If you're past 30 bytes across or so, it's far too large for easy cache'ing on umpteen million row targets, at least on my equipment.

    EDIT: I should mention that yes, I know checksum can have varchar continuation concerns, where 'abc','def' is the same value as 'ab','cdef'. If I can't split my varchars with integers I'll usually append a pipe in the calculation function, so:

    CheckSumVal AS CHECKSUM( VCol1, IntCol1, VCol2, '|' + VCol3, '|' + VCol4)


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • There's one issue with using the EXCEPT approach as the data source for a subsequent MERGE: it becomes impossible to tell from within the MERGE if rows have been deleted from the data source (as they are combined with rows that are identical across both the source and target). If that's okay (you may not care about deletes or you may know that the source never has deletes) then fine, but if not, you are left with the problem of dealing with deletes separately.

    An alternative is to use MERGE itself to do the work for you. You end up with doing the following:

    MERGE INTO B

    USING A

    ON A.key = B.key

    WHEN MATCHED AND (

    A.x is null and B.x is not null or

    A.x is not null and B.x is null or

    A.x <> B.x or

    A.y is null and B.y is not null or

    A.y is not null and B.y is null or

    A.y <> B.y

    )

    THEN -- update

    WHEN NOT MATCHED BY TARGET

    THEN -- insert

    WHEN NOT MATCHED BY SOURCE

    THEN -- delete

    Okay, so that's more code you have to write, unless you take into account the additional code you have to write to handle deletes using the other approach.

    I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column by column comparison, it can "short circuit" the comparison as soon as it finds a mis-match. If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach with using MERGE to determine the changes will work either way.

  • Bruce W Cassidy (1/29/2014)


    I generally don't favour calculating a checksum across the columns and comparing checksums. SQL Server has to calculate the checksum across all of the columns every time, whereas with the column by column comparison, it can "short circuit" the comparison as soon as it finds a mis-match.

    A persisted calculated column is relatively lightweight, and doesn't require continuous computation. The key being it's persisted. Also INTs are small.

    If you calculate the checksum while you are loading the data, it should improve the performance of the subsequent MERGE, but at the cost of a slower load and having to store the checksum; it's a trade-off. The approach with using MERGE to determine the changes will work either way.

    True, it depends on change volume expectations of the columns required vs. the columns that aren't transferred. If it's low, I probably wouldn't pursue the removal of them. If it's high removing it entirely is relatively inexpensive.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Checksum by itself is risky.

    Checksum in combination with the business key is less risky.

    HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.

    Many approaches exist for SCDs depending on data volumes, load windows, and a few other factors.

  • sneumersky (1/29/2014)


    HASHBYTES is the best for determining total and complete uniqueness if you are willing to deal with the overhead.

    From MSDN:

    Return Value

    --------------------------------------------------------------------------------

    varbinary (maximum 8000 bytes)

    If I'm going to possibly compare 8000 bytes I'm usually better off just doing the column compare, thus why I rarely use HASHBYTES.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • I've never had HASHBYTES return more than 14 bytes.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply