Checksum value for finding changed data advice

  • I'm using SSIS (but you could use this in straight TSQL) to compare fields from two rows of data - say on a basic Person Table, just Firstname, Lastname, Middlename.

    CHECKSUM(Firstname, Lastname, Middlename)

    running this on both tables allows you to compare a single int instead of 3 text fields in each table.

    I suppose my question would be if it would be better to run this checksum in a select statement or create a field for it in the table? It seems like it would perform the entire compare faster if it were stored in the table, but then you would need to create an update/insert trigger on each table to make sure the checksum is up to date.

    Right now I'm not sure I want to go adding 30 triggers and adding 15 fields to 15 tables to help speed up an overnight import that will only be run for a limited time (the next 6 mo most likely).

  • Shawn Therrien (5/14/2009)


    I'm using SSIS (but you could use this in straight TSQL) to compare fields from two rows of data - say on a basic Person Table, just Firstname, Lastname, Middlename.

    CHECKSUM(Firstname, Lastname, Middlename)

    running this on both tables allows you to compare a single int instead of 3 text fields in each table.

    I suppose my question would be if it would be better to run this checksum in a select statement or create a field for it in the table? It seems like it would perform the entire compare faster if it were stored in the table, but then you would need to create an update/insert trigger on each table to make sure the checksum is up to date.

    Right now I'm not sure I want to go adding 30 triggers and adding 15 fields to 15 tables to help speed up an overnight import that will only be run for a limited time (the next 6 mo most likely).

    If I were going to do it, I'd just add a persisted calculated column to the table. The problem with CHECKSUM is, although fairly rare, it can allow two different values to have identical checksums... it's really nothing more than an EOR (Exclusive OR) of the values being checked.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [font="Verdana"]A persisted value can help. However, from experience... don't bother -- the performance is too sucky. You're better off just comparing all of the individual columns. SQL Server can short circuit the comparisons where possible and miss having to perform calculations with some of the data. In comparison, for checksums and the like, it basically has to calculate a value across all of the columns. So between the time for the calculation and guaranteeing checking all of the data, it's quicker just to do the column by column comparison.

    BTW, here's the snippet from books online about CHECKSUM().

    CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

    So not even HashBytes() will guarantee no value collisions![/font]

  • Bruce W Cassidy (5/14/2009)


    [font="Verdana"]A persisted value can help. However, from experience... don't bother -- the performance is too sucky.

    The performance on a persisted calculated column is "sucky"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/14/2009)


    The performance on a persisted calculated column is "sucky"?

    [font="Verdana"]Yup. You persist it by creating an index over it, right? So when you do a bulk set update or insert, you have to add in the time to change the index.

    Sucky.

    [/font]

  • Bruce W Cassidy (5/14/2009)


    Jeff Moden (5/14/2009)


    The performance on a persisted calculated column is "sucky"?

    [font="Verdana"]Yup. You persist it by creating an index over it, right? So when you do a bulk set update or insert, you have to add in the time to change the index.

    Sucky.

    [/font]

    Oh... ok...I misunderstood where you were using it. Nah... add the persisted column to the final table, not the staging table that you're doing the bulk insert into... or at least not before you do the bulk insert.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/14/2009)


    add the persisted column to the final table, not the staging table that you're doing the bulk insert into... or at least not before you do the bulk insert.

    [font="Verdana"]That's exactly the approach we took. However, we found that it was still quicker overall to just compare column to column.

    I guess it would depend on the amount of change you are getting in the data, and the volumes of data. But my recommendation would be to test the performance with the computed column (assuming you are okay with missing some changes) and also test the performance just comparing column to column.

    From memory (er, this was about six months ago, so I'm only talking ballpark figures), we were looking at:

    Input data (updates and inserts): around 500,000 a day.

    Changes in data (for updates): around 2-3 fields on average (generally a change in address).

    Existing table: around 10 million entries.

    The update time dropped by 20 minutes once we stopped using checksum.

    [/font]

  • Got it and mostly agree especially since CHECKSUM is not a guarantee that something has changed or not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/14/2009)


    Got it and mostly agree especially since CHECKSUM is not a guarantee that something has changed or not.

    [font="Verdana"]We were actually using BINARY_CHECKSUM(), but it has the same issues.

    HashBytes might be better for detecting changes, but the syntax is a bit awkward. Basically you have to pass it one string to be hashed (rather than a list of columns). We didn't test the performance of that though, so maybe it would be better?

    [/font]

  • Haven't need to try it. I'm pretty fortunate in that most of the data providers I "meat" up with respond pretty well at the mere threat of a malstrome of high velocity pork chops if they don't include a "LastModifiedDate" in the data they provide. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply