Comparing Datasets to create an "Update" File

  • I've created a table consisting of one record per customer, with a multitude of customer attributes, any of which can change from one day to the next. The job is run daily, creating a new table. Is there a fairly simple way, using T-SQL, to create a file consisting of only those records (customers) that have changed since the previous extract? It seems each field needs to be compared between the tables, and with 70+ columns, that doesn't sound like a lot of fun...

    Thanks

  • Sometimes you just have to do the work. You can write a script against INFORMATION_SCHEMA.COLUMNS to generate the code you need (a rough sketch of such a generator is below).

    The following is the general logic you need to do a complete test of changes. You can eliminate the NULL tests for columns that are not nullable.

    where
        (a.col1 <> b.col1
         or (a.col1 is null and b.col1 is not null)
         or (a.col1 is not null and b.col1 is null))
    or
    … and so on for each column…
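
    A rough sketch of such a generator, assuming the two daily snapshots live in dbo.TableA and dbo.TableB and are keyed on CustomerID (those names are placeholders; substitute your own). It builds the predicate above for every column and outputs the finished comparison query as a string rather than running it:

    -- Placeholders: dbo.TableA / dbo.TableB / CustomerID. The variable-assignment
    -- string concatenation used here is a common, if informally documented, T-SQL idiom.
    declare @where nvarchar(max);

    select @where = coalesce(@where + char(13) + char(10) + 'or ', '')
         + '(a.' + quotename(column_name) + ' <> b.' + quotename(column_name)
         + ' or (a.' + quotename(column_name) + ' is null and b.' + quotename(column_name) + ' is not null)'
         + ' or (a.' + quotename(column_name) + ' is not null and b.' + quotename(column_name) + ' is null))'
    from information_schema.columns
    where table_schema = 'dbo'
      and table_name = 'TableA'
      and column_name <> 'CustomerID';   -- skip the join key

    -- Output the generated query; paste it into the daily job.
    select 'select b.*'
         + char(13) + char(10) + 'from dbo.TableA a join dbo.TableB b on a.CustomerID = b.CustomerID'
         + char(13) + char(10) + 'where ' + @where as generated_sql;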

  • How about:

    Create table #t1 (col1 int primary key, col2 int, col3 int)
    Insert INTO #t1
    Select 1,1,1
    union all Select 2,1,2
    union all Select 3,1,3

    Create table #t2 (col1 int primary key, col2 int, col3 int)
    Insert INTO #t2
    Select 1,1,1
    union all Select 2,1,2
    union all Select 3,1,4

    Select *
    From
    (
        Select col1, checksum(*) t1_checksum
        From #t1
    ) t1
    inner join
    (
        Select col1, checksum(*) t2_checksum
        From #t2
    ) t2 on t1.col1 = t2.col1
    where t1.t1_checksum <> t2.t2_checksum

    drop table #t1
    drop table #t2

  • Thanks for the replies... I've never used checksum before, but it sounds like it might be the way to go. I'll have to try it out.

  • Jeremy (4/7/2008)


    [quoted checksum example snipped; see above]

    There is a small but real possibility that two rows that are different can have the same checksum.

  • Thanks for the feedback. If something seems too good to be true, something must be wrong with it 🙂

    So how small of a chance is it that they could be the same? Is it sort of like hashing where you could get more and more "collisions" as your dataset grows?

  • There are a few options available to you that can simplify this type of thing. However, I can't say which would be best for your situation without considerably more detail.

    Have you considered a "timestamp"? There are many ways to implement this, such as a timestamp (rowversion) column, a datetime column updated through a trigger, so on and so forth. These methods allow you to pull rows that have been updated from a point in time, though you won't know if the data actually changed.

    CHECKSUM(), BINARY_CHECKSUM() or even HASHBYTES() are options that allow you to know whether the data in a row has actually changed. There are some caveats to the usage of these functions, so do your research to know which is best for you. This solution allows you to pull rows, or sets of columns, that have changed by comparing your KEY and checksum value from your different sources (a sketch is at the end of this post).

    Either of these methods can be optimized with indexes, which is not usually feasible with column-by-column compares.

    GL!

    Edit: Just realized I was a bit slow on posting and plenty of others suggested some of the same. Oops.
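
    A minimal sketch of the HASHBYTES() variant, assuming two daily snapshot tables dbo.TableA (yesterday) and dbo.TableB (today) keyed on CustomerID, with made-up columns Col1-Col3. Every column has to be cast to a string and NULLs handled, which is part of the caveat research mentioned above:

    -- Placeholders: dbo.TableA / dbo.TableB / CustomerID / Col1-Col3.
    -- HASHBYTES works on a single value, so the columns are concatenated with a
    -- delimiter; note the input is limited to 8000 bytes on older versions.
    select b.CustomerID
    from dbo.TableB b
    join dbo.TableA a on a.CustomerID = b.CustomerID
    where hashbytes('SHA1',
              isnull(cast(b.Col1 as nvarchar(100)), N'') + N'|' +
              isnull(cast(b.Col2 as nvarchar(100)), N'') + N'|' +
              isnull(cast(b.Col3 as nvarchar(100)), N''))
       <> hashbytes('SHA1',
              isnull(cast(a.Col1 as nvarchar(100)), N'') + N'|' +
              isnull(cast(a.Col2 as nvarchar(100)), N'') + N'|' +
              isnull(cast(a.Col3 as nvarchar(100)), N''));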

  • The trouble with checksum is that a difference in the checksum means that the rows are different, but identical checksums do not mean that they are the same.

    It is not a solution for your problem unless you do not care about 100% accuracy.

  • Are you in a position where you can add a "timestamp" column to the table you want to track changes on? (Look up Timestamp in Books Online. It has nothing to do with date and time.)

    Every time a row that has a "timestamp" column is updated, the "timestamp" value changes. That means that if you record the timestamp in a log table and join on "MainTable.ID = LogTable.ID and MainTable.timestamp > LogTable.timestamp", you will get the rows that have been updated (a sketch is below).

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi GSquared,

    Thanks for your input... I think I understand the timestamp (system-generated number) concept, but I'm not following how this would work in my case (and it may just be because I'm more used to "general querying"). Is it possible to UPDATE an existing table with another table, but programmatically *only* update the fields/columns that are different? That way, as you say, the timestamp would stay the same if no values changed in the row, and at the end of the process I could query for those rows where the timestamp had changed and create the "update" file.

    Again, here is my desired goal:

    Day1 ==> Run query to create TableA
    Day2 ==> Run query to create TableB

    Compare TableA and TableB (same field definitions). Some new customers may have come on board on Day2, therefore some new records. Some existing customers may have updated their addresses, purchased, etc.

    Any change to a customer record from Day1 to Day2 creates a record in UpdateFile. Any new customer record in Day2 also creates a record in UpdateFile.

    Sorry if this is basic stuff...

  • If you have the timestamp from table A stored (as varbinary) in table B, then you can compare those to find out which rows have been updated, and then update the rows.

    Something like:

    update TableB
    set Column1 = TableA.Column1,
        Column2 = TableA.Column2,
        ...
        TableATimestamp = TableA.Timestamp
    from TableA
    where TableB.ID = TableA.ID
        and TableB.TableATimestamp != TableA.Timestamp

    That should handle the updated rows and leave the ones that haven't been updated alone. Then it's just an Insert ... Select from TableA into TableB where the ID isn't already in TableB (a sketch of that step is below).

    There are various other versions of "upsert" (as it's called), and they all have various advantages/drawbacks, but any of them should work. Search this site and Google/MSN/Yahoo/etc., for "SQL Server upsert" and you'll find plenty of material on that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks, GSquared... I'll try this out and see if it's what I'm looking for. Sounds promising!
