Comparing Datasets to create an "Update" File

  • I've created a table consisting of one record per customer, with a multitude of customer attributes, any of which can change from one day to the next. The job is run daily, creating a new table. Is there a fairly simple way, using T-SQL, to create a file consisting of only those records (customers) that have changed since the previous extract? It seems each field needs to be compared between the tables, and with 70+ columns, that doesn't sound like a lot of fun...

    Thanks

  • Sometimes you just have to do the work. You can write a script against INFORMATION_SCHEMA.COLUMNS to generate the code you need (a rough sketch of such a generator is below).

    The following is the general logic you need to do a complete test of changes. You can eliminate the NULL tests for columns that are not nullable.

    where
        (a.col1 <> b.col1
         or (a.col1 is null and b.col1 is not null)
         or (a.col1 is not null and b.col1 is null))
    or
    … and so on for each column…
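
    A rough sketch of such a generator, assuming the two daily snapshots live in dbo.TableA and dbo.TableB and are keyed on CustomerID (those names are placeholders; substitute your own). It builds the predicate above for every column and outputs the finished comparison query as a string rather than running it:

    -- Placeholders: dbo.TableA / dbo.TableB / CustomerID. The variable-assignment
    -- string concatenation used here is a common, if informally documented, T-SQL idiom.
    declare @where nvarchar(max);

    select @where = coalesce(@where + char(13) + char(10) + 'or ', '')
         + '(a.' + quotename(column_name) + ' <> b.' + quotename(column_name)
         + ' or (a.' + quotename(column_name) + ' is null and b.' + quotename(column_name) + ' is not null)'
         + ' or (a.' + quotename(column_name) + ' is not null and b.' + quotename(column_name) + ' is null))'
    from information_schema.columns
    where table_schema = 'dbo'
      and table_name = 'TableA'
      and column_name <> 'CustomerID';   -- skip the join key

    -- Output the generated query; paste it into the daily job.
    select 'select b.*'
         + char(13) + char(10) + 'from dbo.TableA a join dbo.TableB b on a.CustomerID = b.CustomerID'
         + char(13) + char(10) + 'where ' + @where as generated_sql;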

  • How about:

    Create table #t1 (col1 int primary key, col2 int, col3 int)
    Insert INTO #t1
    Select 1,1,1
    union all Select 2,1,2
    union all Select 3,1,3

    Create table #t2 (col1 int primary key, col2 int, col3 int)
    Insert INTO #t2
    Select 1,1,1
    union all Select 2,1,2
    union all Select 3,1,4

    Select *
    From
    (
        Select col1, checksum(*) t1_checksum
        From #t1
    ) t1
    inner join
    (
        Select col1, checksum(*) t2_checksum
        From #t2
    ) t2 on t1.col1 = t2.col1
    where t1.t1_checksum <> t2.t2_checksum

    drop table #t1
    drop table #t2

  • Thanks for the replies... I've never used checksum before, but it sounds like it might be the way to go. I'll have to try it out.

  • Jeremy (4/7/2008)


    [quoted checksum example snipped; see above]

    There is a small but real possibility that two rows that are different can have the same checksum.

  • Thanks for the feedback. If something seems too good to be true, something must be wrong with it 🙂

    So how small of a chance is it that they could be the same? Is it sort of like hashing where you could get more and more "collisions" as your dataset grows?

  • There are a few options available to you that can simplify this type of thing. However, I can't say which would be best for your situation without considerably more detail.

    Have you considered a "timestamp"? There are many ways to implement this, such as a timestamp (rowversion) column, a datetime column updated through a trigger, so on and so forth. These methods allow you to pull rows that have been updated from a point in time, though you won't know if the data actually changed.

    CHECKSUM(), BINARY_CHECKSUM() or even HASHBYTES() are options that allow you to know whether the data in a row has actually changed. There are some caveats to the usage of these functions, so do your research to know which is best for you. This solution allows you to pull rows, or sets of columns, that have changed by comparing your KEY and checksum value from your different sources (a sketch is at the end of this post).

    Either of these methods can be optimized with indexes, which is not usually feasible with column-by-column compares.

    GL!

    Edit: Just realized I was a bit slow on posting and plenty of others suggested some of the same. Oops.
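
    A minimal sketch of the HASHBYTES() variant, assuming two daily snapshot tables dbo.TableA (yesterday) and dbo.TableB (today) keyed on CustomerID, with made-up columns Col1-Col3. Every column has to be cast to a string and NULLs handled, which is part of the caveat research mentioned above:

    -- Placeholders: dbo.TableA / dbo.TableB / CustomerID / Col1-Col3.
    -- HASHBYTES works on a single value, so the columns are concatenated with a
    -- delimiter; note the input is limited to 8000 bytes on older versions.
    select b.CustomerID
    from dbo.TableB b
    join dbo.TableA a on a.CustomerID = b.CustomerID
    where hashbytes('SHA1',
              isnull(cast(b.Col1 as nvarchar(100)), N'') + N'|' +
              isnull(cast(b.Col2 as nvarchar(100)), N'') + N'|' +
              isnull(cast(b.Col3 as nvarchar(100)), N''))
       <> hashbytes('SHA1',
              isnull(cast(a.Col1 as nvarchar(100)), N'') + N'|' +
              isnull(cast(a.Col2 as nvarchar(100)), N'') + N'|' +
              isnull(cast(a.Col3 as nvarchar(100)), N''));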

  • The trouble with checksum is that a difference in the checksum means that the rows are different, but identical checksums do not mean that they are the same.

    It is not a solution for your problem unless you do not care about 100% accuracy.

  • Are you in a position where you can add a "timestamp" column to the table you want to track changes on? (Look up Timestamp in Books Online. It has nothing to do with date and time.)

    Every time a row that has a "timestamp" column is updated, the "timestamp" value changes. That means that if you record the timestamp in a log table and join on "MainTable.ID = LogTable.ID and MainTable.timestamp > LogTable.timestamp", you will get the rows that have been updated (a sketch is below).

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi GSquared,

    Thanks for your input... I think I understand the timestamp (system-generated number) concept, but I'm not following how this would work in my case (and it may just be because I'm more used to "general querying"). Is it possible to UPDATE an existing table with another table, but programmatically *only* update the fields/columns that are different? That way, as you say, the timestamp would stay the same if no values changed in the row, and at the end of the process I could query for those rows where the timestamp had changed and create the "update" file.

    Again, here is my desired goal:

    Day1 ==> Run query to create TableA
    Day2 ==> Run query to create TableB

    Compare TableA and TableB (same field definitions). Some new customers may have come on board on Day2, therefore some new records. Some existing customers may have updated their addresses, purchased, etc.

    Any change to a customer record from Day1 to Day2 creates a record in UpdateFile. Any new customer record in Day2 also creates a record in UpdateFile.

    Sorry if this is basic stuff...

  • If you have the timestamp from table A stored (as varbinary) in table B, then you can compare those to find out which rows have been updated, and then update the rows.

    Something like:

    update TableB
    set Column1 = TableA.Column1,
        Column2 = TableA.Column2,
        ...
        TableATimestamp = TableA.Timestamp
    from TableA
    where TableB.ID = TableA.ID
        and TableB.TableATimestamp != TableA.Timestamp

    That should handle the updated rows and leave the ones that haven't been updated alone. Then it's just an Insert ... Select from TableA into TableB where the ID isn't already in TableB (a sketch of that step is below).

    There are various other versions of "upsert" (as it's called), and they all have various advantages/drawbacks, but any of them should work. Search this site and Google/MSN/Yahoo/etc., for "SQL Server upsert" and you'll find plenty of material on that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks, GSquared... I'll try this out and see if it's what I'm looking for. Sounds promising!
