April 13, 2010 at 2:27 am
I would not be surprised if this has come up, one way or another, a million times; it seems like a fairly fundamental problem. Is there a standardised process for periodically moving copies of tables from point A to point B, where point B wants to keep an audit trail of all the changes to the table between refreshes? For the purposes of this exercise, assume that point B has no capacity to alter tables or data at point A.
To take an oversimplified example, say we have a table of four columns:
CREATE TABLE dbo.sample_tbl(
pk int NOT NULL,
fk int NOT NULL,
quantity numeric(8, 2) NOT NULL,
activity_date datetime NOT NULL,
CONSTRAINT PK_sample_tbl PRIMARY KEY CLUSTERED
(pk ASC) ON [PRIMARY]) ON [PRIMARY]
The first time it gets sent from point A to point B it contains three rows:
1, 1, 23.5, 20091230 10:00
2, 1, 25.5, 20091230 10:00
3, 1, 28.5, 20091230 10:00
The second time it looks like this:
1, 1, 23.5, 20091230 10:00
2, 1, 25.5, 20091230 10:00
3, 1, 28.5, 20091230 10:00
4, 1, 23.5, 20091231 10:00
Third time:
2, 1, 10.5, 20091230 10:00
3, 2, 28.5, 20091230 10:00
4, 1, 23.5, 20091231 12:00
What I've seen smarter people than me do is set up the table at point B with two extra columns: a change datetime and a change indicator (flagging INSERT, UPDATE, or DELETE). Each incoming version of a row is inserted, provided it differs from the version sent previously.
CREATE TABLE dbo.sample_tbl_at_point_b(
pk int NOT NULL,
fk int NOT NULL,
quantity numeric(8, 2) NOT NULL,
activity_date datetime NOT NULL,
change_date datetime NOT NULL,
change_flag char(1) NOT NULL,
CONSTRAINT PK_sample_tbl_at_point_b PRIMARY KEY CLUSTERED
(pk ASC, change_date ASC) ON [PRIMARY]) ON [PRIMARY]
So after the first transmission, the data in dbo.sample_tbl_at_point_b looks like:
1, 1, 23.5, 20091230 10:00, 20100101 20:10, I
2, 1, 25.5, 20091230 10:00, 20100101 20:10, I
3, 1, 28.5, 20091230 10:00, 20100101 20:10, I
All the incoming rows are dated and flagged as inserts because they’re all new.
After the second:
1, 1, 23.5, 20091230 10:00, 20100101 20:10, I
2, 1, 25.5, 20091230 10:00, 20100101 20:10, I
3, 1, 28.5, 20091230 10:00, 20100101 20:10, I
4, 1, 23.5, 20091231 10:00, 20100102 18:10, I
Even though four rows were sent, only one (pk 4) is new, so it alone is inserted and flagged as an INSERT; the other three are identical to what's already there and are ignored.
After the third transmission:
1, 1, 23.5, 20091230 10:00, 20100101 20:10, I
1, 1, 23.5, 20091230 10:00, 20100103 19:10, D
2, 1, 25.5, 20091230 10:00, 20100101 20:10, I
2, 1, 10.5, 20091230 10:00, 20100103 19:10, U
3, 1, 28.5, 20091230 10:00, 20100101 20:10, I
3, 2, 28.5, 20091230 10:00, 20100103 19:10, U
4, 1, 23.5, 20091231 10:00, 20100102 18:10, I
4, 1, 23.5, 20091231 12:00, 20100103 19:10, U
Rows are only inserted where something has changed: pk 1 disappeared from the source, so a DELETE row records its last known values; pks 2, 3 and 4 each changed in one column, so each gets an UPDATE row carrying the new values; identical rows are simply ignored.
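Expressed in T-SQL, the load step might look something like this. It is only a sketch of the approach, and it assumes the incoming snapshot has first been bulk-loaded into a staging table at point B (dbo.sample_tbl_staging is an invented name):

declare @now datetime;
set @now = getdate();

;with latest as
(
    -- the most recent non-deleted version of each pk held at point B
    select b.pk, b.fk, b.quantity, b.activity_date
    from dbo.sample_tbl_at_point_b b
    join (select pk, max(change_date) as change_date
          from dbo.sample_tbl_at_point_b
          group by pk) m
        on b.pk = m.pk and b.change_date = m.change_date
    where b.change_flag <> 'D'
)
insert into dbo.sample_tbl_at_point_b
    (pk, fk, quantity, activity_date, change_date, change_flag)
-- brand-new rows: in the snapshot, unknown at point B
select s.pk, s.fk, s.quantity, s.activity_date, @now, 'I'
from dbo.sample_tbl_staging s
left join latest l on s.pk = l.pk
where l.pk is null
union all
-- changed rows: on both sides, but a non-PK column differs
-- (all columns here are NOT NULL; nullable columns would need extra care)
select s.pk, s.fk, s.quantity, s.activity_date, @now, 'U'
from dbo.sample_tbl_staging s
join latest l on s.pk = l.pk
where s.fk <> l.fk
   or s.quantity <> l.quantity
   or s.activity_date <> l.activity_date
union all
-- vanished rows: current at point B, missing from the snapshot
select l.pk, l.fk, l.quantity, l.activity_date, @now, 'D'
from latest l
left join dbo.sample_tbl_staging s on l.pk = s.pk
where s.pk is null;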
For an end user to get at the current version of point A's table, they might use a view like this:
;with latest_vers as
(
    select
        pk
        ,max(change_date) as change_date
    from
        dbo.sample_tbl_at_point_b
    group by
        pk
)
select
    sam.pk
    ,sam.fk
    ,sam.quantity
    ,sam.activity_date
from
    dbo.sample_tbl_at_point_b sam
join latest_vers lat
    on sam.pk = lat.pk
    and sam.change_date = lat.change_date
where
    sam.change_flag <> 'D';
What this process should theoretically enable people to do is recreate sample_tbl as it appeared at any given point in time, so that when we run a report today about activity last year, we can explain exactly why we might get different answers from the ones we got back then. And we can do so without extravagant duplication of data.
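To illustrate, a point-in-time query might look like this (the @as_of variable and its value are just mine; any timestamp between transmissions works):

declare @as_of datetime;
set @as_of = '20100102 00:00';

with vers_at as
(
    select
        pk
        ,max(change_date) as change_date
    from
        dbo.sample_tbl_at_point_b
    where
        change_date <= @as_of
    group by
        pk
)
select
    sam.pk
    ,sam.fk
    ,sam.quantity
    ,sam.activity_date
from
    dbo.sample_tbl_at_point_b sam
join vers_at v
    on sam.pk = v.pk
    and sam.change_date = v.change_date
where
    sam.change_flag <> 'D';

Run with the @as_of shown, it returns the three rows of the first transmission; move it past 20100103 19:10 and pk 1 drops out while pks 2, 3 and 4 show their latest values.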
This is an awfully long-winded way to get around to my questions. In my real-world (i.e. non-simplified) situation we are dealing with tens of millions of rows and, say, twenty-five columns. More than 99% of the rows in any given transmission from sample_tbl will be identical to those already in sample_tbl_at_point_b. Is there an established best practice for quickly identifying the INSERTs, UPDATEs, and DELETEs? The part I understand least is how to efficiently pick out the differences in the non-PK columns of a given row. As I understand it (which is to say, not very much), CHECKSUM is no good because you can get the same CHECKSUM result from different data (or maybe I don't understand CHECKSUM properly). I'd be grateful for pointers on how other people have dealt with this situation.
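For what it's worth, here is the kind of CHECKSUM collision that worries me, and the HASHBYTES alternative I've been wondering about (SHA1 is available on SQL Server 2005/2008; dbo.sample_tbl_staging is the same invented staging table as above):

-- CHECKSUM follows collation rules, so under a case-insensitive
-- collation these two return the same value and a real change slips by
select checksum('point b data'), checksum('POINT B DATA');

-- HASHBYTES hashes the raw bytes instead, so these two differ
select hashbytes('SHA1', 'point b data'), hashbytes('SHA1', 'POINT B DATA');

-- one way to fingerprint a whole row: concatenate the non-PK columns
-- with a delimiter (so '1' + '23' cannot collide with '12' + '3')
-- and hash the result
select
    pk
    ,hashbytes('SHA1',
        cast(fk as varchar(12)) + '|'
        + cast(quantity as varchar(20)) + '|'
        + convert(varchar(23), activity_date, 121)) as row_fingerprint
from
    dbo.sample_tbl_staging;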
April 13, 2010 at 3:24 am
The simplest way to know if an entire row has been updated, inserted or deleted is by using a creation_datetime column and a delete_indicator (or a current_indicator, whatever works for you).
It's a couple of extra columns, and they are not hard to maintain.
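Concretely, that gives the point-B copy a shape something like this, with one row per version of a pk, stamped on arrival, and deletes flagged rather than physically removed (names are just illustrative):

CREATE TABLE dbo.sample_tbl_point_b_simple(
    pk int NOT NULL,
    fk int NOT NULL,
    quantity numeric(8, 2) NOT NULL,
    activity_date datetime NOT NULL,
    creation_datetime datetime NOT NULL,  -- when this version arrived
    delete_indicator char(1) NOT NULL,    -- 'Y' once the pk stops arriving
    CONSTRAINT PK_sample_tbl_point_b_simple PRIMARY KEY CLUSTERED
    (pk ASC, creation_datetime ASC));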
However, if you want to know not just that a row has been updated but also which column was updated, then things get complicated. A checksum is certainly not the right solution in that case, as it only tells you that the row has changed, with no way of finding out which column. Here I would implement the table as a Kimball slowly changing dimension (SCD), with a Start_Date and an End_Date (actually, the End_Date is not necessary, as this blog article points out: http://sqlblog.com/blogs/jamie_thomson/archive/2009/11/28/debunking-kimball-effective-dates.aspx). Sure, you'll have a new row for each update, but there has to be some cost when you want to know everything...
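A rough sketch of that shape, following the Start_Date-only approach from the linked article (table and column names are illustrative; ROW_NUMBER needs SQL Server 2005 or later):

CREATE TABLE dbo.sample_scd(
    pk int NOT NULL,
    fk int NOT NULL,
    quantity numeric(8, 2) NOT NULL,
    activity_date datetime NOT NULL,
    start_date datetime NOT NULL,
    CONSTRAINT PK_sample_scd PRIMARY KEY CLUSTERED
    (pk ASC, start_date ASC));

-- without an End_Date, the "current" version of each pk is simply
-- the one with the newest Start_Date
select pk, fk, quantity, activity_date
from (
    select *,
        row_number() over (partition by pk order by start_date desc) as rn
    from dbo.sample_scd
) v
where rn = 1;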
Need an answer? No, you need a question
My blog at https://sqlkover.com.
MCSE Business Intelligence - Microsoft Data Platform MVP
April 13, 2010 at 5:25 am
"...is by using a creation_datetime column and a delete_indicator..."
Except that, as indicated earlier, point B (where they want to know what's changed and roughly when) has no capacity to alter anything at point A, so there's no chance of getting a create date and delete indicator added there. Or I've completely misunderstood you! Thanks for the link. I'll have a look and see what it reveals.
Cheers.