May 11, 2006 at 9:57 am
I am trying to find a set-based solution to retrieve the first non-null value for each column in a time-ordered set. Cursors and UDFs are excluded for performance reasons. I considered a UDA in C#, but the IsInvariantToOrder attribute is not yet supported by the query processor, and UDAs do not support the varchar(MAX) and XML types yet.
I have audit/history tables structured like this:
Create table LogTable(
    PK int primary key clustered,
    UpdatedTS datetime NOT NULL,
    UpdatedBy varchar(30) NOT NULL,
    Col1 varchar(30) NULL,
    Col2 int NULL,
    ColN int NULL)
a) Inserts are not logged
b) only updated columns are logged
c) a deleted row is fully logged
This way inserts have no logging impact and updates only a small one. The UpdatedTS and UpdatedBy columns are also in the primary table, which minimizes IO. The usual log structure (pk, date, user, old value, new value) has too much IO overhead and must log everything as strings.
When I slice the table to retrieve a row version at a specific time, for each NULL column I need to retrieve the most recent non-null value (if they are all NULL, the value comes from the primary table, since that column was never updated).
I have been scratching my head to find a good set-based solution for this. Does anybody have a suggestion?
May 11, 2006 at 11:09 am
Something like :
Select ColName, NewValue
from dbo.LogTable LT
inner join (Select Max(PK) as PK, ColName FROM dbo.LogTable group by ColName) dtLastVersion
    on LT.PK = dtLastVersion.PK
Once you have that, have the application or calling proc figure out what data is missing and send it out to the user. Also take into consideration that a NULL can be a valid VALUE in the history, not only missing data.
That's why I always opt for a copy of the original table when I do a log. That way you can retrieve the whole row in a single step.
May 11, 2006 at 11:36 am
Thanks for replying. I am doing a delta instead of a full copy because copying the full row would kill the server. We get thousands of transactions per minute, and some tables are wide and have LOBs in them.
Your solution does not work (the way I interpret it). The PK of the row doesn't change, just the timestamp, so the MAX(PK) makes even less sense to me. I am not trying to output an OldValue/NewValue pair but a complete row for each point in time.
I've tried several approaches, but nothing truly elegant yet. My current experiment is to use a recursive CTE to fully materialize each time slice, but I am still struggling with it. A bit of a challenge to brighten the day
May 11, 2006 at 11:42 am
ok here's a modified version :
Select ColName, Value
from dbo.LogTable LT
inner join (Select Max(UpdatedTS) as UpdatedTS, ColName FROM dbo.LogTable where PK = @pk group by ColName) dtLastVersion
    on LT.UpdatedTS = dtLastVersion.UpdatedTS and LT.PK = @pk
This will give you something like :
LName = John
FName = Smith
...
You'll have to pivot that into a single row to have the latest version. But then again you might have missing columns in there because they might not have been updated at all since the insert.
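That pivot step can be sketched like this. It's a minimal sketch using Python's built-in sqlite3 as a stand-in for SQL Server, and the name/value pairs are invented to match the LName/FName example above:

```python
import sqlite3

# Hypothetical name/value pairs, shaped like the query's vertical output above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pairs (ColName TEXT, Value TEXT)")
con.executemany("INSERT INTO pairs VALUES (?, ?)",
                [("LName", "John"), ("FName", "Smith")])

# Pivot the vertical pairs into a single horizontal row:
# one MAX(CASE ...) expression per expected column.
row = con.execute("""
    SELECT MAX(CASE WHEN ColName = 'LName' THEN Value END) AS LName,
           MAX(CASE WHEN ColName = 'FName' THEN Value END) AS FName
    FROM pairs
""").fetchone()
print(row)  # -> ('John', 'Smith')
```

A column that was never updated simply has no pair and comes out NULL, which is exactly the "missing columns" caveat above.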
May 11, 2006 at 11:46 am
... and if you're trying to rebuild the whole history of the row, then I have no elegant solution to offer. You might be able to pivot the changes on the date to know all the columns that were updated in a single shot. But after that, without a starting point, you'll have to go back in time to rebuild the history backwards from the current data. I can't think of a simple solution for this one.
May 11, 2006 at 11:51 am
I don't know how you guys can work through these things without examples. I find it impossible .
RGR - I don't get what you're getting at - could you post it with some example data? I've invented some below if that helps.
lakusha - Interesting problem! Is the example data I've invented sufficient to illustrate your problem, and is the target I've given correct? If not, would you please provide an example and the target results?
I've come up with 2 ways. The first is highly inelegant and inefficient, I'd say, and the second is semi-elegant. I.e. neither is ideal!
Take a look and see what you think...
--data
declare @LogTable table ( PK int primary key clustered, UpdatedTS datetime NOT NULL,
UpdatedBy varchar(30) NOT NULL, Col1 varchar(30) NULL, Col2 int NULL, ColN int NULL)
insert @LogTable
select 1, '20060101', 'A', null, 1, null
union all select 2, '20060102', 'A', null, 2, null
union all select 3, '20060103', 'A', 'C', null, 4
union all select 4, '20060104', 'A', 'B', null, null
--target: 'B', 2, 4
--calculation (one idea)
select
(select Col1 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col1 is not null)),
(select Col2 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col2 is not null)),
(select ColN from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where ColN is not null))
--calculation (another idea)
declare @Col1 varchar(30), @Col2 int, @ColN int
select
@Col1 = case when @Col1 is null and Col1 is not null then Col1 else @Col1 end,
@Col2 = case when @Col2 is null and Col2 is not null then Col2 else @Col2 end,
@ColN = case when @ColN is null and ColN is not null then ColN else @ColN end
from @LogTable
order by UpdatedTS desc
select @Col1, @Col2, @ColN
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 11, 2006 at 11:55 am
Sorry, emergency time here... no more time on ssc this week...
May 11, 2006 at 12:33 pm
Sorry about the lack of examples, Ryan. I posted the table structure, but I had tried so many alternatives that it would have been confusing. Some of them worked but were, urg, inelegant
I now have something that works and seems **much** faster. I am using COALESCE in a recursive CTE. My schema is way too big to post as an example, so I will build a small example and post it if you are still interested.
Sometimes SQL is just plain fun
May 11, 2006 at 12:47 pm
Of course I'm interested!
Not that I have access to sql 2005 yet, but still, I am interested
Did the example I gave illustrate your problem?
And were the techniques of any interest? Well the 2nd one anyway...
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 13, 2006 at 4:33 am
Ryan,
I did not forget the example. While coding it I discovered a flaw in my logic (I can end up with a data anomaly in the "first" row, i.e. the one returned by the anchor of the CTE). I tried to fix it, and even tried doing the recursion backward in time instead of forward, before realizing it was foolish.
I spent Friday at 35,000 feet and had food poisoning from a sushi restaurant at the Detroit airport
I will look at the problem again Sunday.
L
May 13, 2006 at 4:06 pm
Here it goes:
/*
For performance reasons, I am using a RECVER (record version) column instead of the update timestamp.
That column exists to support optimistic concurrency across n database vendors.
It is always updated by the application, and this way I don't have to use a datetime in the PK of the log table (which is a bad practice).
We could get rid of RECVER and use UpdatedTime instead via a boundary query, but I chose to stick with RECVER.
*/
use master
GO
if exists(Select * from sys.databases where [name] = N'LoggingExample')
    Drop database LoggingExample
GO
Create database LoggingExample
GO
Use LoggingExample
GO
Create table ProductionTable(
    PK int NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    UpdatedBy varchar(20),
    UpdatedTime datetime,
    Col1 varchar(30) NOT NULL,
    Col2 varchar(30) NOT NULL,
    Col3 int NOT NULL,
    RECVER int NOT NULL)
GO
Create table LogTable(
    PK int NOT NULL,
    UpdatedBy varchar(20) NOT NULL,
    UpdatedTime datetime NOT NULL,
    Col1 varchar(30) NULL,
    Col2 varchar(30) NULL,
    Col3 int NULL,
    RECVER int,
    CONSTRAINT [PK_LogTable] PRIMARY KEY CLUSTERED ([PK] ASC, [RECVER] ASC))
GO
-- this one is used for debugging only
Create table ProofTable(
    PK int NOT NULL,
    UpdatedBy varchar(20) NOT NULL,
    UpdatedTime datetime NOT NULL,
    Col1 varchar(30) NULL,
    Col2 varchar(30) NULL,
    Col3 int NULL,
    RECVER int,
    CONSTRAINT [PK_ProofTable] PRIMARY KEY CLUSTERED ([PK] ASC, [RECVER] ASC))
GO
/*
Two triggers to audit the table:
only updates and deletes are tracked
we log only the delta (columns that have changed) in the log table
we log everything in the ProofTable to help debug
*/
Create Trigger TRINS_ProductionTable on [ProductionTable]
For Update
as
SET NOCOUNT ON
Insert into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select I.PK,
    D.UpdatedBy,
    D.UpdatedTime,
    NULLIF(D.Col1, I.Col1),
    NULLIF(D.Col2, I.Col2),
    NULLIF(D.Col3, I.Col3),
    D.RECVER
From Inserted as [I]
inner join Deleted as [D] on I.PK = D.PK;
-- Now log a second time to help testing
Insert into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
GO
Create Trigger TRIDEL_ProductionTable on [ProductionTable]
For Delete
as
SET NOCOUNT ON
Insert into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
-- Now log a second time to help testing
Insert into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
GO
--TRUNCATE TABLE [ProductionTable]
--TRUNCATE TABLE [LogTable]
--TRUNCATE TABLE [ProofTable]
/*
Insert a row and modify it a few times
real test is done with a few million rows
*/
Declare @pk int
Insert into ProductionTable (UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select 'Bob', getdate(), 'First Value', 'Some thing...', 77, 1
Select @pk = SCOPE_IDENTITY()

Update ProductionTable Set
    UpdatedBy = 'John',
    UpdatedTime = dateadd(mi,10,getdate()),
    Col1 = 'First Value',
    Col2 = 'Some thing...',
    Col3 = 78, -- CHANGE
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy2',
    UpdatedTime = dateadd(mi,20,getdate()),
    Col1 = 'Second Value', -- CHANGE
    Col2 = 'Something nice...', -- CHANGE
    Col3 = 78,
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy3',
    UpdatedTime = dateadd(mi,30,getdate()),
    Col1 = 'Second Value',
    Col2 = 'Something nice...',
    Col3 = 78,
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy4',
    UpdatedTime = dateadd(mi,40,getdate()),
    Col1 = 'Second Value',
    Col2 = 'Something nice...',
    Col3 = 79, -- CHANGE
    RECVER = RECVER + 1
Where PK = @pk
GO
----------------------physical state of the log------------------------
--- Notice the NULLs we want to fill
-- you can visualize it by playing Tetris with the resultset:
-- we must push the values down in the NULL cells
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From ProductionTable
UNION ALL
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From LogTable
Order By RECVER desc

---------------------Log we want to expose--------------------
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From ProductionTable
UNION ALL
SELECT * FROM [ProofTable]
Order By RECVER desc

-------------------- Backward in time ------------------------
-- COALESCE vertically by pushing values down the stack
-- CTEs can be changed into VIEWS if necessary
With TimeLog as
    (Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From ProductionTable
     UNION ALL
     Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From LogTable),
TimeSlice as
    (Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From TimeLog as [t1]
     Where RECVER = (Select MAX(RECVER) From TimeLog t2 where t2.PK = t1.PK)
     UNION ALL
     Select TimeSlice.PK, L.UpdatedBy, L.UpdatedTime,
         COALESCE(L.Col1, TimeSlice.Col1),
         COALESCE(L.Col2, TimeSlice.Col2),
         COALESCE(L.Col3, TimeSlice.Col3),
         L.RECVER
     From TimeLog as [L]
     join TimeSlice on L.PK = TimeSlice.PK and L.RECVER = (TimeSlice.RECVER - 1))
Select TS.PK, TS.UpdatedBy, TS.UpdatedTime,
    COALESCE(TS.Col1, TS2.Col1) as [Col1],
    COALESCE(TS.Col2, TS2.Col2) as [Col2],
    COALESCE(TS.Col3, TS2.Col3) as [Col3],
    TS.RECVER
from TimeSlice TS
left join TimeSlice TS2 on TS.PK = TS2.PK and TS.RECVER = (TS2.RECVER - 1)
    and TS2.RECVER = 2 and TS.RECVER = 1
Order By RECVER desc;

May 15, 2006 at 6:10 am
Thanks lakusha. I'll have to wait till we get SQL 2005 to be able to work through this, but it makes interesting reading in the meantime!
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
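In the meantime, the backward-in-time COALESCE recursion above can be sketched with SQLite's recursive CTEs through Python's built-in sqlite3. This is a simplified two-column version of lakusha's schema with invented data, and SQLite syntax differs slightly from SQL 2005's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# TimeLog = current production row UNION the delta log.
# A NULL in a log row means "this column did not change in that update",
# so its value must be pulled down from the next newer version.
con.executescript("""
CREATE TABLE TimeLog (PK INT, RECVER INT, Col1 TEXT, Col2 INT);
INSERT INTO TimeLog VALUES
    (1, 4, 'Second Value', 79),   -- current row from ProductionTable
    (1, 3, NULL,           78),   -- only Col2 changed in update 3 -> 4
    (1, 2, 'First Value',  NULL), -- only Col1 changed in update 2 -> 3
    (1, 1, NULL,           77);   -- only Col2 changed in update 1 -> 2
""")

rows = con.execute("""
WITH RECURSIVE TimeSlice(PK, RECVER, Col1, Col2) AS (
    -- anchor: the newest, fully populated version of each row
    SELECT PK, RECVER, Col1, Col2 FROM TimeLog t1
    WHERE RECVER = (SELECT MAX(RECVER) FROM TimeLog t2 WHERE t2.PK = t1.PK)
    UNION ALL
    -- step one version back in time, filling NULLs from the newer slice
    SELECT L.PK, L.RECVER,
           COALESCE(L.Col1, TimeSlice.Col1),
           COALESCE(L.Col2, TimeSlice.Col2)
    FROM TimeLog L
    JOIN TimeSlice ON L.PK = TimeSlice.PK AND L.RECVER = TimeSlice.RECVER - 1
)
SELECT PK, RECVER, Col1, Col2 FROM TimeSlice ORDER BY RECVER DESC
""").fetchall()

for r in rows:
    print(r)  # every version fully materialized, newest to oldest
```

Each recursive step walks one RECVER back and plays the "Tetris" move: a logged value wins, a NULL inherits the newer slice's value.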
May 15, 2006 at 7:26 am
Care to expand on this idea???
"...I don't have to use a datetime in the PK of the log table (which is a bad practice)"
May 15, 2006 at 12:15 pm
Datetime is an imprecise datatype. You always have +/-3 ms of rounding.
Try this:
Select cast('2006-05-15 23:59:59.999' as datetime)
And it happens more often than you think. The last digit is always rounded by SQL Server to 0, 3 or 7. That means the data you read can differ from what you inserted.
It also means that two rows inserted sequentially can end up with the exact same value. In a PK that means a key violation, and in application logic it can screw up the sequence of events in your log.
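The 0/3/7 pattern follows from datetime's documented resolution of 1/300 of a second (about 3.33 ms). A quick sketch of that arithmetic in Python; the helper below is illustrative, not SQL Server's actual implementation:

```python
# SQL Server's datetime stores the time of day in ticks of 1/300 second.
# Map a millisecond value to its nearest tick and back to see what survives.
def datetime_ms(ms):
    ticks = round(ms * 300 / 1000)    # nearest representable tick
    return round(ticks * 1000 / 300)  # millisecond value you get back

# Only 0, 3 and 7 can appear as the last millisecond digit...
digits = sorted({datetime_ms(ms) % 10 for ms in range(999)})
print(digits)            # [0, 3, 7]

# ...and 999 ms rounds UP to the next full second, which is why
# '2006-05-15 23:59:59.999' lands on midnight of May 16.
print(datetime_ms(999))  # 1000
```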
May 15, 2006 at 2:13 pm
Now that you are looking for performance, allow me to also say that your triggers should check for different values and only then insert the change in the log.
The root of the PK issue is that the PK of the log table should be a TRANSACTIONID instead of PK and datetime. It is, as you've already discovered, not a good idea.
In my opinion all this pivoting should be done client side instead, but if you managed to do it with a CTE it should be fine
Cheers,
* Noel