May 11, 2006 at 9:57 am
I am trying to find a set-based solution to retrieve the first non-null value for each column in a time-ordered set. Cursors and UDFs are excluded for performance reasons. I considered a UDA in C#, but the IsInvariantToOrder attribute is not yet supported by the query processor, and UDAs do not support the varchar(MAX) and XML types yet.
I have audit/history tables structured like this:
Create table LogTable(
    PK int primary key clustered,
    UpdatedTS datetime NOT NULL,
    UpdatedBy varchar(30) NOT NULL,
    Col1 varchar(30) NULL,
    Col2 int NULL,
    ColN int NULL)
a) Inserts are not logged
b) only updated columns are logged
c) a deleted row is fully logged
This way inserts have no logging impact and updates only a small one. The UpdatedTS and UpdatedBy columns are also in the primary table, which minimizes IO. The usual log structure (pk, date, user, old value, new value) has too much IO overhead and must log everything as strings.
When I slice the table to retrieve a row version at a specific time, for each NULL column I need to retrieve the most recent non-null value (if they are all NULL, the value comes from the primary table, since that column was never updated).
I have been scratching my head to find a good set-based solution for this. Does anybody have a suggestion?
May 11, 2006 at 11:09 am
Something like :
Select ColName, NewValue
from dbo.LogTable LT
inner join (Select Max(PK) as PK, ColName FROM dbo.LogTable group by ColName) dtLastVersion
    on LT.PK = dtLastVersion.PK
Once you have that, have the application or calling proc figure out what data is missing and send it out to the user. Also take into consideration that a NULL can be a valid VALUE in the history, not only missing data.
That's why I always opt for a copy of the original table when I do a log. That way you can retrieve the whole row in a single step.
May 11, 2006 at 11:36 am
Thanks for replying. I am doing a delta instead of a full copy because copying the full row would kill the server. We get thousands of transactions per minute, and some tables are wide and have LOBs in them.
Your solution does not work (the way I interpret it). The PK of the row doesn't change, just the timestamp, so the MAX(PK) makes even less sense to me. I am not trying to output an OldValue/NewValue pair but a complete row for each point in time.
I've tried several approaches, but nothing truly elegant yet. My current experiment is to use a recursive CTE to fully materialize each time slice, but I am still struggling with it. A bit of a challenge to brighten the day
May 11, 2006 at 11:42 am
ok here's a modified version :
Select ColName, Value
from dbo.LogTable LT
inner join (Select Max(UpdatedTS) as UpdatedTS, ColName FROM dbo.LogTable where PK = @pk group by ColName) dtLastVersion
    on LT.UpdatedTS = dtLastVersion.UpdatedTS and LT.PK = @pk
This will give you something like :
LName = John
FName = Smith
...
You'll have to pivot that into a single row to have the latest version. But then again you might have missing columns in there because they might not have been updated at all since the insert.
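That pivot step can be sketched like this. It's a minimal sketch using Python's built-in sqlite3 as a stand-in for SQL Server, and the name/value pairs are invented to match the LName/FName example above:

```python
import sqlite3

# Hypothetical name/value pairs, shaped like the query's vertical output above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pairs (ColName TEXT, Value TEXT)")
con.executemany("INSERT INTO pairs VALUES (?, ?)",
                [("LName", "John"), ("FName", "Smith")])

# Pivot the vertical pairs into a single horizontal row:
# one MAX(CASE ...) expression per expected column.
row = con.execute("""
    SELECT MAX(CASE WHEN ColName = 'LName' THEN Value END) AS LName,
           MAX(CASE WHEN ColName = 'FName' THEN Value END) AS FName
    FROM pairs
""").fetchone()
print(row)  # -> ('John', 'Smith')
```

A column that was never updated simply has no pair and comes out NULL, which is exactly the "missing columns" caveat above.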
May 11, 2006 at 11:46 am
... and if you're trying to rebuild the whole history of the row, then I have no elegant solution to offer. You might be able to pivot the changes on the date to know all the columns that were updated in a single shot. But after that, without a starting point, you'll have to go back in time to rebuild the history backwards from the current data. I can't think of a simple solution for this one.
May 11, 2006 at 11:51 am
I don't know how you guys can work through these things without examples. I find it impossible .
RGR - I don't get what you're getting at - could you post it with some example data? I've invented some below if that helps.
lakusha - Interesting problem! Is the example data I've invented sufficient to illustrate your problem, and is the target I've given correct? If not, would you please provide an example and the target results?
I've come up with 2 ways. The first is highly inelegant and inefficient, I'd say, and the second is semi-elegant. I.e. neither is ideal!
Take a look and see what you think...
--data
declare @LogTable table ( PK int primary key clustered, UpdatedTS datetime NOT NULL,
UpdatedBy varchar(30) NOT NULL, Col1 varchar(30) NULL, Col2 int NULL, ColN int NULL)
insert @LogTable
select 1, '20060101', 'A', null, 1, null
union all select 2, '20060102', 'A', null, 2, null
union all select 3, '20060103', 'A', 'C', null, 4
union all select 4, '20060104', 'A', 'B', null, null
--target: 'B', 2, 4
--calculation (one idea)
select
(select Col1 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col1 is not null)),
(select Col2 from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where Col2 is not null)),
(select ColN from @LogTable where UpdatedTS = (select max(UpdatedTS) from @LogTable where ColN is not null))
--calculation (another idea)
declare @Col1 varchar(30), @Col2 int, @ColN int
select
@Col1 = case when @Col1 is null and Col1 is not null then Col1 else @Col1 end,
@Col2 = case when @Col2 is null and Col2 is not null then Col2 else @Col2 end,
@ColN = case when @ColN is null and ColN is not null then ColN else @ColN end
from @LogTable
order by UpdatedTS desc
select @Col1, @Col2, @ColN
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 11, 2006 at 11:55 am
Sorry, emergency time here... no more time on ssc this week...
May 11, 2006 at 12:33 pm
Sorry about the lack of examples, Ryan. I posted the table structure, but I had tried so many alternatives that it would have been confusing. Some of them worked but were, urg, inelegant
I now have something that works and seems **much** faster. I am using COALESCE in a recursive CTE. My schema is way too big to post as an example, so I will build a small example and post it if you are still interested.
Sometimes SQL is just plain fun
May 11, 2006 at 12:47 pm
Of course I'm interested!
Not that I have access to sql 2005 yet, but still, I am interested
Did the example I gave illustrate your problem?
And were the techniques of any interest? Well the 2nd one anyway...
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 13, 2006 at 4:33 am
Ryan,
I did not forget the example. While coding it I discovered a flaw in my logic (I can end up with a data anomaly in the "first" row, i.e. the one returned by the anchor of the CTE). I tried to fix it, and even tried doing the recursion backward in time instead of forward, before realizing it was foolish.
I spent Friday at 35,000 feet and had food poisoning from a sushi restaurant at the Detroit airport
I will look at the problem again Sunday.
L
May 13, 2006 at 4:06 pm
Here it goes:
/*
For performance reasons, I am using a RECVER (record version) column instead of the update timestamp.
That column exists to support optimistic concurrency across n database vendors.
It is always updated by the application, and this way I don't have to use a datetime in the PK of the log table (which is a bad practice).
We could get rid of RECVER and use UpdatedTime instead via a boundary query, but I chose to stick with RECVER.
*/
use master
GO
if exists(Select * from sys.databases where [name] = N'LoggingExample')
    Drop database LoggingExample
GO
Create database LoggingExample
GO
Use LoggingExample
GO
Create table ProductionTable(
    PK int NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    UpdatedBy varchar(20),
    UpdatedTime datetime,
    Col1 varchar(30) NOT NULL,
    Col2 varchar(30) NOT NULL,
    Col3 int NOT NULL,
    RECVER int NOT NULL)
GO
Create table LogTable(
    PK int NOT NULL,
    UpdatedBy varchar(20) NOT NULL,
    UpdatedTime datetime NOT NULL,
    Col1 varchar(30) NULL,
    Col2 varchar(30) NULL,
    Col3 int NULL,
    RECVER int,
    CONSTRAINT [PK_LogTable] PRIMARY KEY CLUSTERED ([PK] ASC, [RECVER] ASC))
GO
-- this one is used for debugging only
Create table ProofTable(
    PK int NOT NULL,
    UpdatedBy varchar(20) NOT NULL,
    UpdatedTime datetime NOT NULL,
    Col1 varchar(30) NULL,
    Col2 varchar(30) NULL,
    Col3 int NULL,
    RECVER int,
    CONSTRAINT [PK_ProofTable] PRIMARY KEY CLUSTERED ([PK] ASC, [RECVER] ASC))
GO
/*
Two triggers to audit the table:
only updates and deletes are tracked
we log only the delta (columns that have changed) in the log table
we log everything in the ProofTable to help debug
*/
Create Trigger TRINS_ProductionTable on [ProductionTable]
For Update
as
SET NOCOUNT ON
Insert into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select I.PK,
    D.UpdatedBy,
    D.UpdatedTime,
    NULLIF(D.Col1, I.Col1),
    NULLIF(D.Col2, I.Col2),
    NULLIF(D.Col3, I.Col3),
    D.RECVER
From Inserted as [I]
inner join Deleted as [D] on I.PK = D.PK;
-- Now log a second time to help testing
Insert into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
GO
Create Trigger TRIDEL_ProductionTable on [ProductionTable]
For Delete
as
SET NOCOUNT ON
Insert into LogTable(PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
-- Now log a second time to help testing
Insert into [ProofTable](PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER From Deleted;
GO
--TRUNCATE TABLE [ProductionTable]
--TRUNCATE TABLE [LogTable]
--TRUNCATE TABLE [ProofTable]
/*
Insert a row and modify it a few times
real test is done with a few million rows
*/
Declare @pk int
Insert into ProductionTable (UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER)
Select 'Bob', getdate(), 'First Value', 'Some thing...', 77, 1
Select @pk = SCOPE_IDENTITY()

Update ProductionTable Set
    UpdatedBy = 'John',
    UpdatedTime = dateadd(mi,10,getdate()),
    Col1 = 'First Value',
    Col2 = 'Some thing...',
    Col3 = 78, -- CHANGE
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy2',
    UpdatedTime = dateadd(mi,20,getdate()),
    Col1 = 'Second Value', -- CHANGE
    Col2 = 'Something nice...', -- CHANGE
    Col3 = 78,
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy3',
    UpdatedTime = dateadd(mi,30,getdate()),
    Col1 = 'Second Value',
    Col2 = 'Something nice...',
    Col3 = 78,
    RECVER = RECVER + 1
Where PK = @pk

Update ProductionTable Set
    UpdatedBy = 'Billy4',
    UpdatedTime = dateadd(mi,40,getdate()),
    Col1 = 'Second Value',
    Col2 = 'Something nice...',
    Col3 = 79, -- CHANGE
    RECVER = RECVER + 1
Where PK = @pk
GO
----------------------physical state of the log------------------------
--- Notice the NULLs we want to fill
-- you can visualize it by playing Tetris with the resultset:
-- we must push the values down in the NULL cells
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From ProductionTable
UNION ALL
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From LogTable
Order By RECVER desc

---------------------Log we want to expose--------------------
Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
From ProductionTable
UNION ALL
SELECT * FROM [ProofTable]
Order By RECVER desc

-------------------- Backward in time ------------------------
-- COALESCE vertically by pushing values down the stack
-- CTEs can be changed into VIEWS if necessary
With TimeLog as
    (Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From ProductionTable
     UNION ALL
     Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From LogTable),
TimeSlice as
    (Select PK, UpdatedBy, UpdatedTime, Col1, Col2, Col3, RECVER
     From TimeLog as [t1]
     Where RECVER = (Select MAX(RECVER) From TimeLog t2 where t2.PK = t1.PK)
     UNION ALL
     Select TimeSlice.PK, L.UpdatedBy, L.UpdatedTime,
         COALESCE(L.Col1, TimeSlice.Col1),
         COALESCE(L.Col2, TimeSlice.Col2),
         COALESCE(L.Col3, TimeSlice.Col3),
         L.RECVER
     From TimeLog as [L]
     join TimeSlice on L.PK = TimeSlice.PK and L.RECVER = (TimeSlice.RECVER - 1))
Select TS.PK, TS.UpdatedBy, TS.UpdatedTime,
    COALESCE(TS.Col1, TS2.Col1) as [Col1],
    COALESCE(TS.Col2, TS2.Col2) as [Col2],
    COALESCE(TS.Col3, TS2.Col3) as [Col3],
    TS.RECVER
from TimeSlice TS
left join TimeSlice TS2 on TS.PK = TS2.PK and TS.RECVER = (TS2.RECVER - 1)
    and TS2.RECVER = 2 and TS.RECVER = 1
Order By RECVER desc;

May 15, 2006 at 6:10 am
Thanks lakusha. I'll have to wait till we get SQL 2005 to be able to work through this, but it makes interesting reading in the meantime!
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
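In the meantime, the backward-in-time COALESCE recursion above can be sketched with SQLite's recursive CTEs through Python's built-in sqlite3. This is a simplified two-column version of lakusha's schema with invented data, and SQLite syntax differs slightly from SQL 2005's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# TimeLog = current production row UNION the delta log.
# A NULL in a log row means "this column did not change in that update",
# so its value must be pulled down from the next newer version.
con.executescript("""
CREATE TABLE TimeLog (PK INT, RECVER INT, Col1 TEXT, Col2 INT);
INSERT INTO TimeLog VALUES
    (1, 4, 'Second Value', 79),   -- current row from ProductionTable
    (1, 3, NULL,           78),   -- only Col2 changed in update 3 -> 4
    (1, 2, 'First Value',  NULL), -- only Col1 changed in update 2 -> 3
    (1, 1, NULL,           77);   -- only Col2 changed in update 1 -> 2
""")

rows = con.execute("""
WITH RECURSIVE TimeSlice(PK, RECVER, Col1, Col2) AS (
    -- anchor: the newest, fully populated version of each row
    SELECT PK, RECVER, Col1, Col2 FROM TimeLog t1
    WHERE RECVER = (SELECT MAX(RECVER) FROM TimeLog t2 WHERE t2.PK = t1.PK)
    UNION ALL
    -- step one version back in time, filling NULLs from the newer slice
    SELECT L.PK, L.RECVER,
           COALESCE(L.Col1, TimeSlice.Col1),
           COALESCE(L.Col2, TimeSlice.Col2)
    FROM TimeLog L
    JOIN TimeSlice ON L.PK = TimeSlice.PK AND L.RECVER = TimeSlice.RECVER - 1
)
SELECT PK, RECVER, Col1, Col2 FROM TimeSlice ORDER BY RECVER DESC
""").fetchall()

for r in rows:
    print(r)  # every version fully materialized, newest to oldest
```

Each recursive step walks one RECVER back and plays the "Tetris" move: a logged value wins, a NULL inherits the newer slice's value.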
May 15, 2006 at 7:26 am
Care to expand on this idea???
"...I don't have to use a datetime in the PK of the log table (which is a bad practice)"
May 15, 2006 at 12:15 pm
Datetime is an imprecise datatype. You always have +/-3 ms of rounding.
Try this:
Select cast('2006-05-15 23:59:59.999' as datetime)
And it happens more often than you think. The last digit is always rounded by SQL Server to 0, 3 or 7. That means the data you read can differ from what you inserted.
It also means that two rows inserted sequentially can end up with the exact same value. In a PK that means a key violation, and in application logic it can screw up the sequence of events in your log.
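The 0/3/7 pattern follows from datetime's documented resolution of 1/300 of a second (about 3.33 ms). A quick sketch of that arithmetic in Python; the helper below is illustrative, not SQL Server's actual implementation:

```python
# SQL Server's datetime stores the time of day in ticks of 1/300 second.
# Map a millisecond value to its nearest tick and back to see what survives.
def datetime_ms(ms):
    ticks = round(ms * 300 / 1000)    # nearest representable tick
    return round(ticks * 1000 / 300)  # millisecond value you get back

# Only 0, 3 and 7 can appear as the last millisecond digit...
digits = sorted({datetime_ms(ms) % 10 for ms in range(999)})
print(digits)            # [0, 3, 7]

# ...and 999 ms rounds UP to the next full second, which is why
# '2006-05-15 23:59:59.999' lands on midnight of May 16.
print(datetime_ms(999))  # 1000
```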
May 15, 2006 at 2:13 pm
Now that you are looking for performance, allow me to also say that your triggers should check for different values and only then insert the change in the log.
The root of the PK issue is that the PK of the log table should be a TRANSACTIONID instead of PK and datetime. It is, as you've already discovered, not a good idea.
In my opinion all this pivoting should be done client side instead, but if you managed to do it with a CTE it should be fine
Cheers,
* Noel