Update current period data based on previous period data

  • Paul White NZ (4/6/2010)


    yingchai (4/6/2010)


    I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. That should be no problem, right?

    Given correct indexing, it will be fine.

    Hi Paul,

    Can you elaborate on that, or point me to some articles on the subject?

    Currently, all my fact tables do not have primary keys, and I'm not too sure whether they're needed for indexing...

  • yingchai (4/6/2010)


    Currently, all my fact tables do not have primary keys...

    Your fact table should always have a Primary Key - whether it is a surrogate key, or based on your dimension foreign keys. Anyway, that's a different discussion.
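
    For example, a minimal sketch of enforcing that on a fact table keyed by its dimension foreign keys (the table name dbo.FactCost is illustrative, not from your schema):

    ALTER TABLE dbo.FactCost
    ADD CONSTRAINT PK_FactCost
        PRIMARY KEY CLUSTERED (product_id, entity_id, period);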

    As far as indexing on your UPDATE query is concerned:

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     INTEGER NOT NULL,
        cost       SMALLMONEY NULL
    );
    GO
    -- Sample data: NULL costs must be filled from a prior period
    INSERT Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', 20090101, 100  UNION ALL
    SELECT 'CMP1000', 'PM1', 20090201, 80   UNION ALL
    SELECT 'CMP1000', 'PM1', 20090301, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090401, 110  UNION ALL
    SELECT 'CMP1000', 'PM1', 20090501, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090601, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090701, NULL;

    -- Index required for good performance
    CREATE UNIQUE NONCLUSTERED INDEX nc1
    ON dbo.Sample1 (product_id, entity_id, period DESC)
    INCLUDE (cost);
    GO
    -- Fill each NULL cost from the latest earlier period that has a non-NULL cost
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.cost IS NOT NULL
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.cost IS NULL;
    GO

  • yingchai (4/6/2010)


    I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. That should be no problem, right?

    Heh... if you say so. Me? I have a real problem with just 10,000 rows causing a quarter of a million reads, taking 36 seconds to execute without proper indexing and 4 seconds even with it, on a 4-processor, 8 GB system. It's called "technical debt", and every time I've seen someone build in that kind of technical debt, they've found some way to regret it in the very near future.

    4 seconds per 10,000 rows is the worst-case scenario of having to smear a single row of data over 10,000 other rows, and the problem grows quadratically ((N²+N)/2). Maybe you'll get lucky and the worst case won't actually happen (and there's a good chance in this scenario that it won't), but it's still a "chance". Maybe you'll get lucky and no one will copy the code and use its essence on something where the worst case will happen.
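
    As a quick sanity check on that formula (a sketch, just plugging in N = 10,000):

    -- Worst-case rows touched by a triangular join over N rows: (N*N + N) / 2
    SELECT (10000 * 10000 + 10000) / 2 AS worst_case_row_visits; -- 50,005,000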

    There's no doubt that the code is clever and that it'll work reasonably well (albeit with some unnecessary extra reads) when the worst case doesn't happen... I just have a hard time taking such a gamble.

    Heh... ok. I'll get off the soapbox now. It's up to you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.

    It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.

  • Paul White NZ (4/6/2010)


    Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.

    It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.

    Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided... 😉

    First, a bit more test data... a paltry 10K rows just like I said previously...

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL
    );
    GO
    --===== Populate the test table with 10k rows of similar data
    WITH cteTally AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY ac1.Object_ID) AS N
        FROM Master.sys.All_Columns ac1
        CROSS JOIN Master.sys.All_Columns ac2
    )
    INSERT INTO dbo.Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', DATEADD(dd, t.N - 1, '20100101'), NULL
    FROM cteTally t
    WHERE t.N BETWEEN 1 AND 10000;

    --===== Seed the first row with data
    UPDATE dbo.Sample1
    SET Cost = 123
    WHERE Period = '20100101';

    -- Index required for good performance
    CREATE UNIQUE NONCLUSTERED INDEX nc1
    ON dbo.Sample1 (product_id, entity_id, period DESC)
    INCLUDE (cost);
    GO

    ... and now the test...

    SET NOCOUNT ON;
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    PRINT '********** Simple UPDATE for comparison **********';
    UPDATE dbo.Sample1
    SET Cost = NULL;

    PRINT '********** Triangular Join **********';
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.cost IS NOT NULL
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.cost IS NULL;

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;
    SET NOCOUNT OFF;

    ...and, of course, the results...

    ********** Simple UPDATE for comparison **********
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Sample1'. Scan count 1, logical reads 41, physical reads 0, read-ahead reads 0,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 94 ms, elapsed time = 104 ms.

    ********** Triangular Join **********
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Sample1'. Scan count 10001, logical reads 253067, physical reads 0, read-ahead reads 3,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 39094 ms, elapsed time = 43389 ms.

    Like I admitted, because of the nature of the data for this problem, you might get lucky and never see such a thing happen, but the code still has the problem. 😉

    --Jeff Moden



  • Jeff Moden (4/6/2010)


    Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided.

    Consider this code my apple sauce to your pork chop 😉

    I took the liberty of changing the test order so that the 'simple UPDATE for comparison' did not overwrite the strategically placed single non-NULL cost value before my code ran.

    Test data

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL,
        -- Computed column (not persisted yet)
        flag_null  AS CASE WHEN cost IS NULL THEN 1 ELSE 0 END
    );
    GO
    --===== Populate the test table with 10k rows of similar data - cost all NULL
    WITH Numbers (n) AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0))
        FROM master.sys.all_columns ac1,
             master.sys.all_columns ac2
    )
    INSERT dbo.Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', DATEADD(DAY, N.n - 1, '2010-01-01T00:00:00'), NULL
    FROM Numbers N
    WHERE N.n BETWEEN 1 AND 10000;

    --===== Seed the first row with data
    UPDATE dbo.Sample1
    SET Cost = 123
    WHERE Period = '20100101';

    -- Index required for good performance
    CREATE UNIQUE CLUSTERED INDEX nc1
    ON dbo.Sample1 (flag_null, product_id, entity_id, period DESC);
    GO

    The test

    SET NOCOUNT ON;
    SET STATISTICS IO, TIME ON;

    PRINT '********** Triangular Join? **********';
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.flag_null = 0
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.flag_null = 1;

    PRINT '********** Simple UPDATE for comparison **********';
    UPDATE dbo.Sample1
    SET Cost = NULL;

    SET STATISTICS IO, TIME OFF;
    SET NOCOUNT OFF;

    Results:

    ********** Triangular Join? **********
    Table 'Sample1'. Scan count 10000, logical reads 61782
    Table 'Worktable'. Scan count 1, logical reads 20348
       CPU time = 282 ms, elapsed time = 280 ms.

    ********** Simple UPDATE for comparison **********
    Table 'Sample1'. Scan count 1, logical reads 40477
    Table 'Worktable'. Scan count 1, logical reads 20290
       CPU time = 250 ms, elapsed time = 298 ms.

    The issue with the previous code was the indexing. It is not possible to perform two range seeks in one index operation, so the cost test became a residual predicate, checked row by row. The solution above redresses that oversight, and produces excellent performance regardless of the NULL distribution.
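
    To illustrate (a sketch with placeholder literal values; best confirmed against the actual execution plan), with flag_null leading the clustered key, every condition in the inner query collapses into a single seek range, with nothing left over as a row-by-row residual:

    SELECT TOP (1) cost
    FROM dbo.Sample1
    WHERE flag_null = 0             -- equality on the leading key column
        AND product_id = 'CMP1000'
        AND entity_id  = 'PM1'
        AND period    <= '20100601' -- range seek on the last key column
    ORDER BY period DESC;           -- matches the index order, so TOP (1) reads one row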

    The above repro still represents the worst case. The best case is when there are no NULL values at all - the seek on flag_null = 1 finds nothing, so the update touches almost no pages:

    Table 'Sample1'. Scan count 1, logical reads 2
    Table 'Worktable'. Scan count 1, logical reads 0
       CPU time = 0 ms, elapsed time = 0 ms.

  • VERY clever. Seriously. And, removing the following line doesn't hurt much...

    WHERE S.flag_null = 1

    In that same vein (playing around with your code), it also doesn't seem to hurt if the calculated column is a real column defaulted to 0 - see the sketch below.

    Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.
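
    For reference, a minimal sketch of that variation (assuming the flag then has to be maintained explicitly, since it is no longer computed):

    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL,
        -- Real column instead of the computed one, defaulted to 0
        flag_null  TINYINT NOT NULL CONSTRAINT DF_Sample1_flag_null DEFAULT (0)
    );
    CREATE UNIQUE CLUSTERED INDEX nc1
    ON dbo.Sample1 (flag_null, product_id, entity_id, period DESC);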

    --Jeff Moden



  • Jeff Moden (4/6/2010)


    VERY clever. Seriously.

    Thanks! :blush:

    I have my moments...and you are a pain for making me spend half an hour finding that optimization!

    Just kidding - I loved the challenge, actually.

    Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.

    Technically, I suppose it is, on deeper reflection.

    Paul

    edit: Changed my mind about the triangle thing.
