Update current period data based on previous period data

  • Paul White NZ (4/6/2010)


    yingchai (4/6/2010)


    I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. That should be no problem, right?

    Given correct indexing, it will be fine.

    Hi Paul,

    Can you elaborate on that, or point me to some articles on the subject?

    Currently, all my fact tables do not have primary keys, and I'm not too sure whether they're needed for indexing...

  • yingchai (4/6/2010)


    Currently, all my fact tables do not have primary keys...

    Your fact table should always have a Primary Key - whether it is a surrogate key, or based on your dimension foreign keys. Anyway, that's a different discussion.
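
    For example, a minimal sketch of enforcing that on a fact table keyed by its dimension foreign keys (the table name dbo.FactCost is illustrative, not from your schema):

    ALTER TABLE dbo.FactCost
    ADD CONSTRAINT PK_FactCost
        PRIMARY KEY CLUSTERED (product_id, entity_id, period);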

    As far as indexing on your UPDATE query is concerned:

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     INTEGER NOT NULL,
        cost       SMALLMONEY NULL
    );
    GO
    -- Sample data: NULL costs must be filled from a prior period
    INSERT Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', 20090101, 100  UNION ALL
    SELECT 'CMP1000', 'PM1', 20090201, 80   UNION ALL
    SELECT 'CMP1000', 'PM1', 20090301, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090401, 110  UNION ALL
    SELECT 'CMP1000', 'PM1', 20090501, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090601, NULL UNION ALL
    SELECT 'CMP1000', 'PM1', 20090701, NULL;

    -- Index required for good performance
    CREATE UNIQUE NONCLUSTERED INDEX nc1
    ON dbo.Sample1 (product_id, entity_id, period DESC)
    INCLUDE (cost);
    GO
    -- Fill each NULL cost from the latest earlier period that has a non-NULL cost
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.cost IS NOT NULL
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.cost IS NULL;
    GO

  • yingchai (4/6/2010)


    I'm not so concerned about the I/O performance, as I predict the Cost fact table will grow to approximately 1 million records in 10 years. Currently, the database server is running on 16 GB RAM. That should be no problem, right?

    Heh... if you say so. Me? I have a real problem with just 10,000 rows causing a quarter of a million reads, taking 36 seconds to execute without proper indexing and 4 seconds even with it, on a 4-processor, 8 GB system. It's called "technical debt", and every time I've seen someone build in that kind of technical debt, they've found some way to regret it in the very near future.

    4 seconds per 10,000 rows is the worst-case scenario of having to smear a single row of data over 10,000 other rows, and the problem grows quadratically ((N²+N)/2). Maybe you'll get lucky and the worst case won't actually happen (and there's a good chance in this scenario that it won't), but it's still a "chance". Maybe you'll get lucky and no one will copy the code and use its essence on something where the worst case will happen.
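
    As a quick sanity check on that formula (a sketch, just plugging in N = 10,000):

    -- Worst-case rows touched by a triangular join over N rows: (N*N + N) / 2
    SELECT (10000 * 10000 + 10000) / 2 AS worst_case_row_visits; -- 50,005,000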

    There's no doubt that the code is clever and that it'll work reasonably well (albeit with some unnecessary extra reads) when the worst case doesn't happen... I just have a hard time taking such a gamble.

    Heh... ok. I'll get off the soapbox now. It's up to you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.

    It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.

  • Paul White NZ (4/6/2010)


    Please do read Jeff's article on Triangular Joins so you understand what an important concept it is.

    It will also help you to understand why the code I posted does not suffer from the problem, given the index provided.

    Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided... 😉

    First, a bit more test data... a paltry 10K rows just like I said previously...

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL
    );
    GO
    --===== Populate the test table with 10k rows of similar data
    WITH cteTally AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY ac1.Object_ID) AS N
        FROM Master.sys.All_Columns ac1
        CROSS JOIN Master.sys.All_Columns ac2
    )
    INSERT INTO dbo.Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', DATEADD(dd, t.N - 1, '20100101'), NULL
    FROM cteTally t
    WHERE t.N BETWEEN 1 AND 10000;

    --===== Seed the first row with data
    UPDATE dbo.Sample1
    SET Cost = 123
    WHERE Period = '20100101';

    -- Index required for good performance
    CREATE UNIQUE NONCLUSTERED INDEX nc1
    ON dbo.Sample1 (product_id, entity_id, period DESC)
    INCLUDE (cost);
    GO

    ... and now the test...

    SET NOCOUNT ON;
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    PRINT '********** Simple UPDATE for comparison **********';
    UPDATE dbo.Sample1
    SET Cost = NULL;

    PRINT '********** Triangular Join **********';
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.cost IS NOT NULL
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.cost IS NULL;

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;
    SET NOCOUNT OFF;

    ...and, of course, the results...

    ********** Simple UPDATE for comparison **********
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Sample1'. Scan count 1, logical reads 41, physical reads 0, read-ahead reads 0,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 94 ms, elapsed time = 104 ms.

    ********** Triangular Join **********
    SQL Server Execution Times:
       CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Sample1'. Scan count 10001, logical reads 253067, physical reads 0, read-ahead reads 3,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 1, logical reads 0, physical reads 0, read-ahead reads 0,
       lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 39094 ms, elapsed time = 43389 ms.

    Like I admitted, because of the nature of the data for this problem, you might get lucky and never see such a thing happen, but the code still has the problem. 😉

    --Jeff Moden



  • Jeff Moden (4/6/2010)


    Maybe this will help you to understand why the code you posted does suffer from the same problem even given the index provided.

    Consider this code my apple sauce to your pork chop 😉

    I took the liberty of changing the test order so that the 'simple UPDATE for comparison' did not overwrite the strategically placed single non-NULL cost value before my code ran.

    Test data

    USE tempdb;
    GO
    IF OBJECT_ID(N'dbo.Sample1') IS NOT NULL
        DROP TABLE dbo.Sample1;
    GO
    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL,
        -- Computed column (not persisted yet)
        flag_null  AS CASE WHEN cost IS NULL THEN 1 ELSE 0 END
    );
    GO
    --===== Populate the test table with 10k rows of similar data - cost all NULL
    WITH Numbers (n) AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0))
        FROM master.sys.all_columns ac1,
             master.sys.all_columns ac2
    )
    INSERT dbo.Sample1
        (product_id, entity_id, period, cost)
    SELECT 'CMP1000', 'PM1', DATEADD(DAY, N.n - 1, '2010-01-01T00:00:00'), NULL
    FROM Numbers N
    WHERE N.n BETWEEN 1 AND 10000;

    --===== Seed the first row with data
    UPDATE dbo.Sample1
    SET Cost = 123
    WHERE Period = '20100101';

    -- Index required for good performance
    CREATE UNIQUE CLUSTERED INDEX nc1
    ON dbo.Sample1 (flag_null, product_id, entity_id, period DESC);
    GO

    The test

    SET NOCOUNT ON;
    SET STATISTICS IO, TIME ON;

    PRINT '********** Triangular Join? **********';
    UPDATE S
    SET cost = CA.cost
    FROM Sample1 S
    CROSS APPLY
    (
        SELECT TOP (1) S2.cost
        FROM Sample1 S2
        WHERE S2.product_id = S.product_id
            AND S2.entity_id = S.entity_id
            AND S2.flag_null = 0
            AND S2.period <= S.period
        ORDER BY S2.period DESC
    ) CA
    WHERE S.flag_null = 1;

    PRINT '********** Simple UPDATE for comparison **********';
    UPDATE dbo.Sample1
    SET Cost = NULL;

    SET STATISTICS IO, TIME OFF;
    SET NOCOUNT OFF;

    Results:

    ********** Triangular Join? **********
    Table 'Sample1'. Scan count 10000, logical reads 61782
    Table 'Worktable'. Scan count 1, logical reads 20348
       CPU time = 282 ms, elapsed time = 280 ms.

    ********** Simple UPDATE for comparison **********
    Table 'Sample1'. Scan count 1, logical reads 40477
    Table 'Worktable'. Scan count 1, logical reads 20290
       CPU time = 250 ms, elapsed time = 298 ms.

    The issue with the previous code was the indexing. It is not possible to perform two range seeks in one index operation, so the cost test became a residual predicate, checked row by row. The solution above redresses that oversight, and produces excellent performance regardless of the NULL distribution.
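
    To illustrate (a sketch with placeholder literal values; best confirmed against the actual execution plan), with flag_null leading the clustered key, every condition in the inner query collapses into a single seek range, with nothing left over as a row-by-row residual:

    SELECT TOP (1) cost
    FROM dbo.Sample1
    WHERE flag_null = 0             -- equality on the leading key column
        AND product_id = 'CMP1000'
        AND entity_id  = 'PM1'
        AND period    <= '20100601' -- range seek on the last key column
    ORDER BY period DESC;           -- matches the index order, so TOP (1) reads one row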

    The above repro still represents the worst case. The best case is when there are no NULL values at all - the seek on flag_null = 1 finds nothing, so the update touches almost no pages:

    Table 'Sample1'. Scan count 1, logical reads 2
    Table 'Worktable'. Scan count 1, logical reads 0
       CPU time = 0 ms, elapsed time = 0 ms.

  • VERY clever. Seriously. And, removing the following line doesn't hurt much...

    WHERE S.flag_null = 1

    In that same vein (playing around with your code), it also doesn't seem to hurt if the calculated column is a real column defaulted to 0 - see the sketch below.

    Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.
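
    For reference, a minimal sketch of that variation (assuming the flag then has to be maintained explicitly, since it is no longer computed):

    CREATE TABLE dbo.Sample1
    (
        product_id CHAR(7) NOT NULL,
        entity_id  CHAR(3) NOT NULL,
        period     DATETIME NOT NULL,
        cost       SMALLMONEY NULL,
        -- Real column instead of the computed one, defaulted to 0
        flag_null  TINYINT NOT NULL CONSTRAINT DF_Sample1_flag_null DEFAULT (0)
    );
    CREATE UNIQUE CLUSTERED INDEX nc1
    ON dbo.Sample1 (flag_null, product_id, entity_id, period DESC);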

    --Jeff Moden



  • Jeff Moden (4/6/2010)


    VERY clever. Seriously.

    Thanks! :blush:

    I have my moments...and you are a pain for making me spend half an hour finding that optimization!

    Just kidding - I loved the challenge, actually.

    Yes, it's still a Triangular Join although it's an incredibly fast one. Nicely done. You've also given me an idea that I have to try. If it works out, I'll let you know.

    Technically, I suppose it is, on deeper reflection.

    Paul

    edit: Changed my mind about the triangle thing.
