Auto update table with correlated subquery

  • As always my weekend is stacked with fun family outings..

    But i will see if i have some time between Kids / basketball and finishing up some work. I will let you know.

    Jeff, i have read your articles about RBAR with great interest, and something tells me that you have a preferred solution. Just waiting to see if it is the corrrect one?

  • Thanks Hans, I appreciate the feedback... I had a great weekend, too. Heh... "Honey-Do" list and other labors of love.:D

    The reason I'm so adamant about testing this is I want to make a couple of points and maybe dispell a couple of myths in the process...

    Two people said that the given Inner Join solution (so far) would probably run better than the correlated sub-query under the right conditions. The reason why they said that is because lot's of folks "know" that correlated sub-queries are a bit like cursors in that the correlation must be resolved for each row being updated (in this case). What they don't know is that because of they way they wrote the join, the "right conditions" will never exist for the inner join, as they wrote it, to beat the correlated sub-query.

    You, on the other hand, are wise in not listening to what everyone else "knows" and have recognized the fact that, like anything else, there is a time and place for everything... even certain generally accepted "taboos". The proof in the pudding is the testing I've done (I'll show that in a minute) and regardless of what type and order an index is put on the table, the correlated sub-query method you wrote always wins. Sometimes, not by much, but it always wins.

    So far, the only time the inner join method wins against the correlated subquery is when there is no index on the table.

    The problem is, and I've fought tooth and nail over this with many good people, is that proper indexing is not necessarily the best tool for performance. Under the "right conditions", one or the other code that has been posted, so far, wins the proverbial foot race and THAT'S THE REAL PROBLEM! YOU NEED THE RIGHT CONDITIONS for one or the other to win where performance counts!

    It's not always possible, but, whenever you see that you need to do something to a whole table, you need to look for a solution that runs well under any condition regardless of what the indexes are especially since the addition of indexes can cripple performance on high OLTP tables.

    This problem in this thread is a perfect example of what can happen if you look for such a solution. The code will usually be simpler and easier to read than any of the other solutions and, if ya get real lucky (you frequently will), you find that the "whole table" code will run equally as well with or without indexes. You'll see what i mean in my solution for this code when I show the test code.

    And, that's the fight I've been having with folks... the performance is in the code, no where else. Sure, the use of indexes can help the code, but writing the correct performance enabled code in the first place is the key to performance.

    Crud... I didn't realize that it was 12:30 in the morning here... I've gotta be up at 5:30... I'll come back with the test code tomorrow and a bit of a surprise that supports everything I've said, so far.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... sorry for the delay.

    Here's the test data I used... Just to keep things consistent, I made sure there were no duplicates on the GebID/Load_Dts combination of columns...

    --===== Create and populate a 100,000 row test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "GEBID" has a range of 1 to 50,000 non-unique numbers

    -- Column "LOAD_DTS" has a range of >=01/01/2000 and <01/01/2010 non-unique SMALLDATETIMES

    -- Column "LOAD_END_DTS" contains all NULL SMALLDATETIMES

    -- Jeff Moden

    SELECT TOP 100030 --The "30" approximately makes up for any dupes we'll delete

    RowNum = IDENTITY(INT,1,1),

    GEBID = ISNULL(ABS(CHECKSUM(NEWID()))%50000+1,0),

    LOAD_DTS = ISNULL(CAST(ABS(CHECKSUM(NEWID()))%3653+36524 AS SMALLDATETIME),0),

    LOAD_END_DTS = CAST(NULL AS SMALLDATETIME)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    --===== Delete any accidental dupes

    DELETE t1

    FROM dbo.JBMTest t1

    INNER JOIN dbo.JBMTest t2

    ON t1.GEBID = t2.GEBID

    AND t1.LOAD_DTS = t2.LOAD_DTS

    AND t1.RowNum < t2.RowNum

    Notice... no indexes at first. Here's the code being tested... there are 3 "entries"... one for Richard's Derived Table method, one for Hans final Correlated Subquery method, and one that uses a Derived Table in a different manner than Richard's. Obviously, the queries all need to have column names or table names changed a bit to make them all work with the same test data. Other than that, nothing changed.

    Here's the code...

    --=======================================================================================

    -- Hans' Correlated Subquery method

    --=======================================================================================

    --===== Identify and setup the run

    PRINT '========== Hans'' Correlated Subquery method =========='

    UPDATE dbo.JBMTest SET Load_End_Dts = NULL --Resets the test and changes cache so every is the same

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    --===== Run the code under test

    Update s

    Set s.load_end_dts = dateadd(ss,-1,(GetDate()))

    From dbo.JBMTest as s

    Where s.Load_DTS = (SELECT MIN(x.Load_dts) from dbo.JBMTest x

    where x.gebid = s.gebid and x.load_end_dts is NULL)

    and s.gebid IN (SELECT Y.GEBID FROM dbo.JBMTest Y

    WHERE Y.LOAD_END_DTS IS NULL

    GROUP BY Y.GEBID

    HAVING COUNT(Y.LOAD_DTS)>1)

    --===== End the run and mark it's end in the output

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    PRINT REPLICATE('=',100)

    --=======================================================================================

    -- Richard's "Derived table" method

    --=======================================================================================

    --===== Identify and setup the run

    PRINT '========== Richard''s Derived Table method =========='

    UPDATE dbo.JBMTest SET Load_End_Dts = NULL

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    --===== Run the code under test

    Update s

    Set s.load_end_dts = GetDate()

    From dbo.JBMTest as s

    Join (Select GEBID, Min(Load_Dts) as MinDts From dbo.JBMTest Group By GEBID) as z

    On s.GEBID = z.GEBID And s.Load_Dts = z.MinDts

    Where s.load_end_dts is NULL

    and s.GEBID IN (SELECT Y.GEBID FROM dbo.JBMTest Y

    WHERE Y.LOAD_END_DTS IS NULL

    GROUP BY Y.GEBID

    HAVING COUNT(Y.LOAD_DTS)>1)

    --===== End the run and mark it's end in the output

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    PRINT REPLICATE('=',100)

    --=======================================================================================

    -- "Dark Horse" Derived table method

    --=======================================================================================

    --===== Identify and setup the run

    PRINT '=========="Dark Horse" Derived Table method =========='

    UPDATE dbo.JBMTest SET Load_End_Dts = NULL

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    --===== Run the code under test

    UPDATE tgt

    SET Load_End_Dts = GETDATE()

    FROM dbo.JBMTest tgt

    INNER JOIN (SELECT GebID, MIN(Load_Dts) AS MinDts

    FROM dbo.JBMTest

    WHERE Load_End_Dts IS NULL

    GROUP BY GebID HAVING COUNT(*) >1) src

    ON tgt.GebID = src.GebID

    AND tgt.Load_Dts = src.MinDts

    --===== End the run and mark it's end in the output

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    PRINT REPLICATE('=',100)

    Notice that I did add some timing and resource usage code outside of each method being demonstrated.

    Now, without any indexes, whatsoever, Richard's derived table method beats Han's correlated subquery method even though it has to do an extra scan...

    ========== Hans' Correlated Subquery method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 3, logical reads 30661, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 2094 ms, elapsed time = 2861 ms.====================================================================================================

    ========== Richard's Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 3, logical reads 30661, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 29961, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1468 ms, elapsed time = 1945 ms.

    ====================================================================================================

    =========="Dark Horse" Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 2, logical reads 30350, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 29961, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1031 ms, elapsed time = 1396 ms.====================================================================================================

    Heh... but look at the "dark horse" at the bottom of the code... it beats both methods.

    There's a lot of "in between" testing with indexes that can be done, but let's jump right to the use of a clustered Primary Key...

    --====== Add a primary key to the unique column combination

    ALTER TABLE dbo.JBMTest

    ADD CONSTRAINT PK_JBMTest_Composite PRIMARY KEY CLUSTERED (GEBID,LOAD_DTS)

    That produces the following...

    ========== Hans' Correlated Subquery method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 2, logical reads 60078, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1188 ms, elapsed time = 1350 ms.

    ====================================================================================================

    ========== Richard's Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 3, logical reads 60389, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1641 ms, elapsed time = 2018 ms.

    ====================================================================================================

    =========="Dark Horse" Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 2, logical reads 60078, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1031 ms, elapsed time = 1244 ms.

    ====================================================================================================

    That time, the Han's correlated subquery method beat Richard's derived table method. But, OH!, there's that "dark horse" again. He beat both again. In fact, the 3rd method takes the same amount of time with or without an index.

    Let's "flip" the index and try again (just to be fair)...

    ALTER TABLE dbo.JBMTest

    DROP CONSTRAINT PK_JBMTest_Composite

    GO

    ALTER TABLE dbo.JBMTest

    ADD CONSTRAINT PK_JBMTest_Composite PRIMARY KEY CLUSTERED (LOAD_DTS,GEBID)

    That results in the following...

    ========== Hans' Correlated Subquery method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 3, logical reads 60389, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1328 ms, elapsed time = 1793 ms.

    ====================================================================================================

    ========== Richard's Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 3, logical reads 60389, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1391 ms, elapsed time = 1496 ms.

    ====================================================================================================

    =========="Dark Horse" Derived Table method ==========

    (99996 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (29728 row(s) affected)

    Table 'JBMTest'. Scan count 2, logical reads 60078, physical reads 0, read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 30006, physical reads 0, read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 969 ms, elapsed time = 1243 ms.

    ====================================================================================================

    Richard's derived table got a little better and Hans' correlated subquery got a little worse. In fact, they just about tied...

    But, the "dark horse" still wins...

    I tried many different combinations including a covering index which, of course, slowed things down a lot because the clustered index was being updated... but, the "dark horse" still won.

    It's not so important that the "dark horse" wins here... what is important is WHY it wins. The reason why it wins is simple... it uses fewer joins to accomplish exactly the same task.

    That's why i suggested the testing... I wanted to accomplish several things...

    The first was to make people realize that "rules of thumb" don't always apply (Han's correlated subquery did beat Richard's derived table method quite a bit under the "right conditions".)

    Another was to not simply accept an answer. Go behind the scenes and ask things like "are two joins back to the same table (table is referenced a total of 3 times) really necessary to do this?" Once you've done this a couple of times, you'll be able to look at the code and say "I know a better way".

    As a side bar, if anyone knows of a different correlated subquery to accomplish this same task, I'd be happy to test it... I've tried a WHERE IN and WHERE EXISTS but I'm so used to not writing correlated subqueries, I don't even know if there's a way to write a correlated subquery into the SET statement for this problem. I'm always willing to learn something new and frequently do. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • blom0344 (8/20/2008)


    I have the following query that is supposed to update a date-column in a table based on the lowest value of another date from the same table for every userid:

    UPDATE dbo.S_TABLE

    SET LOAD_END_DTS = GETDATE()

    WHERE LOAD_END_DTS IS NULL AND

    LOAD_DTS = (SELECT MIN(X.LOAD_DTS) FROM dbo.S_TABLE X

    WHERE USERID = X.USERID)

    AND USERID IN

    (SELECT Y.USERID FROM dbo.S_TABLE Y

    WHERE Y.LOAD_END_DTS IS NULL

    GROUP BY Y.USERID

    HAVING COUNT(Y.LOAD_DTS)>1)

    The bold part is the correlated subquery that should make sure that the action is performed for EACH USERID.

    The second part of the query is just there to avoid updating rows for a USERID that only have one row with LOAD_END_DTS as null.

    However, the correlation does not seems to work as the MIN(X.LOAD_DTS) is established for the entire table and not for each USERID.

    What am I missing, this used to work perfectly in - for instance - Oracle..

    as it stands the query in bold is not correlated as it has no reference to the parent query until you do something like this

    UPDATE dbo.S_TABLE Y

    SET LOAD_END_DTS = GETDATE()

    WHERE LOAD_END_DTS IS NULL AND

    LOAD_DTS = (SELECT MIN(X.LOAD_DTS) FROM dbo.S_TABLE X

    WHERE Y.USERID = X.USERID)

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • That is certainly true, the syntax has to be slightly different though as the alias has to be referenced first to make this work..

  • Or, you could do it the way I showed you and not worry about correlation. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes, and thanks for putting all this effort into my post. I've yet to find some time to examine your extensive treatment, but I surely will !!

  • Jeff Moden (9/9/2008)


    Or, you could do it the way I showed you and not worry about correlation. 😉

    even better 😉

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Heh... thanks for the feedback guys.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 9 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply