Need the Price with the Latest Date

  • Craig Farrell (2/2/2011)


    tfifield (2/2/2011)


    At some point, as the number of rows within the partition increases, the CROSS APPLY will start to pull away from either MAX or ROW_NUMBER(), since it will seek rather than having to touch every row.

    That makes sense. If you chance upon the article again, I'd love to see the benchmarking.

    Craig,

    I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition, and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.
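
    By way of illustration, "full statistics" here means something along the lines of the following sketch - the table and column names are assumed from the benchmark script later in this thread, so this is illustrative only:

    -- Build a fullscan statistics object on the partitioning column so the
    -- optimizer has an exact picture of how many distinct GroupID values exist.
    CREATE STATISTICS st_TestData_GroupID
        ON #TestData (GroupID)
        WITH FULLSCAN

    -- Or refresh every statistics object already on the table:
    UPDATE STATISTICS #TestData WITH FULLSCAN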

    In any case, I'll try to find the tests and post them.

    Todd Fifield

  • tfifield (2/3/2011)


    Craig,

    I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition, and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.

    In any case, I'll try to find the tests and post them.

    Todd Fifield

    Todd,

    Did your select look something like this?

    SELECT
        dst.GroupID,
        ca.NumData1,
        ca.NumData2,
        ca.UpdateDate
    FROM
        (SELECT DISTINCT GroupID FROM #TestData) AS dst
    CROSS APPLY
        (SELECT TOP 1 GroupID, NumData1, NumData2, UpdateDate
        FROM #TestData AS td2
        WHERE td2.GroupID = dst.GroupID
        ORDER BY UpdateDate DESC) AS ca

    If so, I'll toss it into the benchmarking here with a few options for density and the like and we'll see what happens.
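
    For reference, the seek behavior we're discussing presupposes a supporting index shaped roughly like the sketch below; the INCLUDE list is an assumption, added so the seek is covering:

    CREATE NONCLUSTERED INDEX idx_TestData_Latest
        ON #TestData (GroupID, UpdateDate DESC)
        INCLUDE (NumData1, NumData2)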


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    SET NOCOUNT ON

    IF OBJECT_ID( 'tempdb..#Tally') IS NOT NULL
        DROP TABLE #Tally
    IF OBJECT_ID( 'tempdb..#TimeTests') IS NOT NULL
        DROP TABLE #TimeTests

    CREATE TABLE #TimeTests
        (SettingsIteration INT NOT NULL,
        IndexIteration INT NOT NULL,
        LoopIteration INT NOT NULL,
        StepComment VARCHAR(100) NOT NULL,
        MillisecondDuration BIGINT NOT NULL
        )

    -- Slightly modified from the tally article
    SELECT TOP 50000
        IDENTITY(INT,1,1) AS N
    INTO #Tally
    FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

    ALTER TABLE #Tally
        ADD CONSTRAINT PK_Tally_N
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    DECLARE @DensityRecords INT,
        @JunkData1 VARCHAR(200),
        @JunkData2 VARCHAR(2000),
        @YearsInDate INT,
        @DurationTime DATETIME,
        @RunTestLoop INT,
        @IndexSettingsLoop INT,
        @DataSettingsLoop INT

    SELECT @DataSettingsLoop = 1,
        @YearsInDate = 20

    WHILE @DataSettingsLoop <= 6
    BEGIN
        IF OBJECT_ID( 'tempdb..#TestData') IS NOT NULL
            DROP TABLE #TestData

        CREATE TABLE #TestData
            (GroupID INT,
            NumData1 INT,
            NumData2 INT,
            UpdateDate DATETIME,
            JunkData1 VARCHAR(200),
            JunkData2 VARCHAR(2000)
            )

        IF @DataSettingsLoop = 1
        BEGIN
            SELECT @DensityRecords = 10,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 2
        BEGIN
            SELECT @DensityRecords = 10,
                @JunkData2 = REPLICATE( '@', 2000)
        END
        ELSE IF @DataSettingsLoop = 3
        BEGIN
            SELECT @DensityRecords = 1000,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 4
        BEGIN
            SELECT @DensityRecords = 1000,
                @JunkData2 = REPLICATE( '@', 2000)
        END
        ELSE IF @DataSettingsLoop = 5
        BEGIN
            SELECT @DensityRecords = 100000,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 6
        BEGIN
            SELECT @DensityRecords = 100000,
                @JunkData2 = REPLICATE( '@', 2000)
        END

        --PRINT 'Density Set to: ' + CAST( @DensityRecords AS VARCHAR(15)) + ', with a JunkData2 Fill at ' + CAST( LEN( ISNULL( @JunkData2, '')) AS VARCHAR(10)) + ' characters.'

        INSERT INTO #TestData
        SELECT
            t.N % @DensityRecords AS GroupID, -- Adjust the modulo to decide the # of groups to compare.
            (t.N + 10) * 2 / 3 AS NumData1, -- Simply here as a filler
            t.N AS NumData2, -- Some more filler
            CAST( DATEADD( yyyy, 100, ABS( CHECKSUM(NEWID()) % (365 * @YearsInDate))) AS DATETIME) AS UpdateDate, -- Give me a random date
            @JunkData1 AS JunkData1, -- Never assigned above, so always NULL
            @JunkData2 AS JunkData2
        FROM
            #Tally AS t,
            #Tally AS t2
        WHERE
            t.N <= 50000
            AND t2.N <= 20 -- Control your testing rowcount here. 2 for 100k row test, 20 for 1 mill.

        -- Now, an intelligent index to make sure we only add to the tail, like a good prod system
        CREATE CLUSTERED INDEX idx_TestData
            ON #TestData
            ( UpdateDate, GroupID)

        SET @IndexSettingsLoop = 1

        WHILE @IndexSettingsLoop <= 2
        BEGIN
            -- A couple of quick metrics
            SET @RunTestLoop = 1
            SET @DurationTime = GETDATE()

            WHILE @RunTestLoop <= 5
            BEGIN
                --PRINT 'TestLoop: ' + CAST( @RunTestLoop AS VARCHAR(3))

                -- MAX() method
                ;WITH cte AS
                    ( SELECT
                        GroupID,
                        MAX( UpdateDate) AS MaxDate
                    FROM
                        #TestData --WITH ( INDEX(0))
                    GROUP BY
                        GroupID
                    )
                SELECT
                    td.GroupID,
                    td.NumData1,
                    td.NumData2,
                    td.UpdateDate
                FROM
                    #TestData AS td
                JOIN
                    cte
                    ON td.GroupID = cte.GroupID
                    AND td.UpdateDate = cte.MaxDate

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'MaxDate Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'MaxDate Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                -- ROW_NUMBER() method
                ;WITH cte AS
                    ( SELECT GroupID,
                        UpdateDate,
                        ROW_NUMBER() OVER ( PARTITION BY GroupID ORDER BY UpdateDate DESC) AS RowNum
                    FROM
                        #TestData)
                SELECT
                    td.GroupID,
                    td.NumData1,
                    td.NumData2,
                    td.UpdateDate
                FROM
                    #TestData AS td
                JOIN
                    cte
                    ON td.GroupID = cte.GroupID
                    AND td.UpdateDate = cte.UpdateDate
                WHERE
                    cte.RowNum = 1

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'RowNum Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'RowNum Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                -- CROSS APPLY/TOP 1 method
                SELECT
                    dst.GroupID,
                    ca.NumData1,
                    ca.NumData2,
                    ca.UpdateDate
                FROM
                    (SELECT DISTINCT GroupID FROM #TestData) AS dst
                CROSS APPLY
                    (SELECT TOP 1 GroupID, NumData1, NumData2, UpdateDate
                    FROM #TestData AS td2
                    WHERE td2.GroupID = dst.GroupID
                    ORDER BY UpdateDate DESC) AS ca

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'CrossApply Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'CrossApply Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                SET @RunTestLoop = @RunTestLoop + 1
            END

            SET @RunTestLoop = 1

            -- Create a descending nonclustered index for ROW_NUMBER()'s purposes
            IF @IndexSettingsLoop = 1
            BEGIN
                CREATE NONCLUSTERED INDEX idx_TestData2 ON #TestData
                    (GroupID, UpdateDate DESC)
            END

            SET @IndexSettingsLoop = @IndexSettingsLoop + 1
        END

        SET @DataSettingsLoop = @DataSettingsLoop + 1
    END

    /*
    -- Run after the above, as the maximum of 100 result grids gets overwhelmed otherwise.
    -- Iteration 1 on each loop deals with a caching problem, where whoever's first gets bad result data.
    DELETE FROM #TimeTests WHERE LoopIteration = 1

    SELECT * FROM #TimeTests
    ORDER BY
        SettingsIteration, IndexIteration, LoopIteration, StepComment

    SELECT
        SettingsIteration, IndexIteration, CAST( StepComment AS VARCHAR(25)) AS Step,
        AVG( MillisecondDuration) AS AvgDur
    FROM
        #TimeTests
    GROUP BY
        SettingsIteration, IndexIteration, StepComment
    ORDER BY
        SettingsIteration, IndexIteration, AvgDur
    */

    I get these results:

    SettingsIteration  IndexIteration  Step               AvgDur (ms)
    1                  1               CrossApply Method  25
    1                  1               MaxDate Method     36
    1                  1               RowNum Method      838
    1                  2               CrossApply Method  118
    1                  2               MaxDate Method     213
    1                  2               RowNum Method      410
    2                  1               CrossApply Method  117
    2                  1               MaxDate Method     141
    2                  1               RowNum Method      1046
    2                  2               CrossApply Method  196
    2                  2               RowNum Method      353
    2                  2               MaxDate Method     372
    3                  1               MaxDate Method     49
    3                  1               CrossApply Method  96
    3                  1               RowNum Method      984
    3                  2               MaxDate Method     56
    3                  2               CrossApply Method  91
    3                  2               RowNum Method      425
    4                  1               MaxDate Method     134
    4                  1               CrossApply Method  229
    4                  1               RowNum Method      1035
    4                  2               CrossApply Method  26
    4                  2               MaxDate Method     59
    4                  2               RowNum Method      342
    5                  1               MaxDate Method     195
    5                  1               RowNum Method      1062
    5                  1               CrossApply Method  3752
    5                  2               CrossApply Method  159
    5                  2               MaxDate Method     206
    5                  2               RowNum Method      515
    6                  1               MaxDate Method     252
    6                  1               RowNum Method      943
    6                  1               CrossApply Method  4311
    6                  2               CrossApply Method  148
    6                  2               MaxDate Method     173
    6                  2               RowNum Method      534


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • drew.allen (2/1/2011)


    Jeff Moden (2/1/2011)


    drew.allen (1/31/2011)


    Craig Farrell (1/31/2011)


    After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate

    My tests indicate the opposite. I've found that using Row_Number() is faster.

    Drew

    Ah... you have tests. 😉 You should post them. 🙂

    I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx

    Drew

    Sorry, I lost track of this thread. Thanks, Drew.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Craig Farrell (2/3/2011)


    Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    Wow! Lots'o code! What conclusion do you draw from it all, Craig?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/4/2011)


    Craig Farrell (2/3/2011)


    Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    Wow! Lots'o code! What conclusion do you draw from it all, Craig?

    Please note: All testing is being done on the 2k5 optimizer.

    I've modified it a bit since I posted this to force a time onto the date to remove some repetition.
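
    The modified expression itself wasn't posted; a plausible sketch of forcing a time onto the date, with the millisecond math being an assumption, would be:

    DECLARE @YearsInDate INT
    SET @YearsInDate = 20

    -- Random date as in the original script, plus a random time-of-day in
    -- milliseconds, so duplicate UpdateDate values become rare.
    SELECT CAST( DATEADD( ms, ABS( CHECKSUM(NEWID())) % 86400000,
        DATEADD( yyyy, 100, ABS( CHECKSUM(NEWID()) % (365 * @YearsInDate)))) AS DATETIME) AS UpdateDate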

    At poor density heuristics (5-10%), Cross Apply and Max are pretty similar, no matter the indexing. Adding secondary nonclustered indexes requires hinting to the primary index, however, to keep the MAX looking at the right index for its purposes in all heuristics cases; otherwise the optimizer goes wandering off to the second index and loses some speed.
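
    The hint in question is presumably along these lines (a sketch; the benchmark script above carries a commented-out WITH ( INDEX(0)) for the same purpose):

    -- Pin the aggregate to the clustered index so the optimizer doesn't
    -- wander off to the nonclustered index idx_TestData2.
    SELECT GroupID, MAX( UpdateDate) AS MaxDate
    FROM #TestData WITH ( INDEX( idx_TestData))
    GROUP BY GroupID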

    Get into lighter densities (1%) and it's going to depend entirely on indexing. MAX wins if you've got a standard audit index, but if you can tune for it, the CrossApply can beat it. The worse the spread of the data in the clustered index, the worse CrossApply loses without a tuned NC index to work from.

    Get into very tight densities (0.1%) and this striation continues, to the point that without a dedicated index the rownumber() method can actually outperform it.

    Row_Number() and CrossApply require an index to help them perform. My usual auditing clustered (audit/change date + common record find method) works with the MAX method out of the box.

    This is going to depend HEAVILY on your indexing and how much you're willing to force hinting. If you force hinting, they're comparable either way.

    However, there's one thing that I haven't added here. MAX() doesn't guarantee a single row return. If your data dupes at the MAX, MAX() returns multiple rows. Row_Number() and Cross Apply do not.
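
    To make that concrete against the test table - the NumData2 tie-breaker below is an arbitrary choice for illustration, not from the benchmark itself:

    ;WITH cte AS
        ( SELECT GroupID, NumData1, NumData2, UpdateDate,
            ROW_NUMBER() OVER ( PARTITION BY GroupID
                ORDER BY UpdateDate DESC, NumData2 DESC) AS RowNum
        FROM #TestData)
    SELECT GroupID, NumData1, NumData2, UpdateDate
    FROM cte
    WHERE RowNum = 1 -- exactly one row per group, even when UpdateDate ties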

    There are a few more tests I'm planning to do before I consider this conclusive. First, I need to muck with index hinting more. Second, I want to do it for a realistic scenario, such as a price at a specific date (think historic invoicing). Third, I want to add in a second grouping layer and test for primary and secondary groups. No one ever looks at everything that's the latest in an audit table; they're hunting for something. They'll hunt by a specific item, or perhaps across a group of items. That will affect the optimizer heavily.

    For bulk reading, I'd stick with MAX(), simply because of my own personal methodology and less indexing overhead.

    For real world usage... I'll get back to you. 🙂

    EDIT: The only unstruck portion of that is still accurate. I was missing a few significant components in my testing. This is getting to white paper proportions.

    Short version: Go with Cross Apply for now. Todd's right in general. I may reverse this later but everything I said above needs to get chucked out the window, wholesale. I may bring this to a fresh thread and see about getting some of you gurus involved in this. I'm starting to get in too deep.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • drew.allen (2/1/2011)


    Jeff Moden (2/1/2011)


    drew.allen (1/31/2011)


    Craig Farrell (1/31/2011)


    After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate

    My tests indicate the opposite. I've found that using Row_Number() is faster.

    Drew

    Ah... you have tests. 😉 You should post them. 🙂

    I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx

    Drew

    Apologies, I never did comment specifically on this point in detail.

    Drew, you're probably correct that Row_Number() is better when you're returning everything but specific rows. In this case, however, we're looking to return isolatable rows, rather than everything but the final row.

    The problem addressed in that other thread is to return everything inside a triangle join (all records prior to the last). MAX() might be optimizable to be faster there, but I'd have to think about it. It's not something that comes up commonly enough for me to have looked at optimizing it.
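
    For comparison, one way to phrase "everything but the final row" without the triangle join, sketched against the test table from earlier in this thread:

    ;WITH cte AS
        ( SELECT GroupID, NumData1, NumData2, UpdateDate,
            ROW_NUMBER() OVER ( PARTITION BY GroupID ORDER BY UpdateDate DESC) AS RowNum
        FROM #TestData)
    SELECT GroupID, NumData1, NumData2, UpdateDate
    FROM cte
    WHERE RowNum > 1 -- every history row except each group's newest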


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
