Need the Price with the Latest Date

  • Craig Farrell (2/2/2011)


    tfifield (2/2/2011)


    At some point, as the number of rows within the partition increases, the CROSS APPLY will start to pull away from either MAX or ROW_NUMBER(), since it will seek rather than having to touch every row.

    That makes sense. If you chance upon the article again, I'd love to see the benchmarking.

    Craig,

    I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition, and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.
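
    By way of illustration, "full statistics" here means something along the lines of the following sketch - the table and column names are assumed from the benchmark script later in this thread, so this is illustrative only:

    -- Build a fullscan statistics object on the partitioning column so the
    -- optimizer has an exact picture of how many distinct GroupID values exist.
    CREATE STATISTICS st_TestData_GroupID
        ON #TestData (GroupID)
        WITH FULLSCAN

    -- Or refresh every statistics object already on the table:
    UPDATE STATISTICS #TestData WITH FULLSCAN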

    In any case, I'll try to find the tests and post them.

    Todd Fifield

  • tfifield (2/3/2011)


    Craig,

    I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition, and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.

    In any case, I'll try to find the tests and post them.

    Todd Fifield

    Todd,

    Did your select look something like this?

    SELECT
        dst.GroupID,
        ca.NumData1,
        ca.NumData2,
        ca.UpdateDate
    FROM
        (SELECT DISTINCT GroupID FROM #TestData) AS dst
    CROSS APPLY
        (SELECT TOP 1 GroupID, NumData1, NumData2, UpdateDate
        FROM #TestData AS td2
        WHERE td2.GroupID = dst.GroupID
        ORDER BY UpdateDate DESC) AS ca

    If so, I'll toss it into the benchmarking here with a few options for density and the like and we'll see what happens.
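
    For reference, the seek behavior we're discussing presupposes a supporting index shaped roughly like the sketch below; the INCLUDE list is an assumption, added so the seek is covering:

    CREATE NONCLUSTERED INDEX idx_TestData_Latest
        ON #TestData (GroupID, UpdateDate DESC)
        INCLUDE (NumData1, NumData2)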


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    SET NOCOUNT ON

    IF OBJECT_ID( 'tempdb..#Tally') IS NOT NULL
        DROP TABLE #Tally
    IF OBJECT_ID( 'tempdb..#TimeTests') IS NOT NULL
        DROP TABLE #TimeTests

    CREATE TABLE #TimeTests
        (SettingsIteration INT NOT NULL,
        IndexIteration INT NOT NULL,
        LoopIteration INT NOT NULL,
        StepComment VARCHAR(100) NOT NULL,
        MillisecondDuration BIGINT NOT NULL
        )

    -- Slightly modified from the tally article
    SELECT TOP 50000
        IDENTITY(INT,1,1) AS N
    INTO #Tally
    FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

    ALTER TABLE #Tally
        ADD CONSTRAINT PK_Tally_N
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    DECLARE @DensityRecords INT,
        @JunkData1 VARCHAR(200),
        @JunkData2 VARCHAR(2000),
        @YearsInDate INT,
        @DurationTime DATETIME,
        @RunTestLoop INT,
        @IndexSettingsLoop INT,
        @DataSettingsLoop INT

    SELECT @DataSettingsLoop = 1,
        @YearsInDate = 20

    WHILE @DataSettingsLoop <= 6
    BEGIN
        IF OBJECT_ID( 'tempdb..#TestData') IS NOT NULL
            DROP TABLE #TestData

        CREATE TABLE #TestData
            (GroupID INT,
            NumData1 INT,
            NumData2 INT,
            UpdateDate DATETIME,
            JunkData1 VARCHAR(200),
            JunkData2 VARCHAR(2000)
            )

        IF @DataSettingsLoop = 1
        BEGIN
            SELECT @DensityRecords = 10,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 2
        BEGIN
            SELECT @DensityRecords = 10,
                @JunkData2 = REPLICATE( '@', 2000)
        END
        ELSE IF @DataSettingsLoop = 3
        BEGIN
            SELECT @DensityRecords = 1000,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 4
        BEGIN
            SELECT @DensityRecords = 1000,
                @JunkData2 = REPLICATE( '@', 2000)
        END
        ELSE IF @DataSettingsLoop = 5
        BEGIN
            SELECT @DensityRecords = 100000,
                @JunkData2 = NULL
        END
        ELSE IF @DataSettingsLoop = 6
        BEGIN
            SELECT @DensityRecords = 100000,
                @JunkData2 = REPLICATE( '@', 2000)
        END

        --PRINT 'Density Set to: ' + CAST( @DensityRecords AS VARCHAR(15)) + ', with a JunkData2 Fill at ' + CAST( LEN( ISNULL( @JunkData2, '')) AS VARCHAR(10)) + ' characters.'

        INSERT INTO #TestData
        SELECT
            t.N % @DensityRecords AS GroupID, -- Adjust the modulo to decide the # of groups to compare.
            (t.N + 10) * 2 / 3 AS NumData1, -- Simply here as a filler
            t.N AS NumData2, -- Some more filler
            CAST( DATEADD( yyyy, 100, ABS( CHECKSUM(NEWID()) % (365 * @YearsInDate))) AS DATETIME) AS UpdateDate, -- Give me a random date
            @JunkData1 AS JunkData1, -- Never assigned above, so always NULL
            @JunkData2 AS JunkData2
        FROM
            #Tally AS t,
            #Tally AS t2
        WHERE
            t.N <= 50000
            AND t2.N <= 20 -- Control your testing rowcount here. 2 for 100k row test, 20 for 1 mill.

        -- Now, an intelligent index to make sure we only add to the tail, like a good prod system
        CREATE CLUSTERED INDEX idx_TestData
            ON #TestData
            ( UpdateDate, GroupID)

        SET @IndexSettingsLoop = 1

        WHILE @IndexSettingsLoop <= 2
        BEGIN
            -- A couple of quick metrics
            SET @RunTestLoop = 1
            SET @DurationTime = GETDATE()

            WHILE @RunTestLoop <= 5
            BEGIN
                --PRINT 'TestLoop: ' + CAST( @RunTestLoop AS VARCHAR(3))

                -- MAX() method
                ;WITH cte AS
                    ( SELECT
                        GroupID,
                        MAX( UpdateDate) AS MaxDate
                    FROM
                        #TestData --WITH ( INDEX(0))
                    GROUP BY
                        GroupID
                    )
                SELECT
                    td.GroupID,
                    td.NumData1,
                    td.NumData2,
                    td.UpdateDate
                FROM
                    #TestData AS td
                JOIN
                    cte
                    ON td.GroupID = cte.GroupID
                    AND td.UpdateDate = cte.MaxDate

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'MaxDate Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'MaxDate Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                -- ROW_NUMBER() method
                ;WITH cte AS
                    ( SELECT GroupID,
                        UpdateDate,
                        ROW_NUMBER() OVER ( PARTITION BY GroupID ORDER BY UpdateDate DESC) AS RowNum
                    FROM
                        #TestData)
                SELECT
                    td.GroupID,
                    td.NumData1,
                    td.NumData2,
                    td.UpdateDate
                FROM
                    #TestData AS td
                JOIN
                    cte
                    ON td.GroupID = cte.GroupID
                    AND td.UpdateDate = cte.UpdateDate
                WHERE
                    cte.RowNum = 1

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'RowNum Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'RowNum Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                -- CROSS APPLY/TOP 1 method
                SELECT
                    dst.GroupID,
                    ca.NumData1,
                    ca.NumData2,
                    ca.UpdateDate
                FROM
                    (SELECT DISTINCT GroupID FROM #TestData) AS dst
                CROSS APPLY
                    (SELECT TOP 1 GroupID, NumData1, NumData2, UpdateDate
                    FROM #TestData AS td2
                    WHERE td2.GroupID = dst.GroupID
                    ORDER BY UpdateDate DESC) AS ca

                INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
                VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'CrossApply Method', DATEDIFF( ms, @DurationTime, GETDATE()))
                --PRINT 'CrossApply Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
                SET @DurationTime = GETDATE()

                SET @RunTestLoop = @RunTestLoop + 1
            END

            SET @RunTestLoop = 1

            -- Create a descending nonclustered index for ROW_NUMBER()'s purposes
            IF @IndexSettingsLoop = 1
            BEGIN
                CREATE NONCLUSTERED INDEX idx_TestData2 ON #TestData
                    (GroupID, UpdateDate DESC)
            END

            SET @IndexSettingsLoop = @IndexSettingsLoop + 1
        END

        SET @DataSettingsLoop = @DataSettingsLoop + 1
    END

    /*
    -- Run after the above, as the maximum of 100 result grids gets overwhelmed otherwise.
    -- Iteration 1 on each loop deals with a caching problem, where whoever's first gets bad result data.
    DELETE FROM #TimeTests WHERE LoopIteration = 1

    SELECT * FROM #TimeTests
    ORDER BY
        SettingsIteration, IndexIteration, LoopIteration, StepComment

    SELECT
        SettingsIteration, IndexIteration, CAST( StepComment AS VARCHAR(25)) AS Step,
        AVG( MillisecondDuration) AS AvgDur
    FROM
        #TimeTests
    GROUP BY
        SettingsIteration, IndexIteration, StepComment
    ORDER BY
        SettingsIteration, IndexIteration, AvgDur
    */

    I get these results:

    SettingsIteration  IndexIteration  Step               AvgDur (ms)
    1                  1               CrossApply Method  25
    1                  1               MaxDate Method     36
    1                  1               RowNum Method      838
    1                  2               CrossApply Method  118
    1                  2               MaxDate Method     213
    1                  2               RowNum Method      410
    2                  1               CrossApply Method  117
    2                  1               MaxDate Method     141
    2                  1               RowNum Method      1046
    2                  2               CrossApply Method  196
    2                  2               RowNum Method      353
    2                  2               MaxDate Method     372
    3                  1               MaxDate Method     49
    3                  1               CrossApply Method  96
    3                  1               RowNum Method      984
    3                  2               MaxDate Method     56
    3                  2               CrossApply Method  91
    3                  2               RowNum Method      425
    4                  1               MaxDate Method     134
    4                  1               CrossApply Method  229
    4                  1               RowNum Method      1035
    4                  2               CrossApply Method  26
    4                  2               MaxDate Method     59
    4                  2               RowNum Method      342
    5                  1               MaxDate Method     195
    5                  1               RowNum Method      1062
    5                  1               CrossApply Method  3752
    5                  2               CrossApply Method  159
    5                  2               MaxDate Method     206
    5                  2               RowNum Method      515
    6                  1               MaxDate Method     252
    6                  1               RowNum Method      943
    6                  1               CrossApply Method  4311
    6                  2               CrossApply Method  148
    6                  2               MaxDate Method     173
    6                  2               RowNum Method      534


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • drew.allen (2/1/2011)


    Jeff Moden (2/1/2011)


    drew.allen (1/31/2011)


    Craig Farrell (1/31/2011)


    After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate

    My tests indicate the opposite. I've found that using Row_Number() is faster.

    Drew

    Ah... you have tests. 😉 You should post them. 🙂

    I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx

    Drew

    Sorry, I lost track of this thread. Thanks, Drew.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Craig Farrell (2/3/2011)


    Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    Wow! Lots'o code! What conclusion do you draw from it all, Craig?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/4/2011)


    Craig Farrell (2/3/2011)


    Hm, you appear to have hit on something here... especially depending on the indexing.

    Using this new benchmarking code:

    Wow! Lots'o code! What conclusion do you draw from it all, Craig?

    Please note: All testing is being done on the 2k5 optimizer.

    I've modified it a bit since I posted this to force a time onto the date to remove some repetition.
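
    The modified expression itself wasn't posted; a plausible sketch of forcing a time onto the date, with the millisecond math being an assumption, would be:

    DECLARE @YearsInDate INT
    SET @YearsInDate = 20

    -- Random date as in the original script, plus a random time-of-day in
    -- milliseconds, so duplicate UpdateDate values become rare.
    SELECT CAST( DATEADD( ms, ABS( CHECKSUM(NEWID())) % 86400000,
        DATEADD( yyyy, 100, ABS( CHECKSUM(NEWID()) % (365 * @YearsInDate)))) AS DATETIME) AS UpdateDate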

    At poor density heuristics (5-10%), Cross Apply and Max are pretty similar, no matter the indexing. Adding secondary nonclustered indexes requires hinting to the primary index, however, to keep the MAX looking at the right index for its purposes in all heuristics cases; otherwise the optimizer goes wandering off to the second index and loses some speed.
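
    The hint in question is presumably along these lines (a sketch; the benchmark script above carries a commented-out WITH ( INDEX(0)) for the same purpose):

    -- Pin the aggregate to the clustered index so the optimizer doesn't
    -- wander off to the nonclustered index idx_TestData2.
    SELECT GroupID, MAX( UpdateDate) AS MaxDate
    FROM #TestData WITH ( INDEX( idx_TestData))
    GROUP BY GroupID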

    Get into lighter densities (1%) and it's going to depend entirely on indexing. MAX wins if you've got a standard audit index, but if you can tune for it, the CrossApply can beat it. The worse the spread of the data in the clustered index, the worse CrossApply loses without a tuned NC index to work from.

    Get into very tight densities (0.1%) and this striation continues, to the point that without a dedicated index the rownumber() method can actually outperform it.

    Row_Number() and CrossApply require an index to help them perform. My usual auditing clustered (audit/change date + common record find method) works with the MAX method out of the box.

    This is going to depend HEAVILY on your indexing and how much you're willing to force hinting. If you force hinting, they're comparable either way.

    However, there's one thing that I haven't added here. MAX() doesn't guarantee a single row return. If your data dupes at the MAX, MAX() returns multiple rows. Row_Number() and Cross Apply do not.
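
    To make that concrete against the test table - the NumData2 tie-breaker below is an arbitrary choice for illustration, not from the benchmark itself:

    ;WITH cte AS
        ( SELECT GroupID, NumData1, NumData2, UpdateDate,
            ROW_NUMBER() OVER ( PARTITION BY GroupID
                ORDER BY UpdateDate DESC, NumData2 DESC) AS RowNum
        FROM #TestData)
    SELECT GroupID, NumData1, NumData2, UpdateDate
    FROM cte
    WHERE RowNum = 1 -- exactly one row per group, even when UpdateDate ties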

    There are a few more tests I'm planning to do before I consider this conclusive. First, I need to muck with index hinting more. Second, I want to do it for a realistic scenario, such as a price at a specific date (think historic invoicing). Third, I want to add in a second grouping layer and test for primary and secondary groups. No one ever looks at everything that's the latest in an audit table; they're hunting for something. They'll hunt by a specific item, or perhaps across a group of items. That will affect the optimizer heavily.

    For bulk reading, I'd stick with MAX(), simply because of my own personal methodology and less indexing overhead.

    For real world usage... I'll get back to you. 🙂

    EDIT: The only unstruck portion of that is still accurate. I was missing a few significant components in my testing. This is getting to white paper proportions.

    Short version: Go with Cross Apply for now. Todd's right in general. I may reverse this later but everything I said above needs to get chucked out the window, wholesale. I may bring this to a fresh thread and see about getting some of you gurus involved in this. I'm starting to get in too deep.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • drew.allen (2/1/2011)


    Jeff Moden (2/1/2011)


    drew.allen (1/31/2011)


    Craig Farrell (1/31/2011)


    After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate

    My tests indicate the opposite. I've found that using Row_Number() is faster.

    Drew

    Ah... you have tests. 😉 You should post them. 🙂

    I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx

    Drew

    Apologies, I never did comment specifically on this point in detail.

    Drew, you're probably correct that Row_Number() is better when you're returning everything but specific rows. In this case, however, we're looking to return isolatable rows, rather than everything but the final row.

    The problem addressed in that other thread is to return everything inside a triangle join (all records prior to the last). MAX() might be optimizable to be faster there, but I'd have to think about it. It's not something that comes up commonly enough for me to have looked at optimizing it.
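
    For comparison, one way to phrase "everything but the final row" without the triangle join, sketched against the test table from earlier in this thread:

    ;WITH cte AS
        ( SELECT GroupID, NumData1, NumData2, UpdateDate,
            ROW_NUMBER() OVER ( PARTITION BY GroupID ORDER BY UpdateDate DESC) AS RowNum
        FROM #TestData)
    SELECT GroupID, NumData1, NumData2, UpdateDate
    FROM cte
    WHERE RowNum > 1 -- every history row except each group's newest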


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
