February 3, 2011 at 2:59 pm
Craig Farrell (2/2/2011)
tfifield (2/2/2011)
At some point as the number of rows increases within the partition, the CROSS APPLY will start to pull away from either MAX or ROW_NUMBER(), since it will seek rather than have to touch upon each row.
That makes sense. If you chance upon the article again, I'd love to see the benchmarking.
Craig,
I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.
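(A quick sketch of what 'full statistics on the partitioning column' means in practice - the table and column names below are placeholders, not from my actual tests:)
-- Build full-sample statistics on the partitioning column so the
-- optimizer has accurate density information for it
CREATE STATISTICS st_TestData_GroupID
ON #TestData (GroupID)
WITH FULLSCAN
-- Or refresh all existing statistics on the table at full sample
UPDATE STATISTICS #TestData WITH FULLSCAN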
In any case, I'll try to find the tests and post them.
Todd Fifield
February 3, 2011 at 4:14 pm
tfifield (2/3/2011)
Craig,
I'll try to look up my benchmark tests on this. I seem to recall that the case where ROW_NUMBER() would beat the CROSS APPLY/TOP 1 construct was when the partitioning column had full statistics on it. It seemed that the optimizer knew that there were only a few values within the partition and that a scan with a minimal sort was less overhead than an index seek - even on a covering index.
In any case, I'll try to find the tests and post them.
Todd Fifield
Todd,
Did your select look something like this?
SELECT
dst.GroupID,
ca.NumData1,
ca.NumData2,
ca.UpdateDate
FROM
(SELECT DISTINCT GroupID FROM #TestData) AS dst
CROSS APPLY
(SELECT TOP 1 groupId, NumData1, NumData2, UpdateDate
FROM #TestData AS td2
WHERE td2.GroupID = dst.GroupID
ORDER BY UpdateDate DESC) AS ca
If so, I'll toss it into the benchmarking here with a few options for density and the like and we'll see what happens.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
February 3, 2011 at 5:15 pm
Hm, you appear to have hit on something here... especially depending on the indexing.
Using this new benchmarking code:
SET NOCOUNT ON
IF OBJECT_ID( 'tempdb..#Tally') IS NOT NULL
DROP TABLE #Tally
IF OBJECT_ID( 'tempdb..#TimeTests') IS NOT NULL
DROP TABLE #TimeTests
CREATE TABLE #TimeTests
(SettingsIteration INT NOT NULL,
IndexIteration INT NOT NULL,
LoopIteration INT NOT NULL,
StepComment VARCHAR(100) NOT NULL,
MillisecondDuration BIGINT NOT NULL
)
-- Slightly modified from tally article
SELECT TOP 50000
IDENTITY(INT,1,1) AS N
INTO #Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2
ALTER TABLE #Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
DECLARE @DensityRecords INT,
@JunkData1 VARCHAR(200),
@JunkData2 VARCHAR(2000),
@YearsInDate INT,
@DurationTime DATETIME,
@RunTestLoop INT,
@IndexSettingsLoop INT,
@DataSettingsLoop INT
SELECT @DataSettingsLoop = 1,
@YearsInDate = 20
WHILE @DataSettingsLoop <= 6
BEGIN
IF OBJECT_ID( 'tempdb..#TestData') IS NOT NULL
DROP TABLE #TestData
CREATE TABLE #TestData
(GroupID INT,
NumData1 INT,
NumData2 INT,
UpdateDate DATETIME,
JunkData1 VARCHAR(200),
JunkData2 VARCHAR(2000)
)
IF @DataSettingsLoop = 1
BEGIN
SELECT @DensityRecords = 10,
@JunkData2 = NULL
END
ELSE IF @DataSettingsLoop = 2
BEGIN
SELECT @DensityRecords = 10,
@JunkData2 = REPLICATE ( '@', 2000)
END
ELSE IF @DataSettingsLoop = 3
BEGIN
SELECT @DensityRecords = 1000,
@JunkData2 = NULL
END
ELSE IF @DataSettingsLoop = 4
BEGIN
SELECT @DensityRecords = 1000,
@JunkData2 = REPLICATE ( '@', 2000)
END
ELSE IF @DataSettingsLoop = 5
BEGIN
SELECT @DensityRecords = 100000,
@JunkData2 = NULL
END
ELSE IF @DataSettingsLoop = 6
BEGIN
SELECT @DensityRecords = 100000,
@JunkData2 = REPLICATE ( '@', 2000)
END
--PRINT 'Density Set to: ' + CAST( @DensityRecords AS VARCHAR(15)) + ', with a JunkData2 Fill at ' + CAST( LEN( ISNULL( @JunkData2, '')) AS VARCHAR(10)) + ' characters.'
INSERT INTO #TestData
SELECT
t.N % @DensityRecords AS GroupID, -- Adjust the modulo to decide the # of groups to compare.
(t.n + 10) * 2 / 3 AS NumData1, -- Simply here as a filler
t.n AS NumData2, -- Some more filler
CAST( DATEADD( yyyy, 100, ABS( CHECKSUM(NEWID())% (365 * @YearsInDate))) AS DATETIME) AS UpdateDate, -- Give me a random date
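-- Note: @JunkData1 is declared but never assigned above, so this column always inserts as NULL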
@JunkData1 AS JunkData1,
@JunkData2 AS JunkData2
FROM
#Tally AS t,
#Tally AS t2
WHERE
t.N <= 50000
AND t2.N <= 20 --Control your testing rowcount here. 2 for 100k row test, 20 for 1mill.
-- Now, an intelligent index to make sure we only add to the tail, like a good prod system
CREATE CLUSTERED INDEX idx_TestData
ON #TestData
( UpdateDate, GroupID)
SET @IndexSettingsLoop = 1
WHILE @IndexSettingsLoop <= 2
BEGIN
-- A couple of quick metrics
SET @RunTestLoop = 1
SET @DurationTime = GETDATE()
WHILE @RunTestLoop <=5
BEGIN
--Print 'TestLoop: ' + CAST( @RunTestLoop AS VARCHAR(3))
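-- MaxDate method: find each group's MAX(UpdateDate), then join back for the full row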
; WITH cte AS
( SELECT
groupID,
Max( UpdateDate) AS MaxDate
FROM
#TestData --WITH ( INDEX(0))
GROUP BY
GroupID
)
SELECT
td.GroupID,
td.NumData1,
td.NumData2,
td.UpdateDate
FROM
#TestData AS td
JOIN
cte
ON td.GroupID = cte.GroupID
AND td.Updatedate = cte.MaxDate
INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'MaxDate Method', DATEDIFF( ms, @DurationTime, getdate()))
--PRINT 'MaxDate Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
SET @DurationTime = GETDATE()
-- RowNumber() method
; WITH cte AS
( SELECT GroupID,
UpdateDate,
ROW_NUMBER() OVER ( Partition By GroupID ORDER BY UpdateDate DESC) AS RowNum
FROM
#TestData)
SELECT
td.GroupID,
td.NumData1,
td.NumData2,
td.UpdateDate
FROM
#TestData AS td
JOIN
cte
ON td.GroupID = cte.GroupID
AND td.UpdateDate = cte.UpdateDate
WHERE
cte.RowNum = 1
INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'RowNum Method', DATEDIFF( ms, @DurationTime, getdate()))
--PRINT 'RowNum Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
SET @DurationTime = GETDATE()
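-- CrossApply method: TOP 1 per distinct group, ordered by UpdateDate DESC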
SELECT
dst.GroupID,
ca.NumData1,
ca.NumData2,
ca.UpdateDate
FROM
(SELECT DISTINCT GroupID FROM #TestData) AS dst
CROSS APPLY
(SELECT TOP 1 groupId, NumData1, NumData2, UpdateDate
FROM #TestData AS td2
WHERE td2.GroupID = dst.GroupID
ORDER BY UpdateDate DESC) AS ca
INSERT INTO #TimeTests (SettingsIteration, IndexIteration, LoopIteration, StepComment, MillisecondDuration)
VALUES( @DataSettingsLoop, @IndexSettingsLoop, @RunTestLoop, 'CrossApply Method', DATEDIFF( ms, @DurationTime, getdate()))
--PRINT 'CrossApply Method took: ' + CAST( DATEDIFF( ms, @DurationTime, getdate()) AS VARCHAR(30)) + ' milliseconds.'
SET @DurationTime = GETDATE()
SET @RunTestLoop = @RunTestLoop + 1
END
SET @RunTestLoop = 1
-- Create a descending nonclustered index for the Row_Number() method's purposes
IF @IndexSettingsLoop = 1
BEGIN
CREATE NONCLUSTERED INDEX idx_TestData2 ON #TestData
(GroupID, UpdateDate Desc)
END
SET @IndexSettingsLoop = @IndexSettingsLoop + 1
END
SET @DataSettingsLoop = @DataSettingsLoop + 1
END
/*
-- Run after the above completes, as the 100-grid maximum on returned results is otherwise overwhelmed
-- Iteration 1 on each loop deals with a caching problem, where whichever method runs first records bad timings.
DELETE FROM #TimeTests WHERE LoopIteration = 1
SELECT * FROM #TimeTests
ORDER BY
SettingsIteration, IndexIteration, LoopIteration, StepComment
SELECT
SettingsIteration, IndexIteration, CAST( StepComment AS VARCHAR(25)) AS Step,
AVG( MillisecondDuration) AS AvgDur
FROM
#TimeTests
GROUP BY
SettingsIteration, IndexIteration, StepComment
ORDER BY
SettingsIteration, IndexIteration, AvgDur
*/
I get these results (columns: SettingsIteration, IndexIteration, Step, average duration in ms):
1 1 CrossApply Method 25
1 1 MaxDate Method 36
1 1 RowNum Method 838
1 2 CrossApply Method 118
1 2 MaxDate Method 213
1 2 RowNum Method 410
2 1 CrossApply Method 117
2 1 MaxDate Method 141
2 1 RowNum Method 1046
2 2 CrossApply Method 196
2 2 RowNum Method 353
2 2 MaxDate Method 372
3 1 MaxDate Method 49
3 1 CrossApply Method 96
3 1 RowNum Method 984
3 2 MaxDate Method 56
3 2 CrossApply Method 91
3 2 RowNum Method 425
4 1 MaxDate Method 134
4 1 CrossApply Method 229
4 1 RowNum Method 1035
4 2 CrossApply Method 26
4 2 MaxDate Method 59
4 2 RowNum Method 342
5 1 MaxDate Method 195
5 1 RowNum Method 1062
5 1 CrossApply Method 3752
5 2 CrossApply Method 159
5 2 MaxDate Method 206
5 2 RowNum Method 515
6 1 MaxDate Method 252
6 1 RowNum Method 943
6 1 CrossApply Method 4311
6 2 CrossApply Method 148
6 2 MaxDate Method 173
6 2 RowNum Method 534
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
February 4, 2011 at 6:35 am
drew.allen (2/1/2011)
Jeff Moden (2/1/2011)
drew.allen (1/31/2011)
Craig Farrell (1/31/2011)
After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate.
My tests indicate the opposite. I've found that using Row_Number() is faster.
Drew
Ah... you have tests. 😉 You should post them. 🙂
I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx
Drew
Sorry, I lost track of this thread. Thanks, Drew.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 4, 2011 at 6:38 am
Craig Farrell (2/3/2011)
Hm, you appear to have hit on something here... especially depending on the indexing.
Using this new benchmarking code:
Wow! Lots'o code! What conclusion do you draw from it all, Craig?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 4, 2011 at 10:32 am
Jeff Moden (2/4/2011)
Craig Farrell (2/3/2011)
Hm, you appear to have hit on something here... especially depending on the indexing.
Using this new benchmarking code:
Wow! Lots'o code! What conclusion do you draw from it all, Craig?
Please note: All testing is being done on the 2k5 optimizer.
I've modified it a bit since I posted this, to force a time component onto the dates and remove some repetition.
At poor density heuristics (5-10%), CROSS APPLY and MAX are pretty similar, no matter the indexing. The inclusion of secondary nonclustered indexes requires hinting to the primary index, though, to keep the MAX looking at the right index in all heuristics cases; the optimizer goes wandering off onto the second index and loses some speed.
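(For illustration, pinning the MAX to the clustered audit index uses the same hint syntax that's commented out in the benchmark above - a sketch against that script's #TestData and idx_TestData:)
; WITH cte AS
( SELECT
GroupID,
MAX( UpdateDate) AS MaxDate
FROM
#TestData WITH ( INDEX( idx_TestData)) -- pin the scan to the clustered audit index
GROUP BY
GroupID
)
SELECT
td.GroupID,
td.NumData1,
td.NumData2,
td.UpdateDate
FROM
#TestData AS td
JOIN
cte
ON td.GroupID = cte.GroupID
AND td.UpdateDate = cte.MaxDate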
Get into lighter densities (1%) and it's going to depend entirely on indexing. MAX wins if you've got a standard audit index, but if you can tune for it, the CROSS APPLY can beat it. The worse the spread of the data in the clustered index, the worse CROSS APPLY loses without a tuned NC index to work off of.
Get into very tight densities (0.1%) and this striation continues, to the point that, without a dedicated index, the ROW_NUMBER() method can actually outperform it.
Row_Number() and CrossApply require an index to help them perform. My usual auditing clustered index (audit/change date + common record-find column) works with the MAX method out of the box.
This is going to depend HEAVILY on your indexing and how much you're willing to force hinting. If you force hinting, they're comparable either way.
However, there's one thing that I haven't added here. MAX() doesn't guarantee a single row return. If your data dupes at the MAX, MAX() returns multiple rows. Row_Number() and Cross Apply do not.
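(A contrived sketch of the dupe issue - the #DupeDemo table is hypothetical, not part of the benchmark:)
CREATE TABLE #DupeDemo (GroupID INT, Val INT, UpdateDate DATETIME)
-- Two rows in the same group tie at the maximum date
INSERT INTO #DupeDemo VALUES (1, 10, '20110201')
INSERT INTO #DupeDemo VALUES (1, 20, '20110201')
-- The MAX join returns BOTH rows, because both tie at the max
SELECT d.GroupID, d.Val, d.UpdateDate
FROM #DupeDemo AS d
JOIN ( SELECT GroupID, MAX( UpdateDate) AS MaxDate
FROM #DupeDemo
GROUP BY GroupID) AS mx
ON d.GroupID = mx.GroupID
AND d.UpdateDate = mx.MaxDate
-- TOP 1 in the CROSS APPLY, or a direct WHERE RowNum = 1 filter, returns exactly one of them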
There are a few more tests I'm planning to do before I consider this conclusive. First, I need to muck with index hinting more. Second, I want to do it for a realistic scenario, such as a price at a specific date (think historic invoicing). Third, I want to add in a second grouping layer and test for primary and secondary groups. No one ever looks at everything that's the latest in an audit table; they're hunting for something. They'll hunt by a specific item, or perhaps across a group of items. That will affect the optimizer heavily.
For bulk reading, I'd stick with MAX(), simply because of my own personal methodology and less indexing overhead.
For real world usage... I'll get back to you. 🙂
EDIT: The only unstruck portion of that is still accurate. I was missing a few significant components in my testing. This is getting to white paper proportions.
Short version: Go with Cross Apply for now. Todd's right in general. I may reverse this later but everything I said above needs to get chucked out the window, wholesale. I may bring this to a fresh thread and see about getting some of you gurus involved in this. I'm starting to get in too deep.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
February 4, 2011 at 1:10 pm
drew.allen (2/1/2011)
Jeff Moden (2/1/2011)
drew.allen (1/31/2011)
Craig Farrell (1/31/2011)
After having done some other tests, this method seems to be the quickest, especially if you have a solid index on Item/ChangePriceDate.
My tests indicate the opposite. I've found that using Row_Number() is faster.
Drew
Ah... you have tests. 😉 You should post them. 🙂
I did post them in another thread http://www.sqlservercentral.com/Forums/FindPost1054915.aspx
Drew
Apologies, I never did comment specifically on this point in detail.
Drew, you're probably correct that Row_Number() is better when you're returning everything but specific rows. In this case, however, we're looking to return isolatable rows, rather than everything but the final row.
The problem addressed in that other thread is to return everything inside a triangle join (all records prior to the last). MAX() might be optimizable to be faster there, but I'd have to think about it. It's not something that comes up commonly enough for me to have discussed optimization on it.
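(For reference, the 'everything prior to the last' shape from that thread reads naturally with ROW_NUMBER() - a sketch against this thread's #TestData columns:)
-- All rows per group EXCEPT the most recent one
; WITH cte AS
( SELECT
GroupID,
NumData1,
NumData2,
UpdateDate,
ROW_NUMBER() OVER ( PARTITION BY GroupID ORDER BY UpdateDate DESC) AS RowNum
FROM
#TestData
)
SELECT
GroupID,
NumData1,
NumData2,
UpdateDate
FROM
cte
WHERE
RowNum > 1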
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA