June 6, 2013 at 8:52 am
I've been mulling over this for a while, and there's something about using the APPLY operator (OUTER APPLY in this case) that bugs me: it seems to force the contents of the APPLY to be executed as many times as there are input rows. I'm sure I've heard people say this shouldn't be the case, so I've finally got round to putting all the sample code together to find out where I'm going wrong.
Basically, I have two tables, ServicerHistory and ServicerHistoryCommissionRate. ServicerHistory contains details of an account and who is servicing it (I've only bothered including the ServicerHistoryID in the sample below as the rest isn't pertinent to this example), whilst ServicerHistoryCommissionRate contains the effective date of the commission rates (if there are any) applied to that account.
DDL for the tables and indexes:-
--create objects
IF OBJECT_ID('dbo.ServicerHistoryCommissionRate') IS NOT NULL
DROP TABLE dbo.ServicerHistoryCommissionRate
IF OBJECT_ID('dbo.ServicerHistory') IS NOT NULL
DROP TABLE dbo.ServicerHistory
CREATE TABLE dbo.ServicerHistory
(ServicerHistoryID int IDENTITY(1,1) PRIMARY KEY)
GO
CREATE TABLE dbo.ServicerHistoryCommissionRate
(ServicerHistoryCommissionRateID int IDENTITY(1,1) PRIMARY KEY
, ServicerHistoryID int NOT NULL FOREIGN KEY REFERENCES dbo.ServicerHistory(ServicerHistoryID)
, CommissionRate decimal(5,2) NOT NULL
, EffectiveDate date NOT NULL)
GO
CREATE INDEX idx_ServicerHistoryID_EffectiveDate
ON dbo.ServicerHistoryCommissionrate (ServicerHistoryID, EffectiveDate DESC)
INCLUDE (CommissionRate)
GO
And some data population (adjust the executions as you see fit):-
--populate with some sample data
INSERT INTO dbo.ServicerHistory
DEFAULT VALUES
GO 100000
--set up some dates
DECLARE @Dates TABLE
(SomeDate date NOT NULL)
INSERT INTO @Dates
(SomeDate)
SELECT '2011-07-12'
UNION ALL
SELECT '2013-03-02'
UNION ALL
SELECT '2010-08-13'
UNION ALL
SELECT '2011-01-02'
UNION ALL
SELECT '2013-05-03'
UNION ALL
SELECT '2009-12-18'
--set up commission rates
DECLARE @CommRates TABLE
(Commissionrate decimal(5,2))
INSERT INTO @CommRates
(Commissionrate)
SELECT 32.35
UNION ALL
SELECT 25
UNION ALL
SELECT 12.8
UNION ALL
SELECT 10
UNION ALL
SELECT 7.4
--populate some servicer history ids with valid dates and commission rates
INSERT INTO dbo.ServicerHistoryCommissionRate
SELECT TOP 20000 sh.ServicerHistoryID
, Commrate.CommissionRate
, Dates.SomeDate
FROM dbo.ServicerHistory sh
CROSS APPLY (SELECT TOP 1 CommissionRate
FROM @CommRates AS cr
ORDER BY NEWID()) Commrate
CROSS APPLY (SELECT TOP 1 SomeDate
FROM @Dates AS d
ORDER BY NEWID()) Dates
LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shcr
ON sh.ServicerHistoryID = shcr.ServicerHistoryID
AND dates.SomeDate = shcr.EffectiveDate
WHERE shcr.ServicerHistoryCommissionRateID IS NULL
ORDER BY NEWID()
GO 50
The three methods of retrieving the data that I've come up with are below:-
--retrieve the most recent commission rate for each row using outer apply (there may not be a rate)
--uh oh, the subquery is executed once for each row in the ServicerHistory table
SELECT sh.ServicerHistoryId
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
OUTER APPLY (SELECT TOP 1
SHRC.CommissionRate
FROM dbo.ServicerHistoryCommissionRate AS SHRC
WHERE SHRC.ServicerHistoryID = SH.ServicerHistoryId
ORDER BY SHRC.EffectiveDate DESC
) SHRC
--try using ROW_NUMBER; whoo, each table is only touched once
SELECT sh.ServicerHistoryId
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
LEFT OUTER JOIN (SELECT SHRC.CommissionRate
, ServicerHistoryId
, ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
FROM dbo.ServicerHistoryCommissionRate AS SHRC
) SHRC
ON sh.ServicerHistoryId = shrc.ServicerHistoryID
AND SHRC.Rownum = 1
--try the old-fashioned double-join method; the ServicerHistoryCommissionRate table is touched twice so it's more expensive
SELECT sh.ServicerHistoryID
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
LEFT OUTER JOIN (SELECT commid.ServicerHistoryID
, MAX(ServicerHistoryCommissionRateID) ServicerHistoryCommissionRateID
FROM dbo.ServicerHistoryCommissionRate commid
LEFT OUTER JOIN (SELECT ServicerHistoryID
,MAX(EffectiveDate) MaxDate
FROM dbo.ServicerHistoryCommissionRate
GROUP BY ServicerHistoryID) commdate
ON commid.ServicerHistoryId = commdate.ServicerHistoryID
GROUP BY commid.ServicerHistoryID)comm
ON sh.ServicerHistoryId = comm.servicerHistoryID
LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shrc
ON comm.ServicerHistoryCommissionRateID = shrc.ServicerHistoryCommissionRateID
STATISTICS TIME and IO return the following:-
(100000 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 100000, logical reads 319323, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 7, logical reads 484, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 639 ms, elapsed time = 845 ms.
(100000 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 1852, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 1, logical reads 163, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 327 ms, elapsed time = 1040 ms.
(100000 row(s) affected)
Table 'ServicerHistory'. Scan count 7, logical reads 484, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistoryCommissionRate'. Scan count 14, logical reads 4098, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 767 ms, elapsed time = 1847 ms.
The execution plans for the above are attached; as you can see, the first query, using the APPLY method, executes the index seek on ServicerHistoryCommissionRate 100k times (the real data this comes from has several million rows in each table, so it only gets worse!), making APPLY seem by far the most inefficient method of the three when returning a set.
Am I missing something about the way to use APPLY when dealing with sets, or is it just not good for that kind of query? I like APPLY from a "readable code" point of view, but performance-wise it just seems a bit horrible unless you're working with a very small set of rows.
Any tips, pointers or observations are most welcome!
Follow me on twitter @EvoDBA. Check out my blog Natural Selection DBA.
June 6, 2013 at 9:24 am
Matthew Darwin (6/6/2013)
...Any tips, pointers or observations are most welcome!
Have you tried using the row_number method with the outer apply so you have an even playing field?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY (I) and (II) by Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop by Jeff Moden
June 6, 2013 at 9:31 am
SELECT
sh.ServicerHistoryId,
shrc.CommissionRate
FROM dbo.ServicerHistory sh
OUTER APPLY (
SELECT
d.CommissionRate
FROM (
SELECT
ishrc.CommissionRate,
RowNum = ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC)
FROM dbo.ServicerHistoryCommissionRate ishrc
WHERE ishrc.ServicerHistoryID = sh.ServicerHistoryID
) d
WHERE d.Rownum = 1
) shrc
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY (I) and (II) by Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop by Jeff Moden
June 6, 2013 at 9:33 am
I've not; but I think that's a little bit of my point - it isn't an even playing field, and despite APPLY seeming to be designed for this kind of use, in reality it doesn't fare so well. I'm guessing that using APPLY will make it worse, because instead of simply running 100k SELECT TOP 1 queries, you're going to run 100k queries that each generate row numbers for their subquery, whereas with a join you only need a single pass to generate the row numbers - especially since you can't filter on the output of ROW_NUMBER until after it's been produced.
I'm assuming you mean something like this:-
SELECT sh.ServicerHistoryId
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
OUTER APPLY (SELECT SHRC.CommissionRate
, ServicerHistoryId
, ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
FROM dbo.ServicerHistoryCommissionRate AS SHRC
WHERE SHRC.ServicerHistoryID = SH.ServicerHistoryId
) SHRC
WHERE SHRC.Rownum = 1
Follow me on twitter @EvoDBA. Check out my blog Natural Selection DBA.
June 6, 2013 at 9:39 am
Surely this is because the OUTER APPLY contains a filter based on the outer table, which will cause an RBAR operation as each row has to be evaluated.
The LEFT OUTER JOIN is a set operation.
_________________________________________________________________________
SSC Guide to Posting and Best Practices
June 6, 2013 at 9:46 am
That's entirely my conclusion; however, I've had people tell me that it shouldn't do that (unless they're using it in another way so that the filter isn't applied in that manner, which may well be my problem).
It just seems to make the apply operator a little pointless to me from a performance stance.
I have a sneaky suspicion I may just not quite be getting it though. 😀
Follow me on twitter @EvoDBA. Check out my blog Natural Selection DBA.
June 6, 2013 at 10:48 am
APPLY is definitely not the fastest method in every scenario, but there are plenty of scenarios where it can outstrip other approaches.
In this particular comparison against ROW_NUMBER, although it's not the most efficient, note that it's the only one of the three that enables a seek against your index. With other distributions of test data, it could well end up being the most efficient method.
Unless the query can be rewritten by the optimiser into a JOIN (in which case it becomes identical), you will always end up with a nested loops plan that looks like this. Remember that APPLY is also an efficient way of calling TVFs (CLR and T-SQL), including XMLQuery, and that's one of its primary uses.
I'm not sure I'd define something that uses Nested Loops + Index Seeks within a plan as RBAR; a JOIN produces exactly the same shape in many scenarios.
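As an aside, to illustrate that TVF use, here's a minimal sketch over the thread's schema; dbo.GetLatestRate is hypothetical (not part of the original posts), just an inline TVF wrapping the same TOP 1 logic:
--hypothetical inline TVF, for illustration only
CREATE FUNCTION dbo.GetLatestRate (@ServicerHistoryID int)
RETURNS TABLE
AS
RETURN
(SELECT TOP 1 shcr.CommissionRate
FROM dbo.ServicerHistoryCommissionRate AS shcr
WHERE shcr.ServicerHistoryID = @ServicerHistoryID
ORDER BY shcr.EffectiveDate DESC)
GO
--APPLY feeds each outer row's key into the function, row by row;
--a plain JOIN has no way to pass a per-row parameter like this
SELECT sh.ServicerHistoryID
, lr.CommissionRate
FROM dbo.ServicerHistory AS sh
OUTER APPLY dbo.GetLatestRate(sh.ServicerHistoryID) AS lr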
June 7, 2013 at 4:22 am
If we skew your test data so there are a lot of rows per "ServicerHistoryID" that are not the top 1, you can see a dramatic switch in the cost of the three queries.
Test data setup:
--create objects
IF OBJECT_ID('dbo.ServicerHistoryCommissionRate') IS NOT NULL
DROP TABLE dbo.ServicerHistoryCommissionRate
IF OBJECT_ID('dbo.ServicerHistory') IS NOT NULL
DROP TABLE dbo.ServicerHistory
CREATE TABLE dbo.ServicerHistory
(ServicerHistoryID int IDENTITY(1,1) PRIMARY KEY, SomeOthercolumn BIT NULL)
GO
CREATE TABLE dbo.ServicerHistoryCommissionRate
(ServicerHistoryCommissionRateID int IDENTITY(1,1) PRIMARY KEY
, ServicerHistoryID int NOT NULL FOREIGN KEY REFERENCES dbo.ServicerHistory(ServicerHistoryID)
, CommissionRate decimal(5,2) NOT NULL
, EffectiveDate date NOT NULL)
GO
CREATE INDEX idx_ServicerHistoryID_EffectiveDate
ON dbo.ServicerHistoryCommissionrate (ServicerHistoryID, EffectiveDate DESC)
INCLUDE (CommissionRate)
GO
--populate with some sample data
--GO 100000 was taking ages, so I added a meaningless column to do this in a set-based way :)
INSERT INTO dbo.ServicerHistory
SELECT TOP 1000 CAST(NULL AS BIT) FROM master.sys.columns a, master.sys.columns b
--set up some dates (added a lot more dates to skew the data distribution more towards seeks winning)
DECLARE @Dates TABLE
(SomeDate date NOT NULL)
INSERT INTO @Dates
(SomeDate)
SELECT DATEADD(day, n, '20100101')
FROM (SELECT TOP 3000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) n
FROM master.sys.columns a, master.sys.columns b) nums
--populate some servicer history ids with valid dates and commission rates
INSERT INTO dbo.ServicerHistoryCommissionRate
SELECT sh.ServicerHistoryID
, CHECKSUM(NEWID())/100000000.0 CommissionRate
, Dates.SomeDate
FROM dbo.ServicerHistory sh
CROSS APPLY (SELECT TOP 1000 SomeDate
FROM @Dates AS d
ORDER BY NEWID()) Dates
LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shcr
ON sh.ServicerHistoryID = shcr.ServicerHistoryID
AND dates.SomeDate = shcr.EffectiveDate
WHERE shcr.ServicerHistoryCommissionRateID IS NULL
Three queries identical to the originals. Plans attached.
--APPLY
(1000 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 1000, logical reads 3213, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 15 ms.
--ROW_NUMBER()
(1000 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 3137, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 468 ms, elapsed time = 475 ms.
--Double Join
(1000 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 6208, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 437 ms, elapsed time = 440 ms.
June 7, 2013 at 6:27 am
OK, now that's definitely interesting; I'll try it on some larger datasets with that sort of skew later and post back (just done this, and it stays consistent with your findings).
I had also noticed that if I take a small set of ServicerHistoryIDs as an input to the main ServicerHistory table, the APPLY operator works better; I'm assuming that's because the ROW_NUMBER function still has to be calculated for the entire ServicerHistoryCommissionRate table, whereas you only need to run the TOP 1 in the APPLY for the filtered set. In those cases, the double join is also quicker than the ROW_NUMBER version.
DECLARE @ServicerHistoryIDs TABLE
(ServicerHistoryID int PRIMARY KEY)
INSERT INTO @ServicerHistoryIDs
SELECT TOP 100 ServicerHistoryID
FROM dbo.ServicerHistory
--retrieve the most recent commission rate for each row using outer apply (there may not be a rate)
SELECT sh.ServicerHistoryId
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
INNER JOIN @ServicerHistoryIDs AS shid
ON sh.ServicerHistoryID = shid.ServicerHistoryID
OUTER APPLY (SELECT TOP 1
SHRC.CommissionRate
FROM dbo.ServicerHistoryCommissionRate AS SHRC
WHERE SHRC.ServicerHistoryID = SH.ServicerHistoryId
ORDER BY SHRC.EffectiveDate DESC
) SHRC
--try using ROW_NUMBER
SELECT sh.ServicerHistoryId
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
INNER JOIN @ServicerHistoryIDs AS shid
ON sh.ServicerHistoryID = shid.ServicerHistoryID
LEFT OUTER JOIN (SELECT SHRC.CommissionRate
, ServicerHistoryId
, ROW_NUMBER() OVER (PARTITION BY ServicerHistoryID ORDER BY EffectiveDate DESC) RowNum
FROM dbo.ServicerHistoryCommissionRate AS SHRC
) SHRC
ON sh.ServicerHistoryId = shrc.ServicerHistoryID
AND SHRC.Rownum = 1
--try old fashioned double join method
SELECT sh.ServicerHistoryID
, shrc.CommissionRate
FROM dbo.ServicerHistory AS sh
INNER JOIN @ServicerHistoryIDs AS shid
ON sh.ServicerHistoryID = shid.ServicerHistoryID
LEFT OUTER JOIN (SELECT commid.ServicerHistoryID
, MAX(ServicerHistoryCommissionRateID) ServicerHistoryCommissionRateID
FROM dbo.ServicerHistoryCommissionRate commid
LEFT OUTER JOIN (SELECT ServicerHistoryID
,MAX(EffectiveDate) MaxDate
FROM dbo.ServicerHistoryCommissionRate
GROUP BY ServicerHistoryID) commdate
ON commid.ServicerHistoryId = commdate.ServicerHistoryID
GROUP BY commid.ServicerHistoryID)comm
ON sh.ServicerHistoryId = comm.servicerHistoryID
LEFT OUTER JOIN dbo.ServicerHistoryCommissionRate shrc
ON comm.ServicerHistoryCommissionRateID = shrc.ServicerHistoryCommissionRateID
(100 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 100, logical reads 300, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(100 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(100 row(s) affected)
Table 'ServicerHistoryCommissionRate'. Scan count 100, logical reads 601, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ServicerHistory'. Scan count 0, logical reads 200, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#17B73214'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Hm, so as per the norm, the answer to which performs best is "it depends". Ho-hum.
Follow me on twitter @EvoDBA. Check out my blog Natural Selection DBA.
June 7, 2013 at 7:13 am
Matthew Darwin (6/7/2013)
I had also noticed that if I take a small set of ServicerHistoryIDs as an input to the main ServicerHistory table, then the APPLY operator worked better; I'm assuming because the ROW_NUMBER function still has to be calculated for the entire ServicerHistory table, whereas you only need to run the TOP 1 in the APPLY for the filtered set. In those cases, the double join is also quicker than the ROW_NUMBER version.
Yep. Basically, filtering on the result of one of the analytical functions (only selecting rows where ROW_NUMBER = 1) will, AFAIK, always result in a scan across all the rows over which ROW_NUMBER is applied. Therefore, it's likely to be the most efficient method where more than a percent or so of the overall number of rows match your filter. Seek + key lookup is more expensive than people assume, and will only "win" where you're retrieving less than a percent or so (approximately; as usual, it depends) of the overall number of rows.
In cases where the optimiser can make this decision itself, e.g. with regular joins and predicates, it will choose whether to seek or scan based on the statistics (check out Gail's blog on the subject), but in this kind of scenario it's not able (or not clever enough) to do so. Therefore, knowing the distribution of your data and being able to make an educated decision about whether to use ROW_NUMBER, APPLY, or something else can be critical.
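For example, a quick aggregate like the sketch below (using only the tables from this thread) gives a rough feel for that distribution before you pick a pattern:
--rough sketch: how many rate rows does each ServicerHistoryID carry?
SELECT shcr.ServicerHistoryID
, COUNT(*) AS RowsPerID
FROM dbo.ServicerHistoryCommissionRate shcr
GROUP BY shcr.ServicerHistoryID
ORDER BY RowsPerID DESC
--if most IDs carry many rows, the seek-per-ID APPLY pattern tends to win;
--if each ID carries only a few rows, the single-scan ROW_NUMBER pattern often does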