Efficiency of a "Most Recent for Each" join

Question

Post reply

Efficiency of a "Most Recent for Each" join

jeffkretz

SSC Eights!

Points: 978
More actions
November 30, 2011 at 1:01 pm

#247004

I often perform reports which require joining the most recent of a many-to-one table joined back to the main. I've been using the below pattern for this type of query, but I was wondering if there is a better approach.
Any feedback would be greatly appreciated.
SELECT a.CustomerID, a.CustomerName, ..., b.CallDate, b.CallerID, b.CallDuration, ...
FROM Customers a
LEFT OUTER JOIN
CustomerCalls b
ON b.ID =
(
SELECT TOP 1 ID
FROM CustomerCalls
WHERE CustomerID = a.CustomerID
ORDER BY CallDate DESC
)

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

sturner SSC-Insane Points: 22411 More actions · Answer 1

I would probably use a cte, something like this:

;

with MostRecentCalls as (

select CustomerID, max(ID) as callID

from CustomerCalls

group by CustomerID

)

select a.CustomerID, a.CustomerName, ..., C.CallDate, C.CallerID, C.CallDuration, ...

FROM Customers a

LEFT OUTER JOIN

MostRecentCalls b on a.CustomerID = b.CustomerID

JOIN

CustomerCalls C on b.callID = C.ID

The probability of survival is inversely proportional to the angle of arrival.

Alexander Suprun SSCertifiable Points: 6253 More actions · Answer 2

Craig has already posted some test scripts with different solutions and comparison in another topic.

Check this out: http://www.sqlservercentral.com/Forums/FindPost1210550.aspx

Alex Suprun

jeffkretz SSC Eights! Points: 978 More actions · Answer 3

The CROSS APPLY solution is pretty interesting, I haven't messed with that type of query before.

I did some timed tests using Jeff Modon's million row test table, which I modified to have a million row main table and a 50 million row sub table with pricing information.

-- Build the test data

IF OBJECT_ID('tempdb..#TempMain') IS NULL BEGIN

SELECT TOP 1000000

RowNum = IDENTITY(INT,1,1),

SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),

SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

SomeHex12 = RIGHT(NEWID(),12)

INTO #TempMain

FROM Master.dbo.SysColumns t1,

Master.dbo.SysColumns t2

ALTER TABLE #TempMain

ADD CONSTRAINT PK_TempMain_RowNum PRIMARY KEY CLUSTERED (RowNum)

END

IF OBJECT_ID('tempdb..#TempSub') IS NULL BEGIN

SELECT TOP 50000000

RowNum = IDENTITY(INT,1,1),

Parent = ABS(CHECKSUM(NEWID()))%1000000+1,

SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)

INTO #TempSub

FROM Master.dbo.SysColumns t1,

Master.dbo.SysColumns t2

ALTER TABLE #TempSub

ADD CONSTRAINT PK_TempSub_RowNum PRIMARY KEY CLUSTERED (RowNum)

CREATE INDEX IX_TempSub_Parent ON #TempSub ( Parent, SomeDate ) INCLUDE ( RowNum )

END

I tried three different approaches:

Method 1 is the pattern I've been using, doing a left join from the main table and using a subquery to get the id of the sub table.

Method 2 uses a CTE to calculate the most recent date and then uses that join to both the main and sub table.

Method 3 uses the OUTER APPLY approach (as opposed to a CROSS APPLY, to ensure we have a main record when nothing exists in the sub table.

-- Throwaway variables.

DECLARE @RowNum1 int, @SomeLetters2 char(2), @SomeCSV varchar(80), @RowNum2 int, @SomeMoney money, @SomeDate datetime

-- Method 1

SET STATISTICS TIME ON

SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = b.RowNum, @SomeMoney = b.SomeMoney, @SomeDate = b.SomeDate

FROM #TempMain a

LEFT OUTER JOIN

#TempSub b

ON b.RowNum =

(

SELECT TOP 1 RowNum

FROM #TempSub

WHERE Parent = a.RowNum

ORDER BY SomeDate DESC

)

SET STATISTICS TIME OFF

-- Method 2

SET STATISTICS TIME ON

;WITH CTE AS (

SELECT Parent, MAX(SomeDate) as MaxDate

FROM #TempSub

GROUP BY Parent

)

SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = c.RowNum, @SomeMoney = c.SomeMoney, @SomeDate = c.SomeDate

FROM #TempMain a

LEFT OUTER JOIN

CTE b ON a.RowNum = b.Parent

LEFT OUTER JOIN

#TempSub c ON a.RowNum = c.RowNum AND b.MaxDate = c.SomeDate

SET STATISTICS TIME OFF

-- Method 3

SET STATISTICS TIME ON

SELECT @RowNum1 = a.RowNum, @SomeLetters2 = a.SomeLetters2, @SomeCSV = a.SomeCSV, @RowNum2 = b.RowNum, @SomeMoney = b.SomeMoney, @SomeDate = b.SomeDate

FROM #TempMain a

CROSS APPLY

(

SELECT TOP 1 RowNum, SomeMoney, SomeDate

FROM #TempSub

WHERE Parent = a.RowNum

ORDER BY SomeDate DESC

) b

SET STATISTICS TIME OFF

Results:

SQL Server Execution Times:

CPU time = 16005 ms, elapsed time = 1230 ms.

SQL Server Execution Times:

CPU time = 37159 ms, elapsed time = 3370 ms.

SQL Server Execution Times:

CPU time = 16285 ms, elapsed time = 1238 ms.

Method 1 is pretty solid, though the OUTER APPLY in Method 3 is almost the same.

Are there any other approaches worth giving a try? Or modification of the three queries I've tested above?