Select the most recent record closest to primary record date,tricky

Question

Post reply

Select the most recent record closest to primary record date,tricky

Digs

SSCertifiable

Points: 7333
More actions
February 19, 2012 at 4:05 pm

#257061

The data
IF OBJECT_ID('tempdb..#temp1') IS NOT NULL DROP TABLE #temp1;
CREATE TABLE #temp1
(
ProdID INT NOT NULL
,DayDate DATETIME NOT NULL
)
INSERT INTO #temp1 SELECT 5646546,'07/08/2009'
INSERT INTO #temp1 SELECT 5646546,'10/26/2009'
INSERT INTO #temp1 SELECT 5646546,'10/27/2010'
INSERT INTO #temp1 SELECT 5646546,'10/30/2010'
INSERT INTO #temp1 SELECT 5646546,'11/10/2010'
INSERT INTO #temp1 SELECT 5646546,'11/11/2011'
INSERT INTO #temp1 SELECT 5646546,'11/16/2011'
INSERT INTO #temp1 SELECT 5646546,'12/16/2011'
INSERT INTO #temp1 SELECT 5646546,'02/14/2012'
SELECT * FROM #temp1
ORDER BY DayDate DESC
IF OBJECT_ID('tempdb..#temp2') IS NOT NULL DROP TABLE #temp2;
CREATE TABLE #temp2
(
ProdID INT NOT NULL
,DayDate DATETIME NOT NULL
,Units INT
,Value MONEY
)
INSERT INTO #temp2 SELECT 5646546,'01/14/2011',504.00,3402.00
INSERT INTO #temp2 SELECT 5646546,'11/02/2010',564.00,3931.08
INSERT INTO #temp2 SELECT 5646546,'10/07/2010',492.00,3468.60
INSERT INTO #temp2 SELECT 5646546,'09/02/2010',288.00,2113.92
INSERT INTO #temp2 SELECT 5646546,'12/10/2009',835.00,6688.35
INSERT INTO #temp2 SELECT 5646546,'11/11/2009',252.00,2086.56
INSERT INTO #temp2 SELECT 5646546,'10/05/2009',444.00,3472.08
INSERT INTO #temp2 SELECT 5646546,'08/28/2009',444.00,3516.48
INSERT INTO #temp2 SELECT 5646546,'12/29/2008',504.00,3628.80
INSERT INTO #temp2 SELECT 5646546,'12/08/2008',228.00,1639.32
INSERT INTO #temp2 SELECT 5646546,'10/31/2008',228.00,1871.88
INSERT INTO #temp2 SELECT 5646546,'10/13/2008',384.00,3022.08
INSERT INTO #temp2 SELECT 5646546,'08/20/2008',384.00,2903.04
INSERT INTO #temp2 SELECT 5646546,'01/31/2008',147.00,1093.68
INSERT INTO #temp2 SELECT 5646546,'01/07/2008',378.00,2766.96
INSERT INTO #temp2 SELECT 5646546,'11/16/2007',406.00,2943.50
INSERT INTO #temp2 SELECT 5646546,'10/29/2007',210.00,1543.50
INSERT INTO #temp2 SELECT 5646546,'09/24/2007',406.00,2943.50
INSERT INTO #temp2 SELECT 5646546,'08/20/2007',224.00,1617.28
SELECT * FROM #temp2
ORDER BY DayDate DESC
IF OBJECT_ID('tempdb..#temp1') IS NOT NULL DROP TABLE #temp1;
IF OBJECT_ID('tempdb..#temp2') IS NOT NULL DROP TABLE #temp2;
Example, Looking at #temp1 record 5646546,'07/08/2009' I am try to find the nearest record prior to the '07/08/2009' in #temp2 by row . The result would be this one 5646546,'12/29/2008',504.00,3628.80)
So the JOIN between the #temp1 and #temp2 some how has to has to have a MAX on #temp2 DayDate
Any ideas how to do this

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

I guess the OUTER APPLY approach will return the data you're looking for:

SELECT *

FROM #temp1

OUTER APPLY

(

SELECT TOP 1 *

FROM #temp2

WHERE #temp2.ProdID =#temp1.ProdID

AND #temp2.DayDate < #temp1.DayDate

ORDER BY #temp2.DayDate DESC

)t2

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Digs SSCertifiable Points: 7333 More actions · Answer 2

Digs

SSCertifiable

Points: 7333

February 19, 2012 at 4:59 pm

#1449035

Thanks will learn more on this OUTER APPLY

Dwain Camps SSC Guru Points: 86978 More actions · Answer 3

I believe this solution will also work and if you compare the actual query plan to the CROSS APPLY solution, it is about 5x better.

SELECT *

FROM (

SELECT #temp1.ProdID As ProdID1

, #temp1.DayDate As DayDate1

, #temp2.ProdID As ProdID2

, #temp2.DayDate As DayDate2

, Units, Value

, ROW_NUMBER() OVER (PARTITION BY #temp1.ProdID, #temp1.DayDate ORDER BY #temp1.ProdID, #temp2.DayDate, #temp2.DayDate) As rk

FROM #temp1

INNER JOIN #temp2

ON #temp2.ProdID =#temp1.ProdID

AND #temp2.DayDate < #temp1.DayDate

) x WHERE rk = 1

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1004409 More actions · Answer 4

dwain.c (2/19/2012)
I believe this solution will also work

Unfortunately, it doesn't. (See the results below). That's why I always ask for test data which the OP very nicely provided on this thread.

and if you compare the actual query plan to the CROSS APPLY solution, it is about 5x better.

Heh... it does actually run about 5 times faster. But you shouldn't use the "Percent of Batch" to compare performance expectations. I can show you code that is exactly backwards for "Percent of Batch" compared to the actual performance. Also, things do tend to run faster when you haven't actually calculated the correct answer.

Here are the results of the OP's test data and your code.

ProdID1 DayDate1 ProdID2 DayDate2 Units Value rk

----------- ----------------------- ----------- ----------------------- ----------- --------------------- --------------------

5646546 2009-07-08 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2009-10-26 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2010-10-27 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2010-10-30 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2010-11-10 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2011-11-11 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2011-11-16 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2011-12-16 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

5646546 2012-02-14 00:00:00.000 5646546 2007-08-20 00:00:00.000 224 1617.28 1

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Dwain Camps SSC Guru Points: 86978 More actions · Answer 5

Jeff - Thanks for pointing that out. I was a bit rushed (had to run off to a meeting) so I didn't carefully check the result set.

However the concept was right. I just made two minor mistakes and the query plan cost is still 20% of the OUTER JOIN.

SELECT *

FROM (

SELECT #temp1.ProdID As ProdID1

, #temp1.DayDate As DayDate1

, #temp2.ProdID As ProdID2

, #temp2.DayDate As DayDate2

, Units, Value

, ROW_NUMBER() OVER (PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp1.ProdID, #temp1.DayDate DESC) As rk

FROM #temp1

INNER JOIN #temp2

ON #temp2.ProdID = #temp1.ProdID AND #temp2.DayDate < #temp1.DayDate

) x WHERE rk = 1

Results (OUTER JOIN is first):

Query plan cost compared (mine is the second):

However when running with STATISTICS TIME ON/STATISTICS IO ON, it appears mine runs just a tad longer elapsed time. I think you'd need to try both with a larger data set to be sure. As Jeff notes, the query plan cost doesn't always tell the entire story. I've seen plans costing 5% vs. 95% that run slower in practice. It is possible that the difference in run times (it was small) is just noise. Possibly once SQL Server gets smarter and has the query plan cached, the second would eventually run faster. The plan cost went from 25%/5% on my first try to 27%/5% on the above try.

Apologies to the original poster if I confused things. If you have a larger data set to run it on, I'd be interested to know which worked out best in the long run.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 6

After speculating about the timing with a larger record set, I got curious.

So I ran these two queries after creating/INSERTing into the two temp tables. They basicaly duplicate the existing part number 10000 times (all records) by adding a tally to the part number.

;WITH Nbrs_3( n ) AS ( SELECT 1 UNION SELECT 0 )

,Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 )

,Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 )

,Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 )

,Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 )

INSERT INTO #temp1

SELECT ProdID + n As ProdID, DayDate FROM #temp1

CROSS APPLY (SELECT ROW_NUMBER() OVER (ORDER BY n) As n FROM Nbrs) D

WHERE n <=10000

;WITH Nbrs_3( n ) AS ( SELECT 1 UNION SELECT 0 )

,Nbrs_2( n ) AS ( SELECT 1 FROM Nbrs_3 n1 CROSS JOIN Nbrs_3 n2 )

,Nbrs_1( n ) AS ( SELECT 1 FROM Nbrs_2 n1 CROSS JOIN Nbrs_2 n2 )

,Nbrs_0( n ) AS ( SELECT 1 FROM Nbrs_1 n1 CROSS JOIN Nbrs_1 n2 )

,Nbrs ( n ) AS ( SELECT 1 FROM Nbrs_0 n1 CROSS JOIN Nbrs_0 n2 )

INSERT INTO #temp2

SELECT ProdID + n As ProdID, DayDate, Units, Value FROM #temp2

CROSS APPLY (SELECT ROW_NUMBER() OVER (ORDER BY n) As n FROM Nbrs) D

WHERE n <=10000

With

SET STATISTICS IO ON

SET STATISTICS TIME ON

I got the below results, which is a pretty strong indicator that the PARTITION version of the query is a bit faster in practice:

Table 'Worktable'. Scan count 270027, logical reads 1639697, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#temp2______________________________________________________________________________________________________________00000000002F'. Scan count 1, logical reads 776, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#temp1______________________________________________________________________________________________________________00000000002E'. Scan count 1, logical reads 234, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 3604 ms, elapsed time = 3883 ms.

(90009 row(s) affected)

Table 'Worktable'. Scan count 90009, logical reads 771571, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#temp2______________________________________________________________________________________________________________00000000002F'. Scan count 1, logical reads 776, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#temp1______________________________________________________________________________________________________________00000000002E'. Scan count 1, logical reads 234, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:

CPU time = 1373 ms, elapsed time = 1571 ms.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Mark Cowne One Orange Chip Points: 26966 More actions · Answer 7

dwain.c (2/20/2012)
Jeff - Thanks for pointing that out. I was a bit rushed (had to run off to a meeting) so I didn't carefully check the result set.
However the concept was right. I just made two minor mistakes and the query plan cost is still 20% of the OUTER JOIN.
SELECT *
FROM (
SELECT #temp1.ProdID As ProdID1
, #temp1.DayDate As DayDate1
, #temp2.ProdID As ProdID2
, #temp2.DayDate As DayDate2
, Units, Value
, ROW_NUMBER() OVER (PARTITION BY #temp1.ProdID, #temp1.DayDate
ORDER BY #temp1.ProdID, #temp1.DayDate DESC) As rk
FROM #temp1
INNER JOIN #temp2
ON #temp2.ProdID = #temp1.ProdID AND #temp2.DayDate < #temp1.DayDate
) x WHERE rk = 1
Results (OUTER JOIN is first):
Query plan cost compared (mine is the second):
However when running with STATISTICS TIME ON/STATISTICS IO ON, it appears mine runs just a tad longer elapsed time. I think you'd need to try both with a larger data set to be sure. As Jeff notes, the query plan cost doesn't always tell the entire story. I've seen plans costing 5% vs. 95% that run slower in practice. It is possible that the difference in run times (it was small) is just noise. Possibly once SQL Server gets smarter and has the query plan cached, the second would eventually run faster. The plan cost went from 25%/5% on my first try to 27%/5% on the above try.
Apologies to the original poster if I confused things. If you have a larger data set to run it on, I'd be interested to know which worked out best in the long run.

Can't be right surely?

PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp1.ProdID, #temp1.DayDate DESC

More likely to be

PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp2.DayDate DESC

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Dwain Camps SSC Guru Points: 86978 More actions · Answer 8

Mark,

Good catch. It seems these 3 all produce the same results. Must be due to the ordering of the data coming into the table and coincidence.

PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp1.ProdID, #temp1.DayDate DESC

PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp1.ProdID DESC

PARTITION BY #temp1.ProdID, #temp1.DayDate

ORDER BY #temp1.ProdID, #temp2.DayDate DESC

I think the correction solution is the last one.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St