October 18, 2012 at 9:44 am
Hi all,
Got the following problem situation:
CREATE TABLE #Test
(
ID INT,
Date DATETIME
)
CREATE TABLE #Test2
(
ID INT,
Date DATETIME,
Val INT
)
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')
INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)
Basically, I have one table that has a value and a date. For each record in that table, I want to find the value in the second table which corresponds to the date that is closest to the date from the first table.
Here's the solution that I've employed:
WITH cte AS
(
SELECT
t.ID,
t.Date,
p.Val,
ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum
FROM #Test t
LEFT JOIN #Test2 p ON t.ID = p.ID
)
SELECT
ID,
Date,
Val
FROM cte
WHERE rowNum = 1
This worked fine when I had only a few hundred records in my first table. The problem is that when the number of records goes up, the amount of time increases very sharply. Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.
Any suggestions? As it stands, there is an index on the second table, on the field ID, which includes Date and Val as included columns. On the first table, there's a clustered index on the ID field.
October 18, 2012 at 10:15 am
... Right now, that cte is returning me about a million records, and is taking around 30 seconds to run. I need to bring it down to around 2-3 seconds ideally.
...
You need to find where the most of time is spent. I would guess that is spend on creating product of JOIN between your two tables on ID's.
Can you measure how long is takes just to join your two tables?
SELECT t1.*, t2.*
FROM T1 JOIN T2 ON T1.ID=T2.ID
October 18, 2012 at 10:33 am
It's certainly that part which is taking the most time. I just don't know how else to go about it other than using the approach I've listed.
October 19, 2012 at 12:36 am
You can do it with a subquery but I tried that and it's slower. Your query is pretty good.
But there is another approach if you're running more than one processor and all you are concerned with is elapsed time. Try this:
SELECT ID, Date, Val
FROM #Test a
CROSS APPLY (
SELECT TOP 1 Val
FROM #Test2 b
WHERE a.ID = b.ID
ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b
I put it into a test harness with ~3,000,000 rows in #Test and ~5,000,000 rows in #Test2:
CREATE TABLE #Test
(
ID INT,
Date DATETIME
)
CREATE CLUSTERED INDEX TEST ON #Test (ID, Date);
CREATE TABLE #Test2
(
ID INT,
Date DATETIME,
Val INT
)
CREATE INDEX TEST2 ON #Test2 (ID) INCLUDE(Date, Val)
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-01')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-02')
INSERT INTO #Test (ID, Date) VALUES (1, '2012-09-04')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-05')
INSERT INTO #Test (ID, Date) VALUES (2, '2012-09-08')
INSERT INTO #Test (ID, Date) VALUES (3, '2012-09-15')
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-01', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-02', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-03', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-04', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (1, '2012-09-05', 3)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-05', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (2, '2012-09-06', 4)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-08', 2)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-09', 5)
INSERT INTO #Test2 (ID, Date, Val) VALUES (3, '2012-09-20', 5)
;WITH Tally (n) AS (
SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #Test
SELECT ID + 10*n, Date
FROM #Test
CROSS APPLY Tally
;WITH Tally (n) AS (
SELECT TOP 500000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #Test2
SELECT ID + 10*n, Date, VAL + n
FROM #Test2
CROSS APPLY Tally
PRINT '-- CTE query'
SET STATISTICS TIME ON
;WITH cte AS
(
SELECT
t.ID,
t.Date,
p.Val,
ROW_NUMBER() OVER (PARTITION BY t.ID, t.Date ORDER BY ABS(DATEDIFF(dd, t.Date, p.Date))) AS rowNum
FROM #Test t
LEFT JOIN #Test2 p ON t.ID = p.ID
)
SELECT
ID,
Date,
Val
FROM cte
WHERE rowNum = 1
SET STATISTICS TIME OFF
PRINT '-- CROSS APPLY query'
SET STATISTICS TIME ON
SELECT ID, Date, Val
FROM #Test a
CROSS APPLY (
SELECT TOP 1 Val
FROM #Test2 b
WHERE a.ID = b.ID
ORDER BY ABS(DATEDIFF(day, a.Date, b.Date))) b
SET STATISTICS TIME OFF
DROP TABLE #Test, #Test2
And these are the results I got:
(3000000 row(s) affected)
(5000000 row(s) affected)
-- CTE query
(3000006 row(s) affected)
SQL Server Execution Times:
CPU time = 35474 ms, elapsed time = 79574 ms.
-- CROSS APPLY query
(3000006 row(s) affected)
SQL Server Execution Times:
CPU time = 77875 ms, elapsed time = 52099 ms.
Notice how CPU is higher than elapsed time? That's because SQL is parallelizing the query so you end up burning more CPU but cutting down your elapsed time.
Perhaps this will work for you.
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply