May 9, 2008 at 1:15 pm
I have two tables in a SQL 2005 database. One has a row for each data item and day in a period of time, and the other has rows for only selected days. I've written a query which joins the two so that each row from the first table is matched to the row in the second table whose date is the most recent one at or before the first row's date.
I know that sounds confusing but it's really pretty straightforward. See the sample queries below. I think they'll illustrate what I am trying to say much better than I can.
Anyway, the query returns the exact result for which I'm looking but it performs horribly. There are about five million rows in one table and several hundred thousand in the sparse table. I believe the bottleneck is the correlated sub-query. I think it is creating a form of RBAR. Is there any way to write a similar query without the correlated sub-query? I think the answer lies in windowing functions but I cannot seem to get them to work the way I want.
Any help would be greatly appreciated.
if object_id(N'tempdb..#sparseData') is not null
drop table #sparseData
if object_id(N'tempdb..#drivingTable') is not null
drop table #drivingTable
select 1 as keyid, convert(datetime,'2008-01-01') as dateId
into #drivingTable
union
select 2, '2008-01-02'
union
select 3, '2008-01-03'
union
select 2, '2008-01-04'
union
select 4, '2008-01-04'
union
select 5, '2008-01-05'
union
select 6, '2008-01-06'
union
select 7, '2008-01-07'
union
select 8, '2008-01-08'
union
select 9, '2008-01-09'
union
select 10, '2008-01-10'
select 3 as testVal, convert(datetime,'2008-01-01') as dateId
into #sparseData
union
select 10, '2008-01-05'
union
select 7, '2008-01-08'
select dt.*, sd.dateId, sd.testVal
from #drivingTable dt
inner join #sparseData sd
    on sd.dateId = (
        select max(dateId)
        from #sparseData
        where dateId <= dt.dateId
    )
May 9, 2008 at 2:12 pm
The only way I can think of to do this more efficiently would be to select the max(sd.DateID) for each dt.DateID in one query, then join that to the sparse data table for the test value.
Something like:
create table #DT (
    DTDateID datetime,
    SDDateID datetime)

insert into #DT (DTDateID, SDDateID)
select dt.dateId, max(sd.dateId)
from #drivingTable dt
inner join #sparseData sd
    on dt.dateId >= sd.dateId
group by dt.dateId

select dt.*, sd.testVal
from #DT dt
inner join #sparseData sd
    on dt.SDDateID = sd.dateId
You could play around with adding indexes to the #DT temp table to get better performance out of it, depending on the number of rows that are going into it. You might also need to include the "KeyID" from the driving table in it, depending on how that works with your data.
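For example, something along these lines (just a sketch; the index name is made up, and clustered vs. nonclustered is worth testing against your row counts):

-- index the column used to join back to the sparse data table
create clustered index ix_DT_SDDateID on #DT (SDDateID)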
Try that, see if it improves things for you.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
May 9, 2008 at 2:17 pm
The correlated sub-query is definitely the culprit, and there may not be an alternative, but try the last SQL statement below.
if object_id(N'sparseData') is not null drop table sparseData
if object_id(N'drivingTable') is not null drop table drivingTable

create table drivingTable
    (KeyId integer not null
    ,drivingDate datetime not null
    ,constraint drivingTable_P primary key (KeyId, drivingDate)
    )

create table sparseData
    (testVal integer not null
    ,sparseDate datetime not null
    ,constraint sparseData_P primary key (sparseDate)
    )
go

select drivingTable.*, sparseData.*
from drivingTable
join sparseData
    on sparseData.sparseDate = (
        select max(sd.sparseDate)
        from sparseData sd
        where sd.sparseDate <= drivingTable.drivingDate
    )

select drivingTable.*
    , sparseData.*
from drivingTable
join (select drivingTable.KeyId
        , drivingTable.drivingDate
        , MAX(sparseData.sparseDate) as sparseDate
      from drivingTable
      join sparseData
        on sparseData.sparseDate <= drivingTable.drivingDate
      group by drivingTable.KeyId
        , drivingTable.drivingDate
     ) as DSL
    on DSL.KeyId = drivingTable.KeyId
    and DSL.drivingDate = drivingTable.drivingDate
join sparseData
    on sparseData.sparseDate = DSL.sparseDate
SQL = Scarcely Qualifies as a Language
May 9, 2008 at 3:04 pm
Thanks! Both of these suggestions do what I need. I look forward to playing around with them and examining their performance relative to the correlated sub-query. I'm sure they are much faster.
May 9, 2008 at 11:41 pm
rhandloff (5/9/2008)
Thanks! Both of these suggestions do what I need. I look forward to playing around with them and examining their performance relative to the correlated sub-query. I'm sure they are much faster.
The simple solution is to use the row_number function. See below.
select keyid,
    dateid1,
    dateid2,
    testVal
from (
    select dt.keyid,
        dt.dateId as dateid1,
        sd.dateId as dateid2,
        sd.testVal,
        row_number()
            over (partition by dt.keyid, dt.dateId
                  order by sd.dateId desc) as row
    from #drivingTable dt
    inner join #sparseData sd
        on sd.dateId <= dt.dateId
) as a
where row = 1
May 12, 2008 at 2:10 am
Another option:
select * from #drivingTable a cross apply
(select top 1 * from #sparseData where dateId <= a.dateid order by dateId desc) b
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 12, 2008 at 2:26 am
If you add an index (clustered or not) to #sparseData...
create clustered index ix_sd1 on #sparseData (dateId)
...the cross apply has the best performance for the small dataset you give (at least on my machine). Please let us know how the various options perform on your full data (with/without indexes)... 🙂
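In case it helps with the comparison, here's one way to measure each option (just a sketch, assuming the temp tables and queries from the earlier posts are still around):

set statistics io on
set statistics time on

-- run each candidate query between the on/off statements, e.g. the cross apply version
select * from #drivingTable a cross apply
(select top 1 * from #sparseData where dateId <= a.dateid order by dateId desc) b

set statistics io off
set statistics time off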
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 12, 2008 at 5:16 am
Isn't CROSS APPLY just another name for "correlated sub-query"?
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2008 at 5:27 am
Jeff Moden (5/12/2008)
Isn't CROSS APPLY just another name for "correlated sub-query"?
😛
It sure is, but if it's the fastest method, who cares? 🙂
It's something else to try, anyway, and I'm interested how it fares in this case.
More info here, although I'm sure you (Jeff) know about this...
http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 12, 2008 at 6:18 am
Thanks for all the solutions! I don't know if I will have time to adequately test them all, as I'm under a pretty intense deadline, but just from very cursory testing, each of these solutions performs much better than my original query. If I have time I will report back which one worked best.
May 12, 2008 at 6:22 am
rhandloff (5/12/2008)
Thanks for all the solutions! I don't know if I will have time to adequately test them all, as I'm under a pretty intense deadline, but just from very cursory testing, each of these solutions performs much better than my original query. If I have time I will report back which one worked best.
Thanks rhandloff - and good luck! 🙂
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 12, 2008 at 6:23 am
RyanRandall (5/12/2008)
Jeff Moden (5/12/2008)
Isn't CROSS APPLY just another name for "correlated sub-query"?😛
It sure is, but if it's the fastest method, who cares? 🙂
It's something else to try, anyway, and I'm interested how it fares in this case.
More info here, although I'm sure you (Jeff) know about this...
http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx
Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.
I agree... if Cross Apply is the most performant, then use it. But Cross Apply typically has the same performance as a correlated subquery; if the correlated subquery is non-performant, the Cross Apply won't be any better. Both work seemingly well in the face of small rowcounts, and both can work very well in the face of large rowcounts... sometimes. It may be so in this case... the triangular join in both is severely constrained by the TOP 1... dunno if there may be something faster in this case because I've not studied the problem yet.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2008 at 6:31 am
Jeff Moden (5/12/2008)
Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.
LOL - yes I know it's not your blog. I guess you missed the "Welcome to my weblog. My name is Jeff Smith, and I am a software developer in Boston, MA." bit on the left 😀
We agree on the rest of it... 🙂
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
May 12, 2008 at 6:37 am
RyanRandall (5/12/2008)
Jeff Moden (5/12/2008)
Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.
LOL - yes I know it's not your blog. I guess you missed the "Welcome to my weblog. My name is Jeff Smith, and I am a software developer in Boston, MA." bit on the left 😀
We agree on the rest of it... 🙂
Yep... not enough coffee.......... 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
May 12, 2008 at 7:49 am
Jeff Moden (5/12/2008)
Isn't CROSS APPLY just another name for "correlated sub-query"?
Actually, in the right (wrong) circumstances, Cross Apply can be worse than a regular correlated sub-query. It depends on the function being called. A correlated sub-query (in most cases) pretty much has to be at least set-based in itself, while a multi-statement table-valued function can have pretty much unlimited inefficiency.
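For illustration only (the function name is made up, using the permanent tables from the earlier post), this is the kind of pattern I mean - a multi-statement function that does its own work on every call, and gets called once per outer row by the cross apply:

create function dbo.fn_LatestSparse (@asOf datetime)
returns @result table (sparseDate datetime, testVal integer)
as
begin
    -- each call runs its own lookup; nothing stops this body from being far more expensive
    insert into @result (sparseDate, testVal)
    select top 1 sparseDate, testVal
    from sparseData
    where sparseDate <= @asOf
    order by sparseDate desc
    return
end
go

select dt.*, f.sparseDate, f.testVal
from drivingTable dt
cross apply dbo.fn_LatestSparse(dt.drivingDate) as f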
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon