How do I get rid of this correlated sub-query and avoid the RBAR?

  • I have two tables in a SQL 2005 database. One has a row for each data item and day in a period of time; the other has rows for only selected days. I've written a query that joins the two so that each row from the first table is matched to the row in the second table for the most recent day at or before that row's day.

    I know that sounds confusing but it's really pretty straightforward. See the sample queries below. I think they'll illustrate what I am trying to say much better than I can.

    Anyway, the query returns exactly the result I'm looking for, but it performs horribly. There are about five million rows in one table and several hundred thousand in the sparse table. I believe the bottleneck is the correlated sub-query; I think it is creating a form of RBAR. Is there any way to write a similar query without the correlated sub-query? I think the answer lies in windowing functions, but I cannot seem to get them to work the way I want.

    Any help would be greatly appreciated.

    if object_id(N'tempdb..#sparseData') is not null
        drop table #sparseData

    if object_id(N'tempdb..#drivingTable') is not null
        drop table #drivingTable

    select 1 as keyid, convert(datetime, '2008-01-01') as dateId
    into #drivingTable
    union
    select 2, '2008-01-02'
    union
    select 3, '2008-01-03'
    union
    select 2, '2008-01-04'
    union
    select 4, '2008-01-04'
    union
    select 5, '2008-01-05'
    union
    select 6, '2008-01-06'
    union
    select 7, '2008-01-07'
    union
    select 8, '2008-01-08'
    union
    select 9, '2008-01-09'
    union
    select 10, '2008-01-10'

    select 3 as testVal, convert(datetime, '2008-01-01') as dateId
    into #sparseData
    union
    select 10, '2008-01-05'
    union
    select 7, '2008-01-08'

    select dt.*, sd.dateid, sd.testVal
    from #drivingTable dt
    inner join #sparseData sd
        on sd.dateid = (
            select max(dateId)
            from #sparseData
            where dateId <= dt.dateid
        )

  • The only way I can think of to do this more efficiently would be to select the max(sd.DateID) for each dt.DateID in one query, then join that to the sparse data table for the test value.

    Something like:

    create table #DT (
        DTDateID datetime,
        SDDateID datetime)

    insert into #dt (dtdateid, sddateid)
    select dt.dateid, max(sd.dateid)
    from drivingTable dt
    inner join sparseData sd
        on dt.dateid >= sd.dateid
    group by dt.dateid

    select dt.*, sd.testVal
    from #dt dt
    inner join sparseData sd
        on dt.sddateid = sd.dateid

    You could play around with adding indexes to the #DT temp table to get better performance, depending on the number of rows going into it. You might also need to include the "KeyID" from the driving table in it, depending on how that works with your data (see the sketch below).
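    For example, here is a minimal sketch of that idea against the sample temp tables posted above (the extra KeyID column and the clustered index choice are assumptions for illustration, not something from the original post):

    -- Hypothetical variant of the staging approach: carry KeyID through and
    -- index the temp table on the column used to join back to the sparse table.
    create table #DT (
        KeyID int,
        DTDateID datetime,
        SDDateID datetime)

    insert into #DT (KeyID, DTDateID, SDDateID)
    select dt.keyid, dt.dateid, max(sd.dateid)
    from #drivingTable dt
    inner join #sparseData sd
        on dt.dateid >= sd.dateid
    group by dt.keyid, dt.dateid

    -- Clustered index on SDDateID (assumption: cheap enough for the row count)
    create clustered index ix_DT_SDDateID on #DT (SDDateID)

    select dt.KeyID, dt.DTDateID, sd.testVal
    from #DT dt
    inner join #sparseData sd
        on dt.SDDateID = sd.dateid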

    Try that, see if it improves things for you.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The correlated sub-query is definitely the culprit, though there may not be an alternative. Try the last SQL statement below.

    if object_id(N'sparseData') is not null drop table sparseData
    if object_id(N'drivingTable') is not null drop table drivingTable

    create table drivingTable
    (KeyId integer not null
    ,drivingDate datetime not null
    ,constraint drivingTable_P primary key (KeyId, drivingDate)
    )

    create table sparseData
    (testVal integer not null
    ,sparseDate datetime not null
    ,constraint sparseData_P primary key (sparseDate)
    )
    go

    select drivingTable.*, sparseData.*
    from drivingTable
    join sparseData
        on sparseData.sparseDate = (
            select max(sd.sparseDate)
            from sparseData sd
            where sd.sparseDate <= drivingTable.drivingDate
        )

    select drivingTable.*
        , sparseData.*
    from drivingTable
    join (select drivingTable.KeyId
            , drivingTable.drivingDate
            , MAX(sparseDate) as sparseDate
          from drivingTable
          join sparseData
            on sparseData.sparseDate <= drivingTable.drivingDate
          group by drivingTable.KeyId
            , drivingTable.drivingDate
         ) as DSL
        on DSL.KeyId = drivingTable.KeyId
        and DSL.drivingDate = drivingTable.drivingDate
    join sparseData
        on sparseData.sparseDate = DSL.sparseDate

    SQL = Scarcely Qualifies as a Language

  • Thanks! Both of these suggestions do what I need. I look forward to playing around with them and examining their performance relative to the correlated sub-query. I'm sure they are much faster.

  • rhandloff (5/9/2008)


    Thanks! Both of these suggestions do what I need. I look forward to playing around with them and examining their performance relative to the correlated sub-query. I'm sure they are much faster.

    The simple solution is to use the row_number function. See below.

    select keyid,
        dateid1,
        dateid2,
        testval
    from (
        select dt.keyid,
            dt.dateid as dateid1,
            sd.dateid as dateid2,
            sd.testVal,
            row_number() over (partition by dt.keyid, dt.dateid
                               order by sd.dateid desc) as row
        from #drivingTable dt
        inner join #sparseData sd
            on sd.dateid <= dt.dateid
    ) as a
    where row = 1

  • Another option:

    select * from #drivingTable a cross apply
        (select top 1 * from #sparseData where dateId <= a.dateid order by dateId desc) b

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • If you add an index (clustered or not) to #sparseData...

    create clustered index ix_sd1 on #sparseData (dateId)

    ...the cross apply has the best performance for the small dataset you give (at least on my machine). Please let us know how the various options perform on your full data (with/without indexes)... 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Isn't CROSS APPLY just another name for "correlated sub-query"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/12/2008)


    Isn't CROSS APPLY just another name for "correlated sub-query"?

    😛

    It sure is, but if it's the fastest method, who cares? 🙂

    It's something else to try, anyway, and I'm interested how it fares in this case.

    More info here, although I'm sure you (Jeff) know about this...

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Thanks for all the solutions! I don't know if I will have time to adequately test them all, as I'm under a pretty intensive deadline, but from very cursory testing each of these solutions performs much better than my original query. If I have time I will report back which one worked best.

  • rhandloff (5/12/2008)


    Thanks for all the solutions! I don't know if I will have time to adequately test them all, as I'm under a pretty intensive deadline, but from very cursory testing each of these solutions performs much better than my original query. If I have time I will report back which one worked best.

    Thanks rhandloff - and good luck! 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • RyanRandall (5/12/2008)


    Jeff Moden (5/12/2008)


    Isn't CROSS APPLY just another name for "correlated sub-query"?

    😛

    It sure is, but if it's the fastest method, who cares? 🙂

    It's something else to try, anyway, and I'm interested how it fares in this case.

    More info here, although I'm sure you (Jeff) know about this...

    http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx

    Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.

    I agree... if Cross Apply is the most performant, then use it. But Cross Apply typically has the same performance as a correlated subquery. If the correlated subquery is non-performant, the Cross Apply won't be any better. Both work seemingly well in the face of small rowcounts and both can work very well in the face of large rowcounts... sometimes. It may be so, in this case... the triangular join in both is severely constrained by the TOP 1... dunno if there may be something faster in this case because I've not studied the problem yet.
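    For illustration, a minimal sketch of the two shapes side by side against the sample temp tables from earlier in the thread (this just restates code already posted, not a new method; the optimizer may well produce very similar plans for both):

    -- Correlated TOP 1 sub-query in the select list: returns one column and
    -- keeps driving rows with no earlier sparse row (testVal comes back NULL).
    select dt.*,
           (select top 1 sd.testVal
            from #sparseData sd
            where sd.dateId <= dt.dateId
            order by sd.dateId desc) as testVal
    from #drivingTable dt

    -- CROSS APPLY form posted above: can return several columns at once and
    -- drops unmatched driving rows (use OUTER APPLY to keep them).
    select dt.*, b.dateId, b.testVal
    from #drivingTable dt
    cross apply (select top 1 *
                 from #sparseData sd
                 where sd.dateId <= dt.dateId
                 order by sd.dateId desc) b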

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/12/2008)


    Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.

    LOL - yes I know it's not your blog. I guess you missed the "Welcome to my weblog. My name is Jeff Smith, and I am a software developer in Boston, MA." bit on the left 😀

    We agree on the rest of it... 🙂

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • RyanRandall (5/12/2008)


    Jeff Moden (5/12/2008)


    Heh... not my blog, Ryan. I think that's Jeff Smith's, but I'm not sure.

    LOL - yes I know it's not your blog. I guess you missed the "Welcome to my weblog. My name is Jeff Smith, and I am a software developer in Boston, MA." bit on the left 😀

    We agree on the rest of it... 🙂

    Yep... not enough coffee.......... 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/12/2008)


    Isn't CROSS APPLY just another name for "correlated sub-query"?

    Actually, in the right (wrong) circumstances, Cross Apply can be worse than a regular correlated sub-query. It depends on the function being called. A correlated sub-query (in most cases) pretty much has to be at least set-based in itself, while a multi-statement table-valued function can have pretty much unlimited inefficiency.
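    A hypothetical illustration of that point (this function is made up for the example and assumes the permanent sparseData table from the DDL posted above; it is not from the original thread):

    -- Hypothetical multi-statement TVF: it builds and fills a table variable on
    -- every call, so CROSS APPLY has to run the whole body once per driving row.
    create function dbo.fn_LatestSparse (@asOf datetime)
    returns @r table (sparseDate datetime, testVal int)
    as
    begin
        insert into @r (sparseDate, testVal)
        select top 1 sparseDate, testVal
        from dbo.sparseData
        where sparseDate <= @asOf
        order by sparseDate desc
        return
    end

    -- Used like this, the whole function body runs once per driving row, which
    -- is the "unlimited inefficiency" described above:
    --   select dt.*, f.*
    --   from dbo.drivingTable dt
    --   cross apply dbo.fn_LatestSparse(dt.drivingDate) f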

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
