Optimize this Query

  • I've inherited an ETL project in SSIS which gets the latest records from our legacy database and imports them into an updated database. We have been avoiding using stored procedures. I prefer to keep the queries as simple as possible and do most of the transformation within the SSIS packages. Makes it easier to see each step and review what is happening I think but the problem is that the source query is getting overly-complex and running way too slow for my own standards. I found a way to do this using temp tables which I load first using the max functions and then in another step I write the query to join the temp tables together but that is also a bit of a maintenance headache. I'm sure there is a way to do this without temp tables and without stored procedures. Would be very interested in alternative methods as we are going to have to use this same template in several places.

    There are 6 tables and when we query them we have to make sure to get the latest record for each PublicID, EffectiveDate based on the Action-Count for the latest Process-Date.

    The initial query was written with the RowNumber function to get the latest record in each table but it was taking a long time so when I got it I tried changing it to:

    select gi.*, mn.AcctA, mn.AcctB

    from tblGeneral gi

    join (Select max([Process-Date]) maxPD, max([Action-Count]) maxAct, PublicID, EffectiveDate

    from tblGeneral

    group by PublicID, EffectiveDate) gen on gen.PublicID = gi.PublicID and gen.EffectiveDate = gi.EffectiveDate and gen.maxPD = gi.[Process-Date] and gen.maxAct = gi.[Action-Count]

    join tblMain mn on mn.PublicID = gi.PublicID

    join (select max([Process-Date]) maxPD, max(mn.[Action-Count]) maxAct, PublicID, EffectiveDate

    from tblMain

    where [Process-Date]<=gi.[Process-Date]

    group by PublicID, EffectiveDate) tmain on tmain.PublicID = mn.PublicID and tmain.EffectiveDate = mn.EffectiveDate and

    tmain.maxPD = mn.[Process-Date] and tmain.maxAct=mn.[Action-Count]

    It still processes too slowly so I next created a copy of each source table and then loaded those tables with just part of the query above and then I still have to join these tables together when I actually load the data into the final database tables.

    select gi.*

    from tblGeneral gi

    join (Select max([Process-Date]) maxPD, max([Action-Count]) maxAct, PublicID, EffectiveDate

    from tblGeneral

    group by PublicID, EffectiveDate) tbMax on gi.[PublicID]=tbMax.PublicID and gi.EffectiveDate = tbMax.EffectiveDate and

    gi.[Process-Date]=tbMax.maxPD and gi.[Action-Count]=tbMax.[maxAct)

    select t1.*

    from tblMain t1

    join (Select max([Process-Date]) maxPD, max([Action-Count]) maxAct, PublicID, EffectiveDate

    from tblMain

    group by PublicID, EffectiveDate) tbMax on t1.[PublicID]=tbMax.PublicID and t1.EffectiveDate = tbMax.EffectiveDate and

    t1.[Process-Date]=tbMax.maxPD and t1.[Action-Count]=tbMax.[maxAct)

  • SSIS is not going to perform raw data processing work any faster than the SQL Server database engine. In fact, it will usually be slower.

    Therefore trying to move everything into SSIS and avoid the use of procs and temp tables is not going to get you the performance gains you are looking for, in my opinion.

    Personally, I prefer to keep my SSIS packages as 'clean' as possible. They move data from A to B, do a few lookups and create a few derived columns along the way (ie, use non-blocking transformations only).

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • If you post the actual execution plans for the queries you detailed, it will be helpful to others.

    The absence of evidence is not evidence of absence.
    Martin Rees

    You can lead a horse to water, but a pencil must be lead.
    Stan Laurel

  • kenksoftware (2/24/2016)


    I've inherited an ETL project in SSIS which gets the latest records from our legacy database and imports them into an updated database. We have been avoiding using stored procedures. I prefer to keep the queries as simple as possible and do most of the transformation within the SSIS packages. Makes it easier to see each step and review what is happening I think but the problem is that the source query is getting overly-complex and running way too slow for my own standards. I found a way to do this using temp tables which I load first using the max functions and then in another step I write the query to join the temp tables together but that is also a bit of a maintenance headache.

    Phil hit the nail on the head. I also say that your statement above has identified both the problem and the solution... the solution being to use well formed stored procedures and keeping SSIS "clean" as Phil suggested. The idea of using temp tables to "Divide'n'Conquer" works better in stored procs than anywhere else.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually, I was going to suggest using rownumber until I re-read and realised that's what you started with. 😀

    So my suggestion is that you return to rownumber and look instead at what indexes are (or are not) in place.

  • Apologies... David's post reminded me that I didn't actually answer the question as to how to make the queries in the original post faster.

    For these...

    select gi.*

    from tblGeneral gi

    join (Select max([Process-Date]) maxPD, max([Action-Count]) maxAct, PublicID, EffectiveDate

    from tblGeneral

    group by PublicID, EffectiveDate) tbMax on gi.[PublicID]=tbMax.PublicID and gi.EffectiveDate = tbMax.EffectiveDate and

    gi.[Process-Date]=tbMax.maxPD and gi.[Action-Count]=tbMax.[maxAct)

    select t1.*

    from tblMain t1

    join (Select max([Process-Date]) maxPD, max([Action-Count]) maxAct, PublicID, EffectiveDate

    from tblMain

    group by PublicID, EffectiveDate) tbMax on t1.[PublicID]=tbMax.PublicID and t1.EffectiveDate = tbMax.EffectiveDate and

    t1.[Process-Date]=tbMax.maxPD and t1.[Action-Count]=tbMax.[maxAct)

    ... the first thing I'd do is as David suggested... try the ROW_NUMBER() route to find your MAX rows. Also, interrogate the actual execution plans and see if an index or two might help. If indexes are already in place and being used, I'd deep dive them to make sure that Index Seeks aren't being executed more than once, which would dictate that a better index may have to be created (don't kill the old one unless you're absolutely sure it's not being effectively used by other things).

    If even that doesn't help, then it's time to try the "Divide'n'Conquer" method of using a Temp Table created by a SELECT INTO (and then possibly indexed but usually not) to materialize and isolate the derived table SELECT located in the FROM clause.

    I suppose this could all be done in SSIS blocks but I'd still use a stored procedure for this so that if it's needed somewhere else, I wouldn't have to restamp the same wheel out.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply