Performance issues on huge table (40 columns, 500M rows)

  • I am having severe performance issues on a table I have inherited (it's never MY fault! :-D)

    The table holds balances and some other details for approx 3.7M accounts and is written to 4 times a week.

    Each time data is written to the table some more accounts are added, but as far as I can tell we never stop writing balances, even when an account is cleared.

    The columns of interest are AccountID, UpdateDate, SubBal-A, SubBal-B, ... SubBal-K.

    Currently the table has an index on AccountID only and an index on UpdateDate only.

    Both indexes are non-clustered.

    I have a query which performs well on my development data (200K accounts, 4 files of balances) with an additional unique index on Acc_no, UpdateDate - time to query approx 11 seconds.

    This also performs adequately on a separate database of live data (3.7M accounts, 4 files of balances) with the extra index - time to query approx 7 minutes - which I will live with

    On the production server it has so far been running for 52 minutes and has not returned any records.

    The estimated query plan indicates that it is doing a table scan of all 476M records, which is clearly what is killing it!

    I am trying to find all accounts for a given date where the account either did not exist on the previous file or one or more of the subBals has changed.

    The script is below: tables and fields have been anonymised and the number of comparators reduced for brevity.

    DECLARE @Date AS datetime;
    SET @Date = '20120330';

    WITH DATE_RANK AS
    (
        -- Get the distinct processing dates
        SELECT DISTINCT
            update_date
        FROM Balances
    ),
    DATE_RANK2 AS
    (
        -- Rank the processing dates so that we can link each date to the next processing date
        SELECT
            update_date,
            RANK() OVER (ORDER BY update_date DESC) AS 'Rank'
        FROM DATE_RANK
    ),
    CTE AS
    (
        -- Get the records for today and join to those on the previous run:
        -- use DATE_RANK2 to do this so that I can find the previous date even where dates are non-contiguous.
        -- D = today's detail records
        -- P = previous run's detail records
        SELECT
            D.Acc_no,
            D.Update_date,
            P.Update_date AS 'Prev_update_date',
            D.SubBalA,
            D.SubBalB,
            D.SubBalC,
            D.SubBalD
        FROM Balances D
        JOIN Date_rank2 DRD
            ON DRD.update_date = D.Update_date
        LEFT JOIN Date_rank2 DRP            -- get the previous date
            ON DRP.rank = DRD.rank + 1
        LEFT JOIN Balances P                -- get the records for the previous date
            ON P.Acc_no = D.Acc_no
            AND P.Update_date = DRP.Update_date
        WHERE D.update_date = @Date
        AND
        (
            -- Compare current to previous (if no prev, then at least one SubBal is not zero)
            D.SubBalA <> COALESCE(P.SubBalA, 0) OR
            D.SubBalB <> COALESCE(P.SubBalB, 0) OR
            D.SubBalC <> COALESCE(P.SubBalC, 0) OR
            D.SubBalD <> COALESCE(P.SubBalD, 0)
        )
    )
    -- Insert the changed recordset into a new, shorter table that only contains
    -- records which differ from their predecessor
    INSERT INTO ChangedBals
    (
        Acc_no,
        update_date,
        SubBalA,
        SubBalB,
        SubBalC,
        SubBalD
    )
    SELECT
        Acc_no,
        update_date,
        SubBalA,
        SubBalB,
        SubBalC,
        SubBalD
    FROM CTE;

    I am thinking that the cause of the table scan is that it has to compare the balance fields, which are not part of any index. Would performance improve if I put them into a covering index? A new index will take many hours to build due to the size of the table, so I don't want to waste my time with it (plus it requires a change request). Or do I just need my extra index on Acc_no/UpdateDate?
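
    For illustration, the covering index I have in mind would be something along these lines (the index name is made up; columns as per the anonymised script above):

        CREATE NONCLUSTERED INDEX IX_Balances_Date_Acc_Covering
            ON Balances (Update_date, Acc_no)
            INCLUDE (SubBalA, SubBalB, SubBalC, SubBalD);

    Keyed on Update_date first, the WHERE D.update_date = @Date filter could seek, and the included SubBal columns would mean the comparison never has to touch the base table.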

    Or something else (i.e. an entirely new approach)? The plan is that once I have the relevant records out of the gigantic table we can truncate it, because all the info we need will be in ChangedBals.

  • Your scan is because your WHERE clause is not sargable:

        D.SubBalA <> COALESCE(P.SubBalA, 0) OR
        D.SubBalB <> COALESCE(P.SubBalB, 0) OR
        D.SubBalC <> COALESCE(P.SubBalC, 0) OR
        D.SubBalD <> COALESCE(P.SubBalD, 0)

    You have both an inequality AND a function; both of these are non-sargable.
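
    To illustrate (using the column names from the script above; this is just a sketch, not a drop-in fix):

        -- Non-sargable: the column is wrapped in a function, and <> is an
        -- inequality, so the optimizer cannot seek an index on SubBalA
        WHERE D.SubBalA <> COALESCE(P.SubBalA, 0)

        -- Sargable: a bare column compared with = to a variable or constant
        -- allows an index seek
        WHERE D.update_date = @Date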

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Sean,

    Thanks for that. I can rewrite it as

    AND NOT (
        D.SubBalA = COALESCE(P.SubBalA, 0) AND
        D.SubBalB = COALESCE(P.SubBalB, 0) AND
        ...
    )

    That sort of gets rid of the inequality by reversing the logic, but I can't see how to get rid of the COALESCE without using CASE expressions. I could use ISNULL() instead, but it is still a function.

  • Grant Fritchey (4/3/2012)


    And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    Why? I'm not saying it won't, but what makes you think it might? It is only on the CTE, which has a result set of 143 records and is simply a way to find records with two adjacent (but not always contiguous) dates.

    Would I get better performance by copying the relevant fields and records to temp tables and putting indexes on them? I think the copy of the records will take about 20 minutes, but at least it will use an index, and then it will only match 3.7M x 3.71M rows rather than 476M x 3.7M.
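
    A rough sketch of what I mean (the temp table and index names are made up; @PrevDate stands for the previous run's date, which I would get from the rank CTEs):

        DECLARE @Date datetime;
        DECLARE @PrevDate datetime;
        SET @Date = '20120330';
        SET @PrevDate = '20120323';  -- made-up previous run date

        -- Copy only the two runs being compared into temp tables
        SELECT Acc_no, Update_date, SubBalA, SubBalB, SubBalC, SubBalD
        INTO #Today
        FROM Balances
        WHERE Update_date = @Date;

        SELECT Acc_no, Update_date, SubBalA, SubBalB, SubBalC, SubBalD
        INTO #Prev
        FROM Balances
        WHERE Update_date = @PrevDate;

        -- Index the temp tables so the comparison join can seek on Acc_no
        CREATE UNIQUE CLUSTERED INDEX IX_Today ON #Today (Acc_no);
        CREATE UNIQUE CLUSTERED INDEX IX_Prev ON #Prev (Acc_no);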

  • aaron.reese (4/3/2012)


    Sean,

    Thanks for that. I can rewrite it as

    AND NOT (
        D.SubBalA = COALESCE(P.SubBalA, 0) AND
        D.SubBalB = COALESCE(P.SubBalB, 0) AND
        ...
    )

    That sort of gets rid of the inequality by reversing the logic, but I can't see how to get rid of the COALESCE without using CASE expressions. I could use ISNULL() instead, but it is still a function.

    How about this?

    AND NOT
    (
        (
            D.SubBalA = P.SubBalA AND P.SubBalA IS NOT NULL
            OR
            D.SubBalA = 0 AND P.SubBalA IS NULL
        )
        AND
        (
            D.SubBalB = P.SubBalB AND P.SubBalB IS NOT NULL
            OR
            D.SubBalB = 0 AND P.SubBalB IS NULL
        )
    )

    Yes, it is harder to read, but if you leave your original non-sargable predicates in as comments, it will be clear to whoever has to work on this later what you are trying to do.

    --Edit: Missed a set of parentheses.


  • aaron.reese (4/3/2012)


    Grant Fritchey (4/3/2012)


    And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    Why? I'm not saying it won't, but what makes you think it might? It is only on the CTE, which has a result set of 143 records and is simply a way to find records with two adjacent (but not always contiguous) dates.

    Would I get better performance by copying the relevant fields and records to temp tables and putting indexes on them? I think the copy of the records will take about 20 minutes, but at least it will use an index, and then it will only match 3.7M x 3.71M rows rather than 476M x 3.7M.

    What does the execution plan look like? I could be wrong.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Just a quick update.

    I cancelled the original script after 19 hours because it still had not completed.

    I made the modifications as suggested by Sean and re-ran it.

    It took 1:05 to run, so a huge thanks to all who have given me assistance with this issue. To those who offer advice in general: please know that it is always appreciated.

    I am now testing running the script in two separate sessions at the same time to see what the timing effects are. In total the script needs to be run 150 times, so I may just put it in a stored procedure and run it 24/7 over the weekend and overnight when there are no users on the system. This way I can get my data converted in under a week.
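
    The wrapper I have in mind looks roughly like this (dbo.ExtractChangedBals is hypothetical - the script above wrapped in a stored procedure taking the run date as a parameter):

        DECLARE @d datetime;

        DECLARE date_cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT DISTINCT update_date FROM Balances ORDER BY update_date;

        OPEN date_cur;
        FETCH NEXT FROM date_cur INTO @d;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- hypothetical proc: the change-extraction script parameterised on date
            EXEC dbo.ExtractChangedBals @Date = @d;
            FETCH NEXT FROM date_cur INTO @d;
        END;
        CLOSE date_cur;
        DEALLOCATE date_cur;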

  • Thanks for the kind words and letting us know it worked. It is always nice to hear that our suggestions worked. Feel free to post back if you have any other issues.


  • aaron.reese (4/3/2012)


    The columns of interest are AccountID, UpdateDate, SubBal-A, SubBal-B, ... SubBal-K.

    ...

    Currently the table has an index on AccountID only and an index on UpdateDate only.

    Both indexes are non-clustered.

    ...

    The estimated query plan indicates that it is doing a table scan of all 476M records which is clearly what is killing it!

    ...

    I am trying to find all accounts for a given date where the account either did not exist on the previous file or one or more of the subBals has changed.

    ...

    It sounds like your table in production doesn't currently have a primary key.

    Consider creating a clustered primary key on (UpdateDate, AccountID), and removing the existing non-clustered indexes on AccountID and UpdateDate.
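
    For illustration, something along these lines (the constraint and index names are made up, and this assumes no duplicate (UpdateDate, AccountID) pairs exist in the table):

        -- Assumption: each account appears at most once per update date
        ALTER TABLE Balances
            ADD CONSTRAINT PK_Balances
            PRIMARY KEY CLUSTERED (UpdateDate, AccountID);

        -- Hypothetical names for the existing single-column indexes
        DROP INDEX IX_Balances_AccountID ON Balances;
        DROP INDEX IX_Balances_UpdateDate ON Balances;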

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
