Performance issues on huge table (40 columns, 500M rows)

  • I am having severe performance issues on a table I have inherited (it's never MY fault! :-D)

    The table holds balances and some other details for approx 3.7M accounts and is written to 4 times a week.

    Each time data is written to the table some more accounts are added, but as far as I can tell we never stop writing balances, even when an account is cleared.

    The columns of interest are AccountID, UpdateDate, SubBal-A, SubBal-B, ... SubBal-K.

    Currently the table has an index on AccountID only and an index on UpdateDate only.

    Both indexes are non-clustered.

    I have a query which performs well on my development data (200K accounts, 4 files of balances) with an additional unique index on Acc_no, UpdateDate - time to query approx 11 seconds.

    This also performs adequately on a separate database of live data (3.7M accounts, 4 files of balances) with the extra index - time to query approx 7 minutes - which I will live with

    On the production server it has so far been running for 52 minutes and has not returned any records.

    The estimated query plan indicates that it is doing a table scan of all 476M records, which is clearly what is killing it!

    I am trying to find all accounts for a given date where the account either did not exist on the previous file or one or more of the subBals has changed.

    The script is below: tables and fields have been anonymised and the number of comparators reduced for brevity.

    DECLARE @Date AS datetime;
    SET @Date = '20120330';

    WITH DATE_RANK AS
    (
        -- Get the distinct processing dates
        SELECT DISTINCT
            update_date
        FROM Balances
    ),
    DATE_RANK2 AS
    (
        -- Rank the processing dates so that we can link each date to the next processing date
        SELECT
            update_date,
            RANK() OVER (ORDER BY update_date DESC) AS 'Rank'
        FROM DATE_RANK
    ),
    CTE AS
    (
        -- Get the records for today and join to those on the previous run:
        -- use DATE_RANK2 to do this so that I can find the previous date even where dates are non-contiguous.
        -- D = today's detail records
        -- P = previous run's detail records
        SELECT
            D.Acc_no,
            D.Update_date,
            P.Update_date AS 'Prev_update_date',
            D.SubBalA,
            D.SubBalB,
            D.SubBalC,
            D.SubBalD
        FROM Balances D
        JOIN Date_rank2 DRD
            ON DRD.update_date = D.Update_date
        LEFT JOIN Date_rank2 DRP            -- get the previous date
            ON DRP.rank = DRD.rank + 1
        LEFT JOIN Balances P                -- get the records for the previous date
            ON P.Acc_no = D.Acc_no
            AND P.Update_date = DRP.Update_date
        WHERE D.update_date = @Date
        AND
        (
            -- Compare current to previous (if no prev, then at least one SubBal is not zero)
            D.SubBalA <> COALESCE(P.SubBalA, 0) OR
            D.SubBalB <> COALESCE(P.SubBalB, 0) OR
            D.SubBalC <> COALESCE(P.SubBalC, 0) OR
            D.SubBalD <> COALESCE(P.SubBalD, 0)
        )
    )
    -- Insert the changed recordset into a new, shorter table that only contains
    -- records which differ from their predecessor
    INSERT INTO ChangedBals
    (
        Acc_no,
        update_date,
        SubBalA,
        SubBalB,
        SubBalC,
        SubBalD
    )
    SELECT
        Acc_no,
        update_date,
        SubBalA,
        SubBalB,
        SubBalC,
        SubBalD
    FROM CTE;

    I am thinking that the cause of the table scan is that it has to compare the balance fields, which are not part of any index. Would performance improve if I put them into a covering index? A new index will take many hours to build due to the size of the table, so I don't want to waste my time with it (plus it requires a change request). Or do I just need my extra index on Acc_no/UpdateDate?
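
    For illustration, the covering index I have in mind would be something along these lines (the index name is made up; columns as per the anonymised script above):

        CREATE NONCLUSTERED INDEX IX_Balances_Date_Acc_Covering
            ON Balances (Update_date, Acc_no)
            INCLUDE (SubBalA, SubBalB, SubBalC, SubBalD);

    Keyed on Update_date first, the WHERE D.update_date = @Date filter could seek, and the included SubBal columns would mean the comparison never has to touch the base table.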

    Or something else (i.e. an entirely new approach)? The plan is that once I have the relevant records out of the gigantic table we can truncate it, because all the info we need will be in ChangedBals.

  • Your scan is because your WHERE clause is not sargable:

        D.SubBalA <> COALESCE(P.SubBalA, 0) OR
        D.SubBalB <> COALESCE(P.SubBalB, 0) OR
        D.SubBalC <> COALESCE(P.SubBalC, 0) OR
        D.SubBalD <> COALESCE(P.SubBalD, 0)

    You have both an inequality AND a function; both of these are non-sargable.
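
    To illustrate (using the column names from the script above; this is just a sketch, not a drop-in fix):

        -- Non-sargable: the column is wrapped in a function, and <> is an
        -- inequality, so the optimizer cannot seek an index on SubBalA
        WHERE D.SubBalA <> COALESCE(P.SubBalA, 0)

        -- Sargable: a bare column compared with = to a variable or constant
        -- allows an index seek
        WHERE D.update_date = @Date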

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Sean,

    Thanks for that. I can rewrite it as

    AND NOT (
        D.SubBalA = COALESCE(P.SubBalA, 0) AND
        D.SubBalB = COALESCE(P.SubBalB, 0) AND
        ...
    )

    That sort of gets rid of the inequality by reversing the logic, but I can't see how to get rid of the COALESCE without using CASE expressions. I could use ISNULL() instead, but it is still a function.

  • Grant Fritchey (4/3/2012)


    And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    Why? I'm not saying it won't, but what makes you think it might? It is only on the CTE, which has a result set of 143 records and is simply a way to find records with two adjacent (but not always contiguous) dates.

    Would I get better performance by copying the relevant fields and records to temp tables and putting indexes on them? I think the copy of the records will take about 20 minutes, but at least it will use an index, and then it will only match 3.7M x 3.71M rows rather than 476M x 3.7M.
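
    A rough sketch of what I mean (the temp table and index names are made up; @PrevDate stands for the previous run's date, which I would get from the rank CTEs):

        DECLARE @Date datetime;
        DECLARE @PrevDate datetime;
        SET @Date = '20120330';
        SET @PrevDate = '20120323';  -- made-up previous run date

        -- Copy only the two runs being compared into temp tables
        SELECT Acc_no, Update_date, SubBalA, SubBalB, SubBalC, SubBalD
        INTO #Today
        FROM Balances
        WHERE Update_date = @Date;

        SELECT Acc_no, Update_date, SubBalA, SubBalB, SubBalC, SubBalD
        INTO #Prev
        FROM Balances
        WHERE Update_date = @PrevDate;

        -- Index the temp tables so the comparison join can seek on Acc_no
        CREATE UNIQUE CLUSTERED INDEX IX_Today ON #Today (Acc_no);
        CREATE UNIQUE CLUSTERED INDEX IX_Prev ON #Prev (Acc_no);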

  • aaron.reese (4/3/2012)


    Sean,

    Thanks for that. I can rewrite it as

    AND NOT (
        D.SubBalA = COALESCE(P.SubBalA, 0) AND
        D.SubBalB = COALESCE(P.SubBalB, 0) AND
        ...
    )

    That sort of gets rid of the inequality by reversing the logic, but I can't see how to get rid of the COALESCE without using CASE expressions. I could use ISNULL() instead, but it is still a function.

    How about this?

    AND NOT
    (
        (
            D.SubBalA = P.SubBalA AND P.SubBalA IS NOT NULL
            OR
            D.SubBalA = 0 AND P.SubBalA IS NULL
        )
        AND
        (
            D.SubBalB = P.SubBalB AND P.SubBalB IS NOT NULL
            OR
            D.SubBalB = 0 AND P.SubBalB IS NULL
        )
    )

    Yes, it is harder to read, but if you leave your original non-sargable predicates in as comments, it will be clear to whoever has to work on this later what you are trying to do.

    --Edit: Missed a set of parentheses.


  • aaron.reese (4/3/2012)


    Grant Fritchey (4/3/2012)


    And the JOIN criteria

    DRP.rank = DRD.rank + 1

    I suspect that's going to lead to scans too.

    Why? I'm not saying it won't, but what makes you think it might? It is only on the CTE, which has a result set of 143 records and is simply a way to find records with two adjacent (but not always contiguous) dates.

    Would I get better performance by copying the relevant fields and records to temp tables and putting indexes on them? I think the copy of the records will take about 20 minutes, but at least it will use an index, and then it will only match 3.7M x 3.71M rows rather than 476M x 3.7M.

    What does the execution plan look like? I could be wrong.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Just a quick update.

    I cancelled the original script after 19 hours because it still had not completed.

    I made the modifications as suggested by Sean and re-ran it.

    It took 1:05 to run, so a huge thanks to all who have given me assistance with this issue. To those who offer advice in general: please know that it is always appreciated.

    I am now testing running the script in two separate sessions at the same time to see what the timing effects are. In total the script needs to be run 150 times, so I may just put it in a stored procedure and run it 24/7 over the weekend and overnight when there are no users on the system. This way I can get my data converted in under a week.
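
    The wrapper I have in mind looks roughly like this (dbo.ExtractChangedBals is hypothetical - the script above wrapped in a stored procedure taking the run date as a parameter):

        DECLARE @d datetime;

        DECLARE date_cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT DISTINCT update_date FROM Balances ORDER BY update_date;

        OPEN date_cur;
        FETCH NEXT FROM date_cur INTO @d;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- hypothetical proc: the change-extraction script parameterised on date
            EXEC dbo.ExtractChangedBals @Date = @d;
            FETCH NEXT FROM date_cur INTO @d;
        END;
        CLOSE date_cur;
        DEALLOCATE date_cur;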

  • Thanks for the kind words and letting us know it worked. It is always nice to hear that our suggestions worked. Feel free to post back if you have any other issues.


  • aaron.reese (4/3/2012)


    The columns of interest are AccountID, UpdateDate, SubBal-A, SubBal-B, ... SubBal-K.

    ...

    Currently the table has an index on AccountID only and an index on UpdateDate only.

    Both indexes are non-clustered.

    ...

    The estimated query plan indicates that it is doing a table scan of all 476M records which is clearly what is killing it!

    ...

    I am trying to find all accounts for a given date where the account either did not exist on the previous file or one or more of the subBals has changed.

    ...

    It sounds like your table in production doesn't currently have a primary key.

    Consider creating a clustered primary key on (UpdateDate, AccountID), and removing the existing non-clustered indexes on AccountID and UpdateDate.
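
    For illustration, something along these lines (the constraint and index names are made up, and this assumes no duplicate (UpdateDate, AccountID) pairs exist in the table):

        -- Assumption: each account appears at most once per update date
        ALTER TABLE Balances
            ADD CONSTRAINT PK_Balances
            PRIMARY KEY CLUSTERED (UpdateDate, AccountID);

        -- Hypothetical names for the existing single-column indexes
        DROP INDEX IX_Balances_AccountID ON Balances;
        DROP INDEX IX_Balances_UpdateDate ON Balances;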

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
