September 13, 2011 at 1:30 pm
I've been working through Jeff Moden's great article:
http://www.sqlservercentral.com/articles/T-SQL/68467/
as I have a performance issue with a running total.
A brief question on the cursor method - Figure 6 in the article. As far as I can see, this starts with a 1 million row table in which the running total is NULL for every row.
My scenario is slightly different: an approx. 1 million row table that holds the previously calculated running totals for all rows, into which I insert, daily, a batch of approx. 10,000 new rows (which I can grab from the Inserted table). So, what is the best way to do this when 99% of the running totals in the table have already been calculated? I need to somehow find the last non-null running total, and it can't be right to use a cursor to go through all 1 million rows just to find it, as calculation only needs to start at the first null running total.
I've not provided any DDL or test data as I'm hoping there's an obvious way to go on this, or perhaps a link to point me in the right direction for this scenario - I know there's a lot out there on this topic!
September 13, 2011 at 2:20 pm
Step 1: get the c.u.r.s.o.r. *cough* approach out of your head.
Step 2: get the relevant max values off the "big table".
Step 3: perform the running total on the 10k-row "staging table", adding the related max values (either directly in the update statement or afterwards)
Step 4: insert the rows in the "big table".
Of course, this requires guaranteeing that there's no other process inserting rows directly into the "big table" without the precalculated running total. That's probably the trickiest part of it...
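A rough sketch of steps 2 through 4, with assumed table and column names (dbo.BigTable, #Staging, AccountID, TranDate, Amount, RunningTotal) rather than the real schema:

-- Step 2: last stored running total per account off the big table
-- (using MAX assumes the largest stored total is also the latest one)
SELECT AccountID, MAX(RunningTotal) AS BaseTotal
INTO #BaseTotals
FROM dbo.BigTable
GROUP BY AccountID;

-- Step 3: running total over the 10k staging rows, seeded with the base totals
-- (a correlated sub-query is shown for brevity; the quirky update from the
-- article is the faster option on bigger batches)
UPDATE s
SET RunningTotal = ISNULL(b.BaseTotal, 0)
                 + (SELECT SUM(s2.Amount)
                    FROM #Staging s2
                    WHERE s2.AccountID = s.AccountID
                      AND s2.TranDate <= s.TranDate)
FROM #Staging s
LEFT JOIN #BaseTotals b ON b.AccountID = s.AccountID;

-- Step 4: append the freshly totalled rows to the big table
INSERT dbo.BigTable (AccountID, TranDate, Amount, RunningTotal)
SELECT AccountID, TranDate, Amount, RunningTotal
FROM #Staging;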
September 13, 2011 at 7:00 pm
zapouk (9/13/2011)
I need to somehow find the last non-null running total, and it can't be right to use a cursor to go through all 1 million rows just to find it, as calculation only needs to start at the first null running total. I've not provided any DDL or test data as I'm hoping there's an obvious way to go on this, or perhaps a link to point me in the right direction for this scenario - I know there's a lot out there on this topic!
An appropriate filtered index (using WHERE running_total_column IS NULL as its predicate) would help you locate the row quickly.
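For example, something along these lines (the index, table, and column names are only illustrative):

CREATE NONCLUSTERED INDEX IX_BigTable_RunningTotal_Null
ON dbo.BigTable (TranDate)
WHERE RunningTotal IS NULL;

-- with the filtered index in place, finding the starting point is a small seek
SELECT MIN(TranDate) AS FirstUncalculatedDate
FROM dbo.BigTable
WHERE RunningTotal IS NULL;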
September 13, 2011 at 11:53 pm
Oddly enough, it wouldn't take much more time to do the Quirky Update on a million row table than it would to go through all the shenanigans of finding the MAX row with a running total, etc, etc.
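For anyone new to the term, the core of the Quirky Update pattern from the article looks roughly like this (a sketch only - the table and column names are made up, the clustered index must already define the running-total order, and the article's extra safety checks are left out):

DECLARE @RunningTotal MONEY;
SET @RunningTotal = 0;

UPDATE bt
SET @RunningTotal = bt.RunningTotal = @RunningTotal + bt.Amount  -- carry the total row by row
FROM dbo.BigTable bt WITH (TABLOCKX)  -- keep other sessions out while it runs
OPTION (MAXDOP 1);                    -- a parallel plan would break the ordering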
--Jeff Moden
Change is inevitable... Change for the better is not.
September 14, 2011 at 1:03 am
Jeff Moden (9/13/2011)
Oddly enough, it wouldn't take much more time to do the Quirky Update on a million row table than it would to go through all the shenanigans of finding the MAX row with a running total, etc, etc.
Do you mean to re-calc the running total for all rows of the 1mill table after the insert?
That would basically mean an update of all rows (including an unknown number of affected indexes or maybe even indexed views) would be faster than a select max() on a narrow, filtered index, a quirky update on 10k rows, and an insert. Or am I missing something?
PS: I don't think you'll have to set up a test scenario (especially without the relevant DDL information for the specific case)... 😉
Much better choice: GO OUT AND CELEBRATE!!!!!!!
September 14, 2011 at 1:48 am
Thanks to all for the advice, I'll go away and try to re-write my running total...code will follow if I hit problems!
There seem to be two issues:
1. How to calculate the running total (btw, I hardly ever use cursors and am happy to do this using a correlated sub-query, but many say this is a possible exception for performance reasons)
2. Do I use a staging table, calculate the running totals in it (by any method), update it with the 'max' from the main table, and then insert it into the main table? Or reverse the steps: insert into the main table and calculate the running totals in situ? I guess the first way (as LutzM suggests), but I'm not sure whether this holds for calculating a 'running difference'.
September 15, 2011 at 7:47 am
LutzM (9/14/2011)
Jeff Moden (9/13/2011)
Oddly enough, it wouldn't take much more time to do the Quirky Update on a million row table than it would to go through all the shenanigans of finding the MAX row with a running total, etc, etc.
Do you mean to re-calc the running total for all rows of the 1mill table after the insert?
That would basically mean an update of all rows (including an unknown number of affected indexes or maybe even indexed views) would be faster than a select max() on a narrow, filtered index, a quirky update on 10k rows, and an insert. Or am I missing something?
PS: I don't think you'll have to set up a test scenario (especially without the relevant DDL information for the specific case)... 😉
Much better choice: GO OUT AND CELEBRATE!!!!!!!
We'd certainly have to do a test case at some point to figure out the best way (I'm actually opposed to storing running totals in permanent tables for various reasons we can get into later) but consider that you'd have to find the max row for each affected account and start the QU from there.
What would likely be the easiest and quickest (if you didn't want to do an update on the whole table and still want to store the data in a permanent table) is a subset of what I previously suggested... just do a QU on only the accounts affected, without regard to finding the max row that has a running total.
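One way that could look (a sketch with assumed names, written as if the new batch has already been inserted into the big table): copy the affected accounts' rows into a work table that has the clustered index the QU needs, rebuild their totals there, then write the results back.

SELECT AccountID, TranDate, Amount, CAST(NULL AS MONEY) AS RunningTotal
INTO #Work
FROM dbo.BigTable
WHERE AccountID IN (SELECT DISTINCT AccountID FROM #Staging);

CREATE CLUSTERED INDEX IX_Work ON #Work (AccountID, TranDate);

DECLARE @Total MONEY, @PrevAccountID INT;

-- restart the running total whenever the account changes
UPDATE w
SET @Total         = w.RunningTotal = CASE WHEN w.AccountID = @PrevAccountID
                                           THEN @Total + w.Amount
                                           ELSE w.Amount END,
    @PrevAccountID = w.AccountID
FROM #Work w WITH (TABLOCKX)
OPTION (MAXDOP 1);

-- assumes (AccountID, TranDate) identifies a row; adjust the join to the real key
UPDATE bt
SET bt.RunningTotal = w.RunningTotal
FROM dbo.BigTable bt
JOIN #Work w ON w.AccountID = bt.AccountID AND w.TranDate = bt.TranDate;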
--Jeff Moden
Change is inevitable... Change for the better is not.
September 15, 2011 at 11:58 am
I'm actually opposed to storing running totals in permanent tables for various reasons we can get into later.
Given the implications for me, you'd better tell me sooner rather than later. 🙂
September 15, 2011 at 2:01 pm
zapouk (9/15/2011)
I'm actually opposed to storing running totals in permanent tables for various reasons we can get into later.
Given the implications for me, you'd better tell me sooner rather than later. 🙂
zapouk,
Consider inserting a row into the middle of the group or updating a value used in the running total in the middle of the group. These types of things invalidate all running totals for the group after the insert/update. Those are just off the top of my head. Keeping calculated columns in a table where the calculations depend on more than the current row in the same table is generally not a good idea. You could do it with INSERT/UPDATE triggers, but that's probably an even worse idea.
Todd Fifield
September 15, 2011 at 2:19 pm
I think I've covered that one by rebuilding the running totals from the earliest date in the table that will be affected by the new batch of rows. Of course, this assumes that inserts/updates on the table are restricted to this one stored procedure.
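In rough terms (assumed names again), that just means:

DECLARE @EarliestAffected DATETIME;

SELECT @EarliestAffected = MIN(TranDate)
FROM #Staging;                         -- whatever holds the new batch of rows

UPDATE dbo.BigTable                    -- clear the totals that are now stale...
SET RunningTotal = NULL
WHERE TranDate >= @EarliestAffected;   -- ...and rebuild them from this date onwards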
September 15, 2011 at 4:50 pm
zapouk (9/15/2011)
I think I've covered that one by rebuilding the running totals from the earliest date in the table that will be affected by the new batch of rows. Of course, this assumes that inserts/updates on the table are restricted to this one stored procedure.
You can certainly do that but, unless you have a trigger in place that I don't know about, there's nothing to guarantee that someone hasn't inserted a wayward row or made an "adjustment" somewhere even before your latest insert. Considering that the QU will hammer out a running total on a million rows in about 3 seconds, why not make it part of the batch process that inserts the 10,000 rows?
So far as having running totals in permanent tables goes... I prefer not to do things that way for some of the very reasons that Todd mentioned. Another reason is that, typically, the Clustered Index the QU needs is in exactly the wrong column order for preventing page splits on large and, sometimes, very busy tables. Normally such tables deserve a Clustered index on something that more closely represents the inserted order of the rows, such as a transaction date or IDENTITY column.
{EDIT} Just to be clear on the above, the Clustered Index doesn't necessarily have to be on the PRIMARY KEY column(s). A lot of good folks think that a PK also means "Clustered Index", but that's just because that's the default. On large OLTP tables, there can be a huge maintenance advantage (VERY few page splits) to having a Clustered index on a transaction date or IDENTITY column and a PK on some other columns.
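As illustrative DDL only (not anyone's actual schema), that arrangement might look like this:

CREATE TABLE dbo.AccountLedger
(
    LedgerID     INT IDENTITY(1,1) NOT NULL,
    AccountID    INT      NOT NULL,
    TranDate     DATETIME NOT NULL,
    Amount       MONEY    NOT NULL,
    RunningTotal MONEY    NULL,
    CONSTRAINT PK_AccountLedger
        PRIMARY KEY NONCLUSTERED (AccountID, TranDate, LedgerID)  -- PK enforced with a nonclustered index
);

-- clustered on the insert order (the IDENTITY column) to keep page splits rare
CREATE UNIQUE CLUSTERED INDEX CIX_AccountLedger_LedgerID
ON dbo.AccountLedger (LedgerID);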
--Jeff Moden
Change is inevitable... Change for the better is not.
October 27, 2011 at 8:57 am
Just an update: I decided to abandon the cursor method for calculating running totals and went straight to the Quirky Update - I followed Jeff's article to the letter and it works like a dream. Thanks!
November 24, 2011 at 2:01 pm
My most sincere apologies for the late reply. Thank you very much for the feedback.
--Jeff Moden
Change is inevitable... Change for the better is not.