May 28, 2010 at 3:06 am
Hi,
I'm trying to find a set-based approach to this scenario. Let's say I have the following data:
declare @data table(
thedate datetime,
val1 integer)
insert into @data values ('2006/12/21',3500)
insert into @data values ('2006/12/22',3500)
insert into @data values ('2006/12/23',3500)
insert into @data values ('2006/12/24',3500)
insert into @data values ('2006/12/25',3500)
insert into @data values ('2006/12/26',3500)
insert into @data values ('2006/12/27',3500)
[font="Courier New"]
I need to return the result below, where calc is:
1. 0 for the first row
2. the current row's val1 plus the previous row's calc value
thedate       val1    calc
2006/12/21    3500        0
2006/12/22    3500     3500
2006/12/23    3500     7000
2006/12/24    3500    10500
2006/12/25    3500    14000
2006/12/26    3500    17500
2006/12/27    3500    21000
Any ideas welcome.
Update: I forgot to state that this is not a running total issue; it's a contrived example of a much more complex financial calculation.
Cheers
May 28, 2010 at 3:16 am
This sort of problem is called a "Running Total". All you need is to go through one of the finest articles describing what running totals are and how to tackle them without cursors or while loops; it was written by MVP Jeff Moden.
Link: Solving the "Running Total" & "Ordinal Rank" Problems (Rewritten)
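For the simple additive case in the sample data above (though, per the update, not necessarily the real calculation), a windowed SUM is another set-based option; note that it needs SQL Server 2012 or later, and the query below is only a minimal sketch against the @data table from the first post:
-- A minimal sketch, assuming SQL Server 2012+ (windowed SUM with a frame) and
-- the @data table declared in the first post. Because calc is defined as 0 on
-- the first row, the running sum is shifted by subtracting the first val1.
SELECT  thedate,
        val1,
        SUM(val1) OVER (ORDER BY thedate ROWS UNBOUNDED PRECEDING)
            - FIRST_VALUE(val1) OVER (ORDER BY thedate) AS calc
FROM    @data
ORDER BY thedate;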
Hope that helps you!
May 28, 2010 at 3:21 am
I forgot to state that this is not a running total issue; it's a contrived example of a much more complex financial calculation.
May 28, 2010 at 3:55 am
You can do this using a recursive CTE:
; WITH cte_data AS
(
    SELECT ROW_NUMBER() OVER ( ORDER BY thedate ) Row, *
    FROM   @data
), cte_final_data AS
(
    SELECT Row, thedate, val1, 0 calc
    FROM   cte_data
    WHERE  Row = 1
    UNION ALL
    SELECT data.Row, data.thedate, data.val1, final.calc + data.val1
    FROM   cte_data data
    INNER JOIN cte_final_data final ON data.Row = final.Row + 1
)
SELECT thedate, val1, calc
FROM   cte_final_data
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
May 28, 2010 at 4:01 am
Thanks "SSC Eights!" but this wont for larger datasets due the heavy recursion.
May 28, 2010 at 4:27 am
Add the MAXRECURSION option
SELECT thedate, val1, calc
FROM cte_final_data
OPTION ( MAXRECURSION 0 )
This will work for larger datasets (0 means no recursion limit here). But yes, I have no idea how this will affect performance.
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
May 28, 2010 at 6:01 am
Unfortunately performance is not great. If I bump up the test case to 3K rows, it's 1 second for the cursor-based approach and 15 seconds for the recursive CTE.
May 29, 2010 at 8:36 am
With the best will in the world, it's impossible to suggest an appropriate set-based solution without knowing more about what the complex financial calculation does. I realise you had to simplify things for a forum question, but those simplifications have removed the core of the problem on this occasion.
This seems to be a variant of the 'running totals' problem since the value for row n depends partly on the value in row (n-1). A more efficient set-based solution can usually be found in this class of problem, but it does depend on the details. Also, any efficient set-based solution will likely require a rewrite of whatever function or stored procedure currently performs the required magic.
Whether it is worth pursuing a set-based alternative, depends heavily on the characteristics of the existing method. Is it running unacceptably slowly at the moment? Is this slowness definitely caused by an existing cursor or loop?
Often, an improvement to the algorithm, indexing, or data structures can provide more of a benefit than simply replacing a loop with set-based code. So... it depends 🙂
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
May 30, 2010 at 1:30 am
The real-world calculation is Cumulative Performance; I can post the formula on Monday. The only complexity for me lies in the fact that I need the previous row's value for the same column, and that value is itself a calculation. I don't think you can really re-engineer this: you basically loop through an ordered result set, the first row is always zero, and thereafter the value is a calculation based on other values in the same row plus the previous result of the calculation. The tricky part is that you need the previous value of a calculation for the same column. If you needed another column's previous value, or the previous value of the same column where that value was static, this would be a trivial task.
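To make the shape of that loop concrete, here is a minimal cursor sketch; the CASE expression is only a stand-in for the real Cumulative Performance formula, and it assumes the @data table also has a calc column and a unique thedate:
-- Minimal sketch of the ordered loop described above. The formula inside the
-- CASE is a placeholder; the real calculation would combine other columns in
-- the current row with @prev_calc (the previous row's calc value).
DECLARE @thedate   datetime,
        @val1      integer,
        @calc      integer,
        @prev_calc integer;   -- NULL until the first row has been processed

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT thedate, val1
    FROM   @data
    ORDER BY thedate;

OPEN c;
FETCH NEXT FROM c INTO @thedate, @val1;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @calc = CASE
                    WHEN @prev_calc IS NULL THEN 0   -- first row is always zero
                    ELSE @val1 + @prev_calc          -- placeholder for the real formula
                END;

    UPDATE @data SET calc = @calc WHERE thedate = @thedate;  -- assumes thedate is unique

    SET @prev_calc = @calc;
    FETCH NEXT FROM c INTO @thedate, @val1;
END;

CLOSE c;
DEALLOCATE c;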
May 30, 2010 at 2:10 am
Well that all sounds very much like a problem that can be solved and optimised with tweaked 'running totals' code or a custom aggregate.
You don't need to post the exact formula or data, just something representative of the problem you face, in a way that is easy to work on.
Please also explain any non-obvious terms, for example "Cumulative Performance" means nothing special to me 🙂
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
May 31, 2010 at 7:59 am
ColdCoffee was on the money: the "quirky update" method worked a treat. Thanks ColdCoffee!
Here is the complete solution:
declare @calc integer
declare @data table(
thedate datetime,
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
;with results
as (select top 2147483647 thedate,
val1,
calc
from @data
order by thedate)
update results
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end
output inserted.thedate,
inserted.val1,
inserted.calc
NB: I'm not sure whether the TOP and ORDER BY are enough to guarantee the correct order; if someone can comment on that, it would be great.
Cheers
May 31, 2010 at 11:17 am
all4miller (5/31/2010)
NB: I'm not sure whether the TOP and ORDER BY are enough to guarantee the correct order; if someone can comment on that, it would be great.
Heh... I'm thinking that you didn't actually read the article ColdCoffee provided a link to. Read it and then come back and tell us what essential part that code is missing.
The reason I'm being such a "jerk" about this is that this technique is very controversial, and rightly so. If you do it wrong, you could seriously damage some data. The only way to learn the right way to do it is to read the article to achieve a deep understanding of the rules, and then be able to recite the rules from memory. OSHA has declared that if one more person rides my hiney about the technique, I need to install hand rails on my butt. 😛 Either study the article so you can always do it right or don't use the technique... I don't need someone getting it wrong. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 12:07 pm
Thanks for commenting Jeff. I actually got my result from The ‘Subscription List’ SQL Problem, but since the part I used is essentially the same as your post, I thought ColdCoffee deserved the credit.
I re-read your article, especially the rules section, and I guess the stuff I'm missing is stuff that would be implicitly done by SQL Server behind the scenes... or not... I have only been using SQL Server for 5 months 🙂
Please let me know what I missed
The RULES
1. CLUSTERED INDEX MUST BE PRESENT IN THE CORRECT ORDER: you cannot create a clustered index on a table variable
3. DON'T WORK AGAINST PARTITIONED STRUCTURES: not an issue in this case
4. USE THE TABLOCKX HINT: If you're updating a local Temp Table, only the current session has access to that Temp Table which means that the TABLOCKX hint really isn't required. also not an issue in this case
5. DO NOT USE JOINS: not an issue in this case
6. YOU MUST HAVE AN "ANCHOR" COLUMN: I think this could be the missing factor, yes?
7. DO NOT USE ORDER BY: you cannot have a clustered index on a table variable so I had to use an order by
8. DO NOT USE INDEX HINTS TO TRY TO FORCE ORDER: not an issue in this case
So I think it's either the missing anchor column, or basically table variables just won't work?
Cheers
May 31, 2010 at 2:33 pm
all4miller (5/31/2010)
Thanks for commenting Jeff. I actually got my result from The ‘Subscription List’ SQL Problem, but since the part I used is essentially the same as your post, I thought ColdCoffee deserved the credit. I re-read your article, especially the rules section, and I guess the stuff I'm missing is stuff that would be implicitly done by SQL Server behind the scenes... or not... I have only been using SQL Server for 5 months 🙂
Please let me know what I missed
The RULES
1. CLUSTERED INDEX MUST BE PRESENT IN THE CORRECT ORDER: you cannot create a clustered index on a table variable
3. DON'T WORK AGAINST PARTITIONED STRUCTURES: not an issue in this case
4. USE THE TABLOCKX HINT: If you're updating a local Temp Table, only the current session has access to that Temp Table which means that the TABLOCKX hint really isn't required. also not an issue in this case
5. DO NOT USE JOINS: not an issue in this case
6. YOU MUST HAVE AN "ANCHOR" COLUMN: I think this could be the missing factor, yes?
7. DO NOT USE ORDER BY: you cannot have a clustered index on a table variable so I had to use an order by
8. DO NOT USE INDEX HINTS TO TRY TO FORCE ORDER: not an issue in this case
So I think it's either the missing anchor column, or basically table variables just won't work?
Cheers
Heh... oddly enough, the post above left out the rule of concern here...
[font="Arial Black"]2. PARALLELISM MUST BE PREVENTED:[/font] You MUST prevent parallelism from occurring. Therefore, you MUST include OPTION (MAXDOP 1) in any such code.
So, if you make that small tweak to your original code, you'll be golden... well, except for the fact that I don't yet trust the "order by" implied by the ROW_NUMBER() because I've had it break on a slightly different problem. It's probably OK here, though, but only because the table variable was created as a heap...
declare @calc integer
declare @data table(
thedate datetime,
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
;with results
as (select top 2147483647 thedate,
val1,
calc
from @data
order by thedate)
update results
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end
output inserted.thedate, --this is a reasonable substitute for an "anchor" column but I'd still take the time to include one
inserted.val1,
inserted.calc
OPTION (MAXDOP 1) --absolutely required
Also, you state that you can't put a clustered index on a table variable... any bets there? 😉
declare @calc integer,
@TheDate DATETIME
declare @data table(
thedate datetime PRIMARY KEY CLUSTERED, --Look! Clustered index gets made by this
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
update @data
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end,
@TheDate = TheDate --Anchor, a much better guarantee than an OUTPUT statement.
output inserted.thedate,
inserted.val1,
inserted.calc
FROM @data --Can't put a WITH(TABLOCKX) on a table variable but would do if this was a temp table
OPTION (MAXDOP 1) --absolutely required
Either way is fine, but both absolutely must have the OPTION (MAXDOP 1) code to prevent parallelism. Yeah, I know... lots of folks are going to come back and say the sorted ROW_NUMBER() makes it so it doesn't matter if parallelism occurs... I have some code at work on another method (i.e., not the quirky update) where ROW_NUMBER() didn't guarantee the order until I added MAXDOP 1. I'll try to remember to post it here. In the meantime, don't take the chance, because I've seen parallelism destroy the order of ROW_NUMBER() before. In fact, I'll trust the update order based on a clustered index long before I trust the order of ROW_NUMBER() during an update.
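For reference, here is a minimal sketch of the temp-table variant alluded to in the comments above, where both the clustered index and the WITH (TABLOCKX) hint can actually be applied; it follows the rules quoted earlier but is an illustration, not code from the article:
-- Sketch only: the quirky update against a temp table, so WITH (TABLOCKX) and a
-- clustered index in the update order can both be used (rules 1 and 4).
CREATE TABLE #data
(
    thedate datetime PRIMARY KEY CLUSTERED,  -- clustered index in the correct order
    val1    integer,
    calc    integer
);

INSERT INTO #data (thedate, val1) VALUES ('2006/12/21', 3500);
INSERT INTO #data (thedate, val1) VALUES ('2006/12/22', 3500);
INSERT INTO #data (thedate, val1) VALUES ('2006/12/23', 3500);  -- remaining sample rows as in the posts above

DECLARE @calc    integer,
        @TheDate datetime;

UPDATE #data WITH (TABLOCKX)                 -- rule 4: exclusive table lock
SET    @calc    = calc = CASE
                             WHEN @calc IS NULL THEN 0
                             ELSE val1 + @calc
                         END,
       @TheDate = thedate                    -- rule 6: anchor column
OPTION (MAXDOP 1);                           -- rule 2: prevent parallelism

SELECT thedate, val1, calc
FROM   #data
ORDER BY thedate;

DROP TABLE #data;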
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 2:52 pm
Sweet thanks Jeff! 🙂