May 28, 2010 at 3:06 am
Hi,
I'm trying to find a set-based approach to this scenario. Let's say I have the following data:
declare @data table(
thedate datetime,
val1 integer)
insert into @data values ('2006/12/21',3500)
insert into @data values ('2006/12/22',3500)
insert into @data values ('2006/12/23',3500)
insert into @data values ('2006/12/24',3500)
insert into @data values ('2006/12/25',3500)
insert into @data values ('2006/12/26',3500)
insert into @data values ('2006/12/27',3500)
[font="Courier New"]
I need to return the result below, where calc is:
1. 0 for the first row
2. the current row's val1 plus the previous row's calc value
thedate       val1    calc
2006/12/21    3500        0
2006/12/22    3500     3500
2006/12/23    3500     7000
2006/12/24    3500    10500
2006/12/25    3500    14000
2006/12/26    3500    17500
2006/12/27    3500    21000
Any ideas welcome.
Update: I forgot to state that this is not a running total issue; it's a contrived example of a much more complex financial calculation.
Cheers
May 28, 2010 at 3:16 am
This sort of problem is called a "Running Total". All you need is to go through one of the finest articles describing what running totals are and how to tackle them without cursors or while loops; it was written by MVP Jeff Moden.
Link: Solving the "Running Total" & "Ordinal Rank" Problems (Rewritten)
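For the simple additive case in the sample data above (though, per the update, not necessarily the real calculation), a windowed SUM is another set-based option; note that it needs SQL Server 2012 or later, and the query below is only a minimal sketch against the @data table from the first post:
-- A minimal sketch, assuming SQL Server 2012+ (windowed SUM with a frame) and
-- the @data table declared in the first post. Because calc is defined as 0 on
-- the first row, the running sum is shifted by subtracting the first val1.
SELECT  thedate,
        val1,
        SUM(val1) OVER (ORDER BY thedate ROWS UNBOUNDED PRECEDING)
            - FIRST_VALUE(val1) OVER (ORDER BY thedate) AS calc
FROM    @data
ORDER BY thedate;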
Hope that helps you!
May 28, 2010 at 3:21 am
I forgot to state that this is not a running total issue; it's a contrived example of a much more complex financial calculation.
May 28, 2010 at 3:55 am
You can do this using a recursive CTE:
; WITH cte_data AS
(
    SELECT ROW_NUMBER() OVER ( ORDER BY thedate ) Row, *
    FROM   @data
), cte_final_data AS
(
    SELECT Row, thedate, val1, 0 calc
    FROM   cte_data
    WHERE  Row = 1
    UNION ALL
    SELECT data.Row, data.thedate, data.val1, final.calc + data.val1
    FROM   cte_data data
    INNER JOIN cte_final_data final ON data.Row = final.Row + 1
)
SELECT thedate, val1, calc
FROM   cte_final_data
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
May 28, 2010 at 4:01 am
Thanks "SSC Eights!" but this wont for larger datasets due the heavy recursion.
May 28, 2010 at 4:27 am
Add the MAXRECURSION option
SELECT thedate, val1, calc
FROM cte_final_data
OPTION ( MAXRECURSION 0 )
This will work for larger datasets (0 means no recursion limit here). But yes, I have no idea how this will affect performance.
How to post data/code on a forum to get the best help - Jeff Moden
http://www.sqlservercentral.com/articles/Best+Practices/61537/
May 28, 2010 at 6:01 am
Unfortunately performance is not great. If I bump up the test case to 3K rows, it's 1 second for the cursor-based approach and 15 seconds for the recursive CTE.
May 29, 2010 at 8:36 am
With the best will in the world, it's impossible to suggest an appropriate set-based solution without knowing more about what the complex financial calculation does. I realise you had to simplify things for a forum question, but those simplifications have removed the core of the problem on this occasion.
This seems to be a variant of the 'running totals' problem since the value for row n depends partly on the value in row (n-1). A more efficient set-based solution can usually be found in this class of problem, but it does depend on the details. Also, any efficient set-based solution will likely require a rewrite of whatever function or stored procedure currently performs the required magic.
Whether it is worth pursuing a set-based alternative, depends heavily on the characteristics of the existing method. Is it running unacceptably slowly at the moment? Is this slowness definitely caused by an existing cursor or loop?
Often, an improvement to the algorithm, indexing, or data structures can provide more of a benefit than simply replacing a loop with set-based code. So... it depends 🙂
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
May 30, 2010 at 1:30 am
The real-world calculation is Cumulative Performance; I can post the formula on Monday. The only complexity for me lies in the fact that I need the previous row's value for the same column, and that value is itself a calculation. I don't think you can really re-engineer this: you basically loop through an ordered result set, the first row is always zero, and thereafter the value is a calculation based on other values in the same row plus the previous result of the calculation. The tricky part is that you need the previous value of a calculation for the same column. If you needed another column's previous value, or the previous value of the same column where that value was static, this would be a trivial task.
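To make the shape of that loop concrete, here is a minimal cursor sketch; the CASE expression is only a stand-in for the real Cumulative Performance formula, and it assumes the @data table also has a calc column and a unique thedate:
-- Minimal sketch of the ordered loop described above. The formula inside the
-- CASE is a placeholder; the real calculation would combine other columns in
-- the current row with @prev_calc (the previous row's calc value).
DECLARE @thedate   datetime,
        @val1      integer,
        @calc      integer,
        @prev_calc integer;   -- NULL until the first row has been processed

DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT thedate, val1
    FROM   @data
    ORDER BY thedate;

OPEN c;
FETCH NEXT FROM c INTO @thedate, @val1;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @calc = CASE
                    WHEN @prev_calc IS NULL THEN 0   -- first row is always zero
                    ELSE @val1 + @prev_calc          -- placeholder for the real formula
                END;

    UPDATE @data SET calc = @calc WHERE thedate = @thedate;  -- assumes thedate is unique

    SET @prev_calc = @calc;
    FETCH NEXT FROM c INTO @thedate, @val1;
END;

CLOSE c;
DEALLOCATE c;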
May 30, 2010 at 2:10 am
Well that all sounds very much like a problem that can be solved and optimised with tweaked 'running totals' code or a custom aggregate.
You don't need to post the exact formula or data, just something representative of the problem you face, in a way that is easy to work on.
Please also explain any non-obvious terms, for example "Cumulative Performance" means nothing special to me 🙂
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
May 31, 2010 at 7:59 am
ColdCoffee was on the money: the "quirky update" method worked a treat. Thanks ColdCoffee!
Here is the complete solution:
declare @calc integer
declare @data table(
thedate datetime,
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
;with results
as (select top 2147483647 thedate,
val1,
calc
from @data
order by thedate)
update results
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end
output inserted.thedate,
inserted.val1,
inserted.calc
NB: I'm not sure whether the TOP and ORDER BY are enough to guarantee the correct order; if someone can comment on that, it would be great.
Cheers
May 31, 2010 at 11:17 am
all4miller (5/31/2010)
NB: I'm not sure whether the TOP and ORDER BY are enough to guarantee the correct order; if someone can comment on that, it would be great.
Heh... I'm thinking that you didn't actually read the article ColdCoffee provided a link to. Read it and then come back and tell us what essential part that code is missing.
The reason I'm being such a "jerk" about this is that this technique is very controversial, and rightly so. If you do it wrong, you could seriously damage some data. The only way to learn the right way to do it is to read the article to achieve a deep understanding of the rules, and then be able to recite the rules from memory. OSHA has declared that if one more person rides my hiney about the technique, I need to install hand rails on my butt. 😛 Either study the article so you can always do it right or don't use the technique... I don't need someone getting it wrong. 😉
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 12:07 pm
Thanks for commenting Jeff. I actually got my result from The ‘Subscription List’ SQL Problem, but since the part I used is essentially the same as your post, I thought ColdCoffee deserved the credit.
I re-read your article, especially the rules section, and I guess the stuff I'm missing is stuff that would be implicitly done by SQL Server behind the scenes... or not... I have only been using SQL Server for 5 months 🙂
Please let me know what I missed
The RULES
1. CLUSTERED INDEX MUST BE PRESENT IN THE CORRECT ORDER: you cannot create a clustered index on a table variable
3. DON'T WORK AGAINST PARTITIONED STRUCTURES: not an issue in this case
4. USE THE TABLOCKX HINT: If you're updating a local Temp Table, only the current session has access to that Temp Table which means that the TABLOCKX hint really isn't required. also not an issue in this case
5. DO NOT USE JOINS: not an issue in this case
6. YOU MUST HAVE AN "ANCHOR" COLUMN: I think this could be the missing factor, yes?
7. DO NOT USE ORDER BY: you cannot have a clustered index on a table variable so I had to use an order by
8. DO NOT USE INDEX HINTS TO TRY TO FORCE ORDER: not an issue in this case
So I think it's either the missing anchor column, or basically table variables just won't work?
Cheers
May 31, 2010 at 2:33 pm
all4miller (5/31/2010)
Thanks for commenting Jeff. I actually got my result from The ‘Subscription List’ SQL Problem, but since the part I used is essentially the same as your post, I thought ColdCoffee deserved the credit. I re-read your article, especially the rules section, and I guess the stuff I'm missing is stuff that would be implicitly done by SQL Server behind the scenes... or not... I have only been using SQL Server for 5 months 🙂
Please let me know what I missed
The RULES
1. CLUSTERED INDEX MUST BE PRESENT IN THE CORRECT ORDER: you cannot create a clustered index on a table variable
3. DON'T WORK AGAINST PARTITIONED STRUCTURES: not an issue in this case
4. USE THE TABLOCKX HINT: If you're updating a local Temp Table, only the current session has access to that Temp Table which means that the TABLOCKX hint really isn't required. also not an issue in this case
5. DO NOT USE JOINS: not an issue in this case
6. YOU MUST HAVE AN "ANCHOR" COLUMN: I think this could be the missing factor, yes?
7. DO NOT USE ORDER BY: you cannot have a clustered index on a table variable so I had to use an order by
8. DO NOT USE INDEX HINTS TO TRY TO FORCE ORDER: not an issue in this case
So I think it's either the missing anchor column, or basically table variables just won't work?
Cheers
Heh... oddly enough, the post above left out the rule of concern here...
[font="Arial Black"]2. PARALLELISM MUST BE PREVENTED:[/font] You MUST prevent parallelism from occurring. Therefore, you MUST include OPTION (MAXDOP 1) in any such code.
So, if you make that small tweak to your original code, you'll be golden... well, except for the fact that I don't yet trust the "order by" implied by the ROW_NUMBER() because I've had it break on a slightly different problem. It's probably OK here, though, but only because the table variable was created as a heap...
declare @calc integer
declare @data table(
thedate datetime,
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
;with results
as (select top 2147483647 thedate,
val1,
calc
from @data
order by thedate)
update results
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end
output inserted.thedate, --this is a reasonable substitute for an "anchor" column but I'd still take the time to include one
inserted.val1,
inserted.calc
OPTION (MAXDOP 1) --absolutely required
Also, you state that you can't put a clustered index on a table variable... any bets there? 😉
declare @calc integer,
@TheDate DATETIME
declare @data table(
thedate datetime PRIMARY KEY CLUSTERED, --Look! Clustered index gets made by this
val1 integer,
calc integer)
insert into @data (thedate, val1) values ('2006/12/21',3500)
insert into @data (thedate, val1) values ('2006/12/22',3500)
insert into @data (thedate, val1) values ('2006/12/23',3500)
insert into @data (thedate, val1) values ('2006/12/24',3500)
insert into @data (thedate, val1) values ('2006/12/25',3500)
insert into @data (thedate, val1) values ('2006/12/26',3500)
insert into @data (thedate, val1) values ('2006/12/27',3500)
update @data
set @calc = calc = case
when @calc is null then 0
else val1 + @calc
end,
@TheDate = TheDate --Anchor, a much better guarantee than an OUTPUT statement.
output inserted.thedate,
inserted.val1,
inserted.calc
FROM @data --Can't put a WITH(TABLOCKX) on a table variable but would do if this was a temp table
OPTION (MAXDOP 1) --absolutely required
Either way is fine, but both absolutely must have the OPTION (MAXDOP 1) code to prevent parallelism. Yeah, I know... lots of folks are going to come back and say the sorted ROW_NUMBER() makes it so it doesn't matter if parallelism occurs... I have some code at work on another method (i.e., not the quirky update) where ROW_NUMBER() didn't guarantee the order until I added MAXDOP 1. I'll try to remember to post it here. In the meantime, don't take the chance, because I've seen parallelism destroy the order of ROW_NUMBER() before. In fact, I'll trust the update order based on a clustered index long before I trust the order of ROW_NUMBER() during an update.
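For reference, here is a minimal sketch of the temp-table variant alluded to in the comments above, where both the clustered index and the WITH (TABLOCKX) hint can actually be applied; it follows the rules quoted earlier but is an illustration, not code from the article:
-- Sketch only: the quirky update against a temp table, so WITH (TABLOCKX) and a
-- clustered index in the update order can both be used (rules 1 and 4).
CREATE TABLE #data
(
    thedate datetime PRIMARY KEY CLUSTERED,  -- clustered index in the correct order
    val1    integer,
    calc    integer
);

INSERT INTO #data (thedate, val1) VALUES ('2006/12/21', 3500);
INSERT INTO #data (thedate, val1) VALUES ('2006/12/22', 3500);
INSERT INTO #data (thedate, val1) VALUES ('2006/12/23', 3500);  -- remaining sample rows as in the posts above

DECLARE @calc    integer,
        @TheDate datetime;

UPDATE #data WITH (TABLOCKX)                 -- rule 4: exclusive table lock
SET    @calc    = calc = CASE
                             WHEN @calc IS NULL THEN 0
                             ELSE val1 + @calc
                         END,
       @TheDate = thedate                    -- rule 6: anchor column
OPTION (MAXDOP 1);                           -- rule 2: prevent parallelism

SELECT thedate, val1, calc
FROM   #data
ORDER BY thedate;

DROP TABLE #data;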
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2010 at 2:52 pm
Sweet thanks Jeff! 🙂