Calculating running totals for partitions of data

Question

Post reply

Calculating running totals for partitions of data

kdejan87

Right there with Babe

Points: 737
More actions
September 23, 2015 at 7:46 am

#320853

I have table named #t1 and three columns. CODE, Column1, and Column2.
create table #t1 (
CODE NVARCHAR(20),
COLUMN1 NUMERIC(18,2),
COLUMN2 NUMERIC(18,2)
)
And i have some data:
INSERT INTO #t1 (CODE,COLUMN1,COLUMN2)
VALUES ('432', 0,100),
('TOTAL FOR 432',0,100),
('4320001',0,250),
('4320001',50,0),
('4320001',0,140),
('4320001',300,0),
('TOTAL FOR 4320001',350,390),
('432002',200,0),
('432002',0,100),
('TOTAL FOR 432002',200,100)
drop table #t1
I want to have 4 column (named BALANCE). Balance must be column that represent running totals between two columns (Column1 - Column2) for each group of data. For each group total must start from zero.
Output:
Take a look picture!
After total 432 it starts to count again for total 4320001 and again for total 432002..... How can i get this result?
I'm using MS SQL SERVER 2014

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

higgim SSCarpal Tunnel Points: 4090 More actions · Answer 1

We don't use SQL2014 where I work but from a quick search the following link may be of use

https://msdn.microsoft.com/en-us/library/hh231256.aspx

That said, I think based on your test data the "Total For" row may cause an issue.

That said below is a quick attempt at it in 2008 using a self join and no doubt someone will come along with a much more efficient way to do it 😉

The row count and rank part was the only way I could think to not have Total For included and not affect the running total and ensure it always came at the end of the data for the group.

DECLARE @SomeData TABLE

(

CODE NVARCHAR(20) ,

COLUMN1 NUMERIC(18, 2) ,

COLUMN2 NUMERIC(18, 2)

)

INSERT INTO @SomeData

( CODE, COLUMN1, COLUMN2 )

VALUES ( N'432', 0, 100 ),

( 'TOTAL FOR 432', 0, 100 ),

( N'4320001', 0, 250 ),

( N'4320001', 50, 0 ),

( N'4320001', 0, 140 ),

( N'4320001', 300, 0 ),

( 'TOTAL FOR 4320001', 350, 390 ),

( '432002', 200, 0 ),

( '432002', 0, 100 ),

( 'TOTAL FOR 432002', 200, 100 );

WITH DataRanked

AS ( SELECT sd.CODE ,

sd.COLUMN1 ,

sd.COLUMN2 ,

sd.COLUMN1 - sd.column2 AS PreTotal ,

ROW_NUMBER() OVER ( PARTITION BY sd.CODE ORDER BY sd.CODE ) AS RCount ,

RANK() OVER ( PARTITION BY REPLACE(sd.CODE,

'TOTAL FOR ', '') ORDER BY sd.CODE ) AS FinalSorter

FROM @SomeData AS sd

)

SELECT b.CODE ,

b.COLUMN1 ,

b.COLUMN2 ,

SUM(ISNULL(b2.PreTotal, 0)) AS RTotal

FROM DataRanked b

LEFT OUTER JOIN DataRanked b2 ON b.CODE = b2.CODE

AND b2.RCount <= b.RCount

GROUP BY b.CODE ,

b.RCount ,

b.FinalSorter ,

b.COLUMN1 ,

b.COLUMN2

ORDER BY REPLACE(b.CODE, 'TOTAL FOR ', '') ,

b.FinalSorter ,

b.RCount

Steve Jones - SSC Editor SSC Guru Points: 736162 More actions · Answer 2

You can use an OVER() to help here, but you have no ordering, so how can you get a running total? How do you know that ( N'4320001', 0, 140 ) comes before ( N'4320001', 300, 0 )? The order of insert has zero impact on how the rows are stored.

higgim SSCarpal Tunnel Points: 4090 More actions · Answer 3

I did notice no ordering on the data in the table but failed to mention it. My bad.

kdejan87 Right there with Babe Points: 737 More actions · Answer 4

What do you think about this:

DECLARE @t1 TABLE (

CODE nvarchar(20),

COLUMN1 numeric(18, 2),

COLUMN2 numeric(18, 2)

)

INSERT INTO @t1 (CODE, COLUMN1, COLUMN2)

VALUES ('432', 0, 100),

('TOTAL FOR 432', 0, 100),

('4320001', 0, 250),

('4320001', 50, 0),

('4320001', 0, 140),

('4320001', 300, 0),

('TOTAL FOR 4320001', 350, 390),

('432002', 200, 0),

('432002', 0, 100),

('TOTAL FOR 432002', 200, 100)

-- CTE

;

WITH CTE

AS (SELECT

*,

ROW_NUMBER() OVER (ORDER BY @@rowcount) RowNum -- Order as per the input

FROM @t1)

-- get from CTE

SELECT

*,

SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY RowNum) AS Balance --sum using PARTITION

FROM CTE

ORDER BY RowNum

WayneS SSC Guru Points: 95460 More actions · Answer 5

Here is a method that is usually faster.

But first, here is the data, adding a few columns and not inserting those total rows:

IF OBJECT_ID('tempdb.dbo.#t1') IS NOT NULL DROP TABLE #t1;

CREATE table #t1 (

RowID INTEGER IDENTITY PRIMARY KEY CLUSTERED,

CODE NVARCHAR(20),

COLUMN1 NUMERIC(18,2),

COLUMN2 NUMERIC(18,2),

RunningTotal NUMERIC(18,2)

);

INSERT INTO #t1 (CODE,COLUMN1,COLUMN2)

VALUES ('432', 0,100),

--('TOTAL FOR 432',0,100),

('4320001',0,250),

('4320001',50,0),

('4320001',0,140),

('4320001',300,0),

--('TOTAL FOR 4320001',350,390),

('432002',200,0),

('432002',0,100)--,

--('TOTAL FOR 432002',200,100);

This next part does a "Quirky Update". See remarks within the code.

-- this part using a "Quirky Update" - the fastest way to do running totals

-- but, this has a bunch of rules that need to be followed.

-- these rules can be found at http://www.sqlservercentral.com/articles/T-SQL/68467/

DECLARE @Sequence INTEGER, -- for the safety check

@RowID INTEGER, -- for the anchor column

@Code NVARCHAR(20),

@Balance NUMERIC(18,2);

SET @Sequence = 0;

SET @Code = N'';

WITH SafetyCheck AS

(

SELECT RowID,

CODE,

COLUMN1,

COLUMN2,

LineTotal = COLUMN1 - COLUMN2,

RunningTotal,

[Sequence] = ROW_NUMBER() OVER (ORDER BY RowID)

FROM #t1

)

UPDATE sc

SET @RowID = sc.RowID,

@Balance = CASE WHEN @Code = Code THEN @Balance + LineTotal ELSE LineTotal END, -- reset balance when code changes, otherwise accumulate the running total

RunningTotal = @Balance,

@Code = Code,

@Sequence = CASE WHEN @Sequence+1 = Sequence THEN @Sequence + 1 ELSE 1/0 END -- Safety Check aborts if not being calculated in the proper order

FROM SafetyCheck sc

SELECT *

FROM #t1;

Here is what I would use on SQL Server 2012 or higher. It's simpler, almost as fast, and doesn't have all of those undocumented (by MS) rules to make it work:

-- SQL SERVER 2012 version:

SELECT RowID,

CODE,

COLUMN1,

COLUMN2,

RunningTotal = SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY CODE ORDER BY RowID)

FROM #t1

ORDER BY RowID;

Finally:

IF OBJECT_ID('tempdb.dbo.#t1') IS NOT NULL DROP TABLE #t1;

Both of these methods produce the following results:

RowID CODE COLUMN1 COLUMN2 RunningTotal

----- ------- ------- ------- ------------

1 432 0.00 100.00 -100.00

2 4320001 0.00 250.00 -250.00

3 4320001 50.00 0.00 -200.00

4 4320001 0.00 140.00 -340.00

5 4320001 300.00 0.00 -40.00

6 432002 200.00 0.00 200.00

7 432002 0.00 100.00 100.00

Does this look like what you want?

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

kdejan87 Right there with Babe Points: 737 More actions · Answer 6

kdejan87

Right there with Babe

Points: 737

September 23, 2015 at 2:15 pm

#1828946

View fourth answer and tell me your opinion

Alan Burstein SSC Guru Points: 61141 More actions · Answer 7

kdejan87 (9/23/2015)
What do you think about this:
DECLARE @t1 TABLE (
CODE nvarchar(20),
COLUMN1 numeric(18, 2),
COLUMN2 numeric(18, 2)
)
INSERT INTO @t1 (CODE, COLUMN1, COLUMN2)
VALUES ('432', 0, 100),
('TOTAL FOR 432', 0, 100),
('4320001', 0, 250),
('4320001', 50, 0),
('4320001', 0, 140),
('4320001', 300, 0),
('TOTAL FOR 4320001', 350, 390),
('432002', 200, 0),
('432002', 0, 100),
('TOTAL FOR 432002', 200, 100)
-- CTE
;
WITH CTE
AS (SELECT
*,
ROW_NUMBER() OVER (ORDER BY @@rowcount) RowNum -- Order as per the input
FROM @t1)
-- get from CTE
SELECT
*,
SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY RowNum) AS Balance --sum using PARTITION
FROM CTE
ORDER BY RowNum

Don't do that! SUM OVER will crush your performance. Even if you remove the ORDER BY at the end of your query (which is not required) you'll see that the query plan produces an ugly sort. With the ORDER BY you get and additional SORT.

Below is the sample data I was working with. Like Steve Suggested and Wayne did, I added a ROWID so we have something to sort by.

-- Added ROWID so that we have something to sort by

-- made it a primary key so that the query plan does not produce a sort

IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1

GO

CREATE TABLE #t1

(

ROWID int identity primary key,

CODE NVARCHAR(20),

COLUMN1 NUMERIC(18,2),

COLUMN2 NUMERIC(18,2)

)

INSERT INTO #t1 (CODE,COLUMN1,COLUMN2)

VALUES ('432', 0,100),

('TOTAL FOR 432',0,100),

('4320001',0,250),

('4320001',50,0),

('4320001',0,140),

('4320001',300,0),

('TOTAL FOR 4320001',350,390),

('432002',200,0),

('432002',0,100),

('TOTAL FOR 432002',200,100);

Below is your solution (a simplified version that produces the same query plan as what you posted). I included two correlated subqueries; one with an ORDER BY and one with out. If you run this you'll see a few things:

SET NOCOUNT ON;

SET STATISTICS IO ON;

PRINT 'QUERY 1: SUM + PARTITION';

SELECT

*,

SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY ROWID) AS Balance --sum using PARTITION

FROM #t1;

PRINT 'QUERY 2: SUM + PARTITION + ORDER BY ';

SELECT

*,

SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY ROWID) AS Balance --sum using PARTITION

FROM #t1

ORDER BY ROWID -- slows things down...

PRINT 'QUERY 3: CORRELATED SUBQUERY'

SELECT *, (SELECT SUM(COLUMN1 - COLUMN2) FROM #t1 t2 WHERE t1.ROWID=t2.ROWID AND T1.CODE=t2.CODE)

FROM #t1 t1

PRINT 'QUERY 4: CORRELATED SUBQUERY with sort'

SELECT *, (SELECT SUM(COLUMN1 - COLUMN2) FROM #t1 t2 WHERE t1.ROWID=t2.ROWID AND T1.CODE=t2.CODE)

FROM #t1 t1

ORDER BY ROWID;

SET STATISTICS IO OFF;

First, SUM OVER requires a sort (two with the ORDER BY statement) even though the ORDER BY is sorting on the clustered index. The two correlated subqueries make better use of the index. Plus they work on any version of SQL Server.

Correlated subqueries are NOT the way to go. Wayne showed a couple better solution. I just used them to show how inefficient SUM OVER is.

Lastly (this is the reason for my edit)... You'll also notice that I turned STATISTICS IO ON. If you examine the statistics for my four queries I posted you'll notice that the SUM OVER solutions both scan #t1 16 times (and that number will keep going up the more rows you add). The correlated subqueries scan the table twice, once for the outer query, once for the inner query. ROW OVER has nearly three times as many reads and that number will keep going up as you add rows.

In summary, Correlated subquery = bad; SUM OVER = very, very bad.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Alan Burstein SSC Guru Points: 61141 More actions · Answer 8

@Wayne

Was this your intended 2012 solution:

-- SQL SERVER 2012 version:

SET STATISTICS IO ON;

SELECT

RowID,

CODE,

COLUMN1,

COLUMN2,

RunningTotal =

SUM(COLUMN1 - COLUMN2) OVER

(

PARTITION BY CODE ORDER BY RowID

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

)

FROM #t1

ORDER BY RowID;

@kdejan87 - using 2012 framing (ROWS BETWEEN UNBOUNDED....) makes SUM OVER okay... Referring back to my queries with Statistics IO on, this solution blows the correlated subquery away

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#t1_________________________________________________________________________________________________________________00000000000D'.

Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Jeff Moden SSC Guru Points: 1004313 More actions · Answer 9

kdejan87 (9/23/2015)
View fourth answer and tell me your opinion

[font="Arial Black"]ALL [/font]the answers are incorrect because there is no column that enforces the order of the data in the original data. Adding a ROWID won't help to do it right. If either in incrementing number or a transaction date/time is not available in the original data, all answers will continue to be incorrect. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004313 More actions · Answer 10

kdejan87,

Does the table have any kind of column(s) that would indicate the correct order of the transactions?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

WayneS SSC Guru Points: 95460 More actions · Answer 11

Alan.B (9/23/2015)
@Wayne
Was this your intended 2012 solution:
-- SQL SERVER 2012 version:
SET STATISTICS IO ON;
SELECT
RowID,
CODE,
COLUMN1,
COLUMN2,
RunningTotal =
SUM(COLUMN1 - COLUMN2) OVER
(
PARTITION BY CODE ORDER BY RowID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
FROM #t1
ORDER BY RowID;
@kdejan87 - using 2012 framing (ROWS BETWEEN UNBOUNDED....) makes SUM OVER okay... Referring back to my queries with Statistics IO on, this solution blows the correlated subquery away
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#t1_________________________________________________________________________________________________________________00000000000D'.
Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Alan,

good catch, I did forget the ROWS clause.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

Steve Jones - SSC Editor SSC Guru Points: 736162 More actions · Answer 12

Jeff Moden (9/23/2015)
kdejan87 (9/23/2015)
View fourth answer and tell me your opinion
[font="Arial Black"]ALL [/font]the answers are incorrect because there is no column that enforces the order of the data in the original data. Adding a ROWID won't help to do it right. If either in incrementing number or a transaction date/time is not available in the original data, all answers will continue to be incorrect. 😉

+1. This.