May 30, 2015 at 9:04 am
TheSQLGuru (5/30/2015)
Jason, I don't think we can call your test apples-to-apples, at least not "real-world" anyway. At least 95% of the real-world running totals cases I have ever seen are a) partitioned by one or more columns, b) do not have any index that would help either method, and c) are output queries, not updates of a column on a table. Also, IIRC, there are something like 6 rules that MUST be adhered to when doing the Quirky Update, and I think you only have one of them (the clustered index). Whenever you get around to your next test, can you construct it along those lines please? No worries if you are like me and too busy to get it set up! 🙂
Part of the problem with Aaron's good article is that he didn't follow the rules either. For example, he allows parallelism to occur and that 1) doesn't follow the rules and 2) could make quite the change in performance. As we all know, parallelism does sometimes cause things to run more slowly depending on the nature of the task at hand. Wayne Sheffield did some testing when 2012 was in CTP and the QU smoked even the faster of the newer methods.
I'm up to my eyes in 3 major projects for work and probably won't be able to do my own testing on this over the weekend but, because of the disparity in findings between Wayne's previous testing and Aaron's tests, you can just bet that I'm going to test it.
In the meantime, and because of the disparity between the methods, the fact that Wayne's testing was pre-RTM, and the other problems with the current testing (rules not followed, for example), I wouldn't be making any statements as to which one is currently faster.
Also, on the subject of indexes that would help any method, that's why I almost always do the QU on a copy of the data in a Temp Table. When I get the chance, I'll try it both ways.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 30, 2015 at 10:42 am
TheSQLGuru (5/30/2015)
Jason, I don't think we can call your test apples-to-apples, at least not "real-world" anyway. At least 95% of the real-world running totals cases I have ever seen are a) partitioned by one or more columns, b) do not have any index that would help either method, and c) are output queries, not updates of a column on a table. Also, IIRC, there are something like 6 rules that MUST be adhered to when doing the Quirky Update, and I think you only have one of them (the clustered index). Whenever you get around to your next test, can you construct it along those lines please? No worries if you are like me and too busy to get it set up! 🙂
I both agree and disagree with your comments... I do agree that the vast majority of real world cases do require partitioning, that existing indexes may or may not be ideal for any given task, and, finally, that we're far more likely to output the RT column in a SELECT statement than to persist the RT with an UPDATE statement.
But... Here's the thing... If any one of those points is an actual requirement, the quirky method isn't capable of meeting it... It's not even an option, and therefore, there's nothing to test...
So... If we want to pit the quirky method against a windowed function method (or any other method), we have to base the testing on a hypothetical scenario (no matter how unlikely it is to exist in our actual day to day workloads) where the quirky method actually would/could work.
When I called it an "apples to apples" comparison, it was within that context...
As far as " 6 rules that MUST be adhered to"... I'm assuming that you're referring to Jeff Moden's article, "Solving the Running Total and Ordinal Rank Problems (Rewritten)"...
I just pulled up the article and it looks like the list has been updated to 10 rules... (or I'm simply wrong about the 6 rules you are referring to... )
1. CLUSTERED INDEX MUST BE PRESENT IN THE CORRECT ORDER: Check... That's covered by the pk_RT1 primary key.
2. PARALLELISM MUST BE PREVENTED: Good call... My test did not include "OPTION(MAXDOP 1)". The test plan didn't generate parallelism w/o the hint in this instance, but I will be sure to include it in any future tests.
3. DON'T WORK AGAINST PARTITIONED STRUCTURES: Check... As noted earlier, there was no attempt to partition anything.
4. USE THE TABLOCKX HINT: Once again... Guilty... 🙁 Although, I think that the fact that I ran the test on my local workstation instance qualifies me for the #Temp table exemption... 😉 In any case, I will also include the TABLOCKX hint in future tests.
5. DO NOT USE JOINS: Check... No joins.
6. YOU MUST HAVE AN "ANCHOR" COLUMN: I hadn't actually considered using this before. I'm not offering excuses... It just never occurred to me. Putting this on the "things to include next time" list...
7. DO NOT USE ORDER BY: Check...
8. DO NOT USE INDEX HINTS TO TRY TO FORCE ORDER: Check...
9. GET IT RIGHT: I clearly did miss a few of the rules, but I did verify that both methods were generating correct results.
10. TEST! ... 😀
I do plan on getting back into this, this weekend (cutting grass and cleaning the kitchen need to come first) and I'm wide open to suggestions.
At this point, the plan is to rerun the previous test, this time observing rules 2, 4 & 6...
Kevin and/or Jeff - If either of you (or anyone else) notice any other problems or have specific suggestions, please let me know. I don't pretend to be at either of your levels, and any input is welcome and appreciated. Especially as it pertains to valid testing procedures and measuring methods.
Also... Jeff - I "borrowed" your "10 Rules" w/o your permission. If that's a NO-NO, please let me know and I'll update the post to remove them immediately.
May 30, 2015 at 11:53 am
Jason A. Long (5/30/2015)
I both agree and disagree with your comments... I do agree that the vast majority of real world cases do require partitioning, that existing indexes may or may not be ideal for any given task, and, finally, that we're far more likely to output the RT column in a SELECT statement than to persist the RT with an UPDATE statement. But... Here's the thing... If any one of those points is an actual requirement, the quirky method isn't capable of meeting it... It's not even an option, and therefore, there's nothing to test...
Ah... be careful now. I don't believe Kevin is talking about "partitioning" as in "table partitioning". I believe he means that there's some horizontal logical partitioning that's usually required, where the running total would need to be reset when, for example, an account number changes. The Quirky Update is very capable of that.
It's also capable of handling partitioned tables but usually not directly because most partition tables are based on a temporal column rather than on a combination of an account and temporal column.
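For example, the reset portion looks something like this... Just a quick sketch against a hypothetical #Trans table with a clustered index on (AccountID, TransactionID), following the same rules as always:
DECLARE
@RunningTotal NUMERIC(19,4) = 0,
@PrevAccount INT = -1,
@Anchor INT;
UPDATE t SET
@Anchor = t.TransactionID, -- anchor on the clustered index
@RunningTotal = t.RunningTotal = CASE WHEN t.AccountID = @PrevAccount THEN @RunningTotal + t.Amount ELSE t.Amount END, -- reset when the account changes
@PrevAccount = t.AccountID
FROM #Trans t WITH (TABLOCKX)
OPTION (MAXDOP 1);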
No problem with "borrowing" the rules. It's called "surgical citing". 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
May 30, 2015 at 11:55 am
Also, the TABLOCKX hint is valuable even on Temp Tables. It prevents the need for any lock escalation, which can take a bit of extra time depending on other factors. At the very least, it prevents the memory usage involved with taking multiple locks.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 30, 2015 at 1:41 pm
Jeff Moden (5/30/2015)
Ah... be careful now. I don't believe Kevin is talking about "partitioning" as in "table partitioning". I believe he means that there's some horizontal logical partitioning that's usually required, where the running total would need to be reset when, for example, an account number changes. The Quirky Update is very capable of that.
I stand corrected... I was referring to partitioning in the "horizontal logic" context. For some reason I had it in my head that quirky wouldn't allow the data to be partitioned... Now that I'm actually thinking about it, a "partition" variable could be used to detect changes in the partitioning column. I'll go back and reread your article to see how you handle it... As always, thank you for setting me straight. 😀
It's also capable of handling partitioned tables but usually not directly because most partition tables are based on a temporal column rather than on a combination of an account and temporal column.
None of our databases have partitioned tables so I literally have 0 experience using them. When the time comes, I'm sure I'll be here with plenty of questions.
No problem with "borrowing" the rules. It's called "surgical citing". 😀
Thank you. 🙂
May 30, 2015 at 6:55 pm
Anyone see any flaws in this version? I think I've got the bases covered, but I'd feel better with a 2nd set of eyes...
/* ================================================================
Create a simple 1M test table with a single key column clustered index
================================================================ */
IF OBJECT_ID('dbo.RunningTotal_1', 'U') IS NOT NULL
DROP TABLE dbo.RunningTotal_1;
WITH n (n) AS (-- creates a million row tally table
SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)
), Tally (n) AS (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM n n1, n n2, n n3, n n4, n n5, n n6
)
SELECT
ISNULL(t.n, 0) AS TransactionID,
ISNULL(DATEADD(mi, t.n, '2010-01-01'), '1900-01-01') AS DateTimeStamp,
CAST(ISNULL(DATEADD(mi, t.n, '2010-01-01'), '1900-01-01') AS DATE) AS TransactionDay,
CAST(1.0 AS NUMERIC(19,4)) AS TransactionAmount,
CAST(NULL AS NUMERIC(19,4)) AS RunningTotal
INTO dbo.RunningTotal_1
FROM
Tally t;
--==========================================================================
-- Add a clustered index for the "Quirky" update --
ALTER TABLE dbo.RunningTotal_1 ADD CONSTRAINT pk_RT1 PRIMARY KEY CLUSTERED (TransactionID);
--==========================================================================
-- Quirky Update --
DECLARE
@rt NUMERIC(19,4) = 0,
@tm DATE = '1900-01-01',
@anchor INT = 0;
UPDATE rt1 SET
@anchor = rt1.TransactionID,
@rt = rt1.RunningTotal = CASE WHEN rt1.TransactionDay = @tm THEN @rt + rt1.TransactionAmount ELSE rt1.TransactionAmount END,
@tm = rt1.TransactionDay
FROM
dbo.RunningTotal_1 rt1 WITH (TABLOCKX)
OPTION(MAXDOP 1);
--==========================================================================
-- Add a POC index for the Windowed Update (Do not create until the "Quirky" test is complete...) --
CREATE UNIQUE NONCLUSTERED INDEX ix_RunningTotal1_TransactionDay_DateTimeStamp ON dbo.RunningTotal_1 (
TransactionDay,
DateTimeStamp
)
INCLUDE (
TransactionAmount,
RunningTotal
)
WITH (DATA_COMPRESSION = PAGE, FILLFACTOR = 78); -- updating RunningTotal from NULL was causing 99% fragmentation; FF=78 allows the update w/o experiencing the fragmentation
--==========================================================================
-- Windowed Update --
UPDATE x SET x.RunningTotal = x.rt
FROM (
SELECT
rt1.RunningTotal,
SUM(rt1.TransactionAmount) OVER (PARTITION BY rt1.TransactionDay ORDER BY rt1.DateTimeStamp ROWS UNBOUNDED PRECEDING) rt
FROM
dbo.RunningTotal_1 rt1
) x;
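For rule #9, here's a quick sanity check (run between the quirky update and the windowed update, and not part of any timing) to confirm the persisted quirky totals match a freshly computed windowed sum... Zero rows back means the two methods agree:
WITH w AS (
SELECT
TransactionID,
SUM(TransactionAmount) OVER (PARTITION BY TransactionDay ORDER BY DateTimeStamp ROWS UNBOUNDED PRECEDING) AS WindowedRT
FROM dbo.RunningTotal_1
)
SELECT rt1.TransactionID, rt1.RunningTotal, w.WindowedRT
FROM dbo.RunningTotal_1 rt1
JOIN w ON w.TransactionID = rt1.TransactionID
WHERE rt1.RunningTotal <> w.WindowedRT;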
Yea... Nay... Other???
May 31, 2015 at 8:07 am
Several things:
1) Main one is that you are putting NULL into the running total column to start. You KNOW you are creating fragmentation with page splits. So why didn't you just populate it with the appropriate default of 0.00? 🙂 Now both indexes can be 100%.
2) Why are you using page compression on an index that is going to be updated a ton?
3) I am really tired this morning, so I can't figure out what the need is for the ISNULL(,1/1/1900) constructs.
4) I still look forward to the real "real-life" test where the running total is partitioned by one or more columns (say a CustomerID and/or ProductID, et al) and it is done as a SELECT output, not an UPDATE. This will require you to put the data for the Quirky Update into a temp table to cover all the bases (such as the clustered index) and pay the cost of doing that, which the window function solution doesn't need, although it will need a sort that carries much the same cost.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
May 31, 2015 at 9:37 am
TheSQLGuru (5/31/2015)
Several things: 1) Main one is that you are putting NULL into the running total column to start. You KNOW you are creating fragmentation with page splits. So why didn't you just populate it with the appropriate default of 0.00? 🙂 Now both indexes can be 100%.
I'm pretty tired myself but, since the column is a fixed-length datatype, I'm pretty sure there won't be any page splits because of setting the column to NULL; a fixed-width column occupies its full width even when NULL, so the later update doesn't grow the row.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 31, 2015 at 10:54 am
TheSQLGuru (5/31/2015)
Several things: 1) Main one is that you are putting NULL into the running total column to start. You KNOW you are creating fragmentation with page splits. So why didn't you just populate it with the appropriate default of 0.00? 🙂 Now both indexes can be 100%.
I just ran a few tests and I don't see where changing the default makes a difference... (see the attached spreadsheet)
- When there is no index compression both can have a FF of 100% and the update doesn't appear to have any adverse impact on either of them.
- When Page Compression is on and FF=100, the update causes 99% +/- fragmentation in both cases.
- When Page Compression is on and FF=78, the update doesn't appear to have any adverse impact on either of them.
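For anyone who wants to reproduce it, a check along these lines (the usual DMV query, not necessarily the exact script behind the spreadsheet) shows the fragmentation and page counts for each index, run before and after the update:
SELECT
i.name AS IndexName,
ps.avg_fragmentation_in_percent,
ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.RunningTotal_1'), NULL, NULL, 'SAMPLED') ps
JOIN sys.indexes i
ON i.object_id = ps.object_id
AND i.index_id = ps.index_id;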
2) Why are you using page compression on an index that is going to be updated a ton?
The reasoning is that, even with the fill factor adjusted to accommodate the update, the number of pages was still more than cut in half when compared to the uncompressed page count.
3) I am really tired this morning, so I can't figure out what the need is for the ISNULL(,1/1/1900) constructs.
It's just a little trick for doing a "SELECT ... INTO" that makes the resulting column NOT NULL. It's handy if you want to add a primary key after the table has been created.
It was a carry-over from the original table, where DateTimeStamp was the PK. Since it's not the PK in the new version of the table, it's not actually needed... But it's not hurting anything either.
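In case anyone reading along hasn't seen the trick before, here's a tiny standalone demo, mirroring the tally construct from above (the table and column names are just for illustration):
IF OBJECT_ID('tempdb..#NotNullDemo', 'U') IS NOT NULL
DROP TABLE #NotNullDemo;
WITH t (n) AS (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1),(2),(3)) v (v)
)
SELECT
ISNULL(t.n, 0) AS KeyCol, -- wrapped in ISNULL, so created NOT NULL
DATEADD(mi, t.n, '2010-01-01') AS PlainCol -- no ISNULL, so created NULLable
INTO #NotNullDemo
FROM t;
SELECT name, is_nullable FROM tempdb.sys.columns WHERE object_id = OBJECT_ID('tempdb..#NotNullDemo');
-- Succeeds because KeyCol is NOT NULL; it would fail on PlainCol.
ALTER TABLE #NotNullDemo ADD PRIMARY KEY CLUSTERED (KeyCol);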
4) I still look forward to the real "real-life" test where the running total is partitioned by one or more columns (say a CustomerID and/or ProductID, et al) and it is done as a SELECT output not an UPDATE. This will require you to put the data for the Quirky Update into a temp table to cover all the bases such as clustered index and pay the cost for doing that (which the window function solution doesn't need, although it will need a sort in the operation carrying mostly the same effect).
A partition column has been added (TransactionDay). Granted, it's just the DateTimeStamp cast as a DATE... Not terribly interesting, but it is something to partition on.
As to running the Quirky through a temp table... Now I see what you were originally driving at in the previous post. I like it and I'll do it. 😀
On points 1 & 2... I just want to acknowledge that what I don't know about indexes would fill a much bigger bucket than what I do know... If there are holes in my logic or I'm missing important pieces of the puzzle, please let me know.
Thanks,
Jason
June 1, 2015 at 7:18 pm
As promised... Something a little closer to "real life"... SELECT outputs rather than updates, and a partition has been added to the data.
Here is the new(ish) test table...
/* ================================================================
Create a simple 1M test table with a single key column clustered index
================================================================ */
IF OBJECT_ID('dbo.RunningTotal', 'U') IS NOT NULL
DROP TABLE dbo.RunningTotal;
WITH n (n) AS (-- creates a million row tally table
SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n (n)
), Tally (n) AS (
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM n n1, n n2, n n3, n n4, n n5, n n6
)
SELECT
ISNULL(t.n, 0) AS TransactionID,
DATEADD(mi, t.n, '2010-01-01') AS DateTimeStamp,
CAST(DATEADD(mi, t.n, '2010-01-01') AS DATE) AS TransactionDay,
CAST(1.0 AS NUMERIC(19,4)) AS TransactionAmount
INTO dbo.RunningTotal
FROM
Tally t;
--==========================================================================
-- Add a primary key... Because every good table should have one...
ALTER TABLE dbo.RunningTotal ADD CONSTRAINT pk_RunningTotal PRIMARY KEY CLUSTERED (TransactionID);
--==========================================================================
-- Add a POC index for the Windowed Update --
CREATE UNIQUE NONCLUSTERED INDEX ix_RunningTotal_TransactionDay_DateTimeStamp ON dbo.RunningTotal (
TransactionDay,
DateTimeStamp
)
INCLUDE (
TransactionAmount,
TransactionID
)
WITH (DATA_COMPRESSION = PAGE)
--==========================================================================
First up... The Quirky Method...
/* ================================================================
Quirky Method
================================================================ */
-- If the #TempRunningTotal currently exists, drop it.
IF OBJECT_ID('tempdb..#TempRunningTotal') IS NOT NULL
DROP TABLE #TempRunningTotal;
-- Select the data from dbo.RunningTotal into #TempRunningTotal
SELECT
rt.TransactionID,
rt.DateTimeStamp,
rt.TransactionDay,
rt.TransactionAmount,
CAST(0.00 AS NUMERIC(19,4)) AS RunningTotal
INTO #TempRunningTotal
FROM
dbo.RunningTotal rt;
-- Rule #1... Must have a clustered index that establishes the correct order...
CREATE UNIQUE CLUSTERED INDEX cix_TempRunningTotal_TransactionID ON #TempRunningTotal (TransactionID);
-- Declare the necessary variables.
DECLARE
@rt NUMERIC(19,4) = 0,
@tm DATE = '1900-01-01',
@anchor INT = 0;
-- Perform the actual Quirky Update...
UPDATE trt SET
@anchor = trt.TransactionID, -- Rule 6: Use an anchor based on the clustered index.
@rt = trt.RunningTotal = CASE WHEN trt.TransactionDay = @tm THEN @rt + trt.TransactionAmount ELSE trt.TransactionAmount END,
@tm = trt.TransactionDay
FROM
#TempRunningTotal trt WITH (TABLOCKX) -- Rule 4: Make sure you have an exclusive lock on the whole table through the entire transaction.
OPTION (MAXDOP 1); -- Rule 2: Prevent parallelism.
-- Select the updated data from #TempRunningTotal.
SELECT
trt.TransactionID,
trt.DateTimeStamp,
trt.TransactionDay,
trt.TransactionAmount,
trt.RunningTotal
FROM
#TempRunningTotal trt
ORDER BY
trt.TransactionID;
-- Drop #TempRunningTotal.
DROP TABLE #TempRunningTotal;
Note... All executions were completed with "Discard results after execution" turned on, to remove display rendering from the result times.
Quirky results...
Now for the Windowed Function...
/* ================================================================
Windowed Function
================================================================ */
SELECT
rt.TransactionID,
rt.DateTimeStamp,
rt.TransactionDay,
rt.TransactionAmount,
SUM(rt.TransactionAmount) OVER (PARTITION BY rt.TransactionDay ORDER BY rt.DateTimeStamp ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM
dbo.RunningTotal rt
And the results...
When testing, I know that I'm not supposed to root for one solution over another... But... I have to admit that I'm kinda happy to see the Windowed version come out on top in this race. 😀
Edit... Added actual execution plans.
June 1, 2015 at 7:35 pm
Thanks for that Jason!
When I want to test the server performance of a large set I usually do it thusly:
declare @var1, @var2, @var3, @var4, @var5 (appropriate definitions here)
SELECT
@var1 = rt.TransactionID,
@var2 = rt.DateTimeStamp,
@var3 = rt.TransactionDay,
@var4 = rt.TransactionAmount,
@var5 = SUM(rt.TransactionAmount) OVER (PARTITION BY rt.TransactionDay ORDER BY rt.DateTimeStamp ROWS UNBOUNDED PRECEDING)
FROM
dbo.RunningTotal rt
Now there is ZERO overhead for transferring data back to client (even if it is just to be discarded). We are truly testing server query performance.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
June 1, 2015 at 8:19 pm
TheSQLGuru (6/1/2015)
Thanks for that Jason!
No problem. Thanks for kicking me in the back side to get it done... With the data partition and the select output. 😀
When I want to test the server performance of a large set I usually do it thusly:
declare @var1, @var2, @var3, @var4, @var5 (appropriate definitions here)
SELECT
@var1 = rt.TransactionID,
@var2 = rt.DateTimeStamp,
@var3 = rt.TransactionDay,
@var4 = rt.TransactionAmount,
@var5 = SUM(rt.TransactionAmount) OVER (PARTITION BY rt.TransactionDay ORDER BY rt.DateTimeStamp ROWS UNBOUNDED PRECEDING)
FROM
dbo.RunningTotal rt
Now there is ZERO overhead for transferring data back to client (even if it is just to be discarded). We are truly testing server query performance.
I'm not familiar with that particular syntax... Are you defining the variables as table or scalar variables?
I like the "Discard results..." option simply because it's an easy toggle in SSMS, it doesn't require any alteration to the code being tested and doesn't have any impact on either the estimated or actual plans...
That said, I'm always up for learning something new and/or finding a better way of doing things. Do you have a link to a working example?
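If it's scalar variables, I'm guessing the working version looks something like this (an untested sketch against the dbo.RunningTotal table from above):
DECLARE
@TransactionID INT,
@DateTimeStamp DATETIME,
@TransactionDay DATE,
@TransactionAmount NUMERIC(19,4),
@RunningTotal NUMERIC(19,4);
SELECT
@TransactionID = rt.TransactionID,
@DateTimeStamp = rt.DateTimeStamp,
@TransactionDay = rt.TransactionDay,
@TransactionAmount = rt.TransactionAmount,
@RunningTotal = SUM(rt.TransactionAmount) OVER (PARTITION BY rt.TransactionDay ORDER BY rt.DateTimeStamp ROWS UNBOUNDED PRECEDING)
FROM
dbo.RunningTotal rt; -- every row is processed, but nothing is sent to the client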
Thanks,
Jason
June 2, 2015 at 2:16 am
Hi folks, I am in a similar situation... Can you please see my post and reply?
My post title is: "Trying to get running Total in SQL Server"
Thanks
Devsql
June 2, 2015 at 5:52 am
devsql123 (6/2/2015)
Hi folks, I am in a similar situation... Can you please see my post and reply? My post title is: "Trying to get running Total in SQL Server"
Thanks
Devsql
Eirikur's most recent answer on that post would seem to work correctly for 2012 and up. What's the problem there?
--Jeff Moden
Change is inevitable... Change for the better is not.
June 2, 2015 at 8:18 am
TheSQLGuru (6/1/2015)
When I want to test the server performance of a large set I usually do it thusly:
declare @var1, @var2, @var3, @var4, @var5 (appropriate definitions here)
SELECT
@var1 = rt.TransactionID,
@var2 = rt.DateTimeStamp,
@var3 = rt.TransactionDay,
@var4 = rt.TransactionAmount,
@var5 = SUM(rt.TransactionAmount) OVER (PARTITION BY rt.TransactionDay ORDER BY rt.DateTimeStamp ROWS UNBOUNDED PRECEDING)
FROM dbo.RunningTotal rt
Now there is ZERO overhead for transferring data back to client (even if it is just to be discarded). We are truly testing server query performance.
[font="Comic Sans MS"]Never thought of this way of doing it. Absolutely clever - this one goes into my bag of tricks. Love this forum - learn something new every day.[/font]