February 14, 2008 at 5:06 pm
Christopher Ford (2/13/2008)
Earlier Jeff posted a question on using SUM(SAL) OVER(), since OVER() doesn't support ORDER BY when used with an aggregation.
How about this one?
SELECT J1.AccountID,
J1.Date,
J1.Amount AS AmountThisDate,
SUM(J2.Amount) AS RunningBalance,
CAST(AVG(1.*J2.Amount) AS DECIMAL(12, 2)) AS AvgRunningBalance
FROM dbo.JBMTest AS J1
JOIN dbo.JBMTest AS J2
ON J2.AccountID = J1.AccountID
AND (J2.Date >= '01/01/2000' --Anchor Date from example
AND J2.Date <= J1.Date)
GROUP BY J1.AccountID, J1.Date, J1.Amount
ORDER BY J1.AccountID, J1.Date;
Not quite as elegant as SUM(SAL) OVER()
But I'm willing to bet that the query plans generated for both would be similar to this query's.
Credit doesn't go to me for this method. I learned this, like many other things, from other people. The method gets credited to Itzik since I learned it from one of his many articles, and it's useful in so many circumstances.
And this one works on SQL 2000. =)
Like I said in the Triangular Join Article, not all Triangular Joins are bad if they're against small sets. This one only generates about 11 million internal rows which isn't bad for a million row table. And, in the presence of a good and proper index, it only takes 00:01:24 to execute... that makes it only about 12 times slower than the UPDATE method. Sounds like I'm being sarcastic, and I'm not... that's actually pretty good for something that uses triangular joins.
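If you want to sanity check that internal row count, here's a rough way to estimate it (back-of-the-envelope only; it assumes the dates within an account are unique and all fall on or after the anchor date):
-- Each account with n rows contributes roughly n*(n+1)/2 joined rows to the
-- triangular join, so summing that across accounts approximates the internal rows.
SELECT SUM(c.n * (c.n + 1) / 2) AS EstimatedInternalRows
  FROM (SELECT COUNT(*) AS n
          FROM dbo.JBMTest
         GROUP BY AccountID) AS c;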
--Jeff Moden
Change is inevitable... Change for the better is not.
February 14, 2008 at 6:57 pm
Yep... tested the heck out of the CTE solution with ROW_NUMBER() for the running balance... without the clustered index on the appropriate columns, it falls apart in that the running balance starts over within the same AccountID... here's the code I used to find groups of AccountIDs that have more than one "start" in a running balance. Pick any AccountID from the first query's results and plug it into the second query to see what I'm talking about...
SELECT AccountID, COUNT(*) AS FalseReset
  FROM dbo.JBMTest
 WHERE Amount = GrpBal
 GROUP BY AccountID
HAVING COUNT(*) > 1
 ORDER BY AccountID
SELECT * FROM dbo.JBMTest WHERE AccountID = ???? ORDER BY AccountID, Date
The reason it fails is that it picks up the "other" index (the PK), which turns out to be a Merry-Go-Round index. This also proves what Matt was saying... Row_Number() doesn't guarantee the correct order of processing... only that the Row_Number() values will be in the correct order when compared to its own ORDER BY clause.
Too bad... that was a really nice idea...
--Jeff Moden
Change is inevitable... Change for the better is not.
February 14, 2008 at 7:12 pm
This has all been totally awesome! Give yourselves a huge pat on the back... 122 posts on a fairly controversial subject and no flame wars. My hat is off to all of you! 🙂
Second of all, 122 posts of mostly great ideas even if some of them don't quite work as expected... at least now we know what works, what doesn't work, why it does or doesn't work, and how long most of them take. You can't BUY that kind of quality testing and we've all benefited. So, give yourselves another huge pat on the back... this has been an absolutely wonderful showing by some real professionals!
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2008 at 1:00 am
Jeff Moden (2/14/2008)
Yep... tested the heck out of the CTE solution with ROW_NUMBER() for the running balance... without the clustered index on the appropriate columns, it falls apart in that the running balance starts over within the same AccountID... here's the code I used to find groups of AccountIDs that have more than one "start" in a running balance. Pick any AccountID from the first query's results and plug it into the second query to see what I'm talking about...
SELECT AccountID, COUNT(*) AS FalseReset
  FROM dbo.JBMTest
 WHERE Amount = GrpBal
 GROUP BY AccountID
HAVING COUNT(*) > 1
 ORDER BY AccountID
SELECT * FROM dbo.JBMTest WHERE AccountID = ???? ORDER BY AccountID, Date
The reason it fails is that it picks up the "other" index (the PK), which turns out to be a Merry-Go-Round index. This also proves what Matt was saying... Row_Number() doesn't guarantee the correct order of processing... only that the Row_Number() values will be in the correct order when compared to its own ORDER BY clause.
Too bad... that was a really nice idea...
Well...There's got to be a way to use the CTE and ROW_NUMBER() and guarantee an ordered update...
Other than just creating a clustered index...
Isn't there? I mean... Jeff, let's face it... several people here have had their previous beliefs about the order of processing shattered. And several people who have been preaching this idea for a while are finally singing hallelujah praises that someone has proven it to be true...
I finally understand why SQL Server's default scripting behaviour favors inserting all your data into a new temp table, dropping your old table, and renaming the temp table to your old table's name.
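Roughly this pattern, if memory serves on what the generated scripts do (boiled way down; the Tmp_ name is just a placeholder, so take it as a sketch rather than the exact script SSMS emits):
-- Copy the data into a brand new table...
SELECT AccountID, Date, Amount
  INTO dbo.Tmp_JBMTest
  FROM dbo.JBMTest;
-- ...drop the original...
DROP TABLE dbo.JBMTest;
-- ...and rename the copy into place.
EXEC sp_rename 'dbo.Tmp_JBMTest', 'JBMTest';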
So, can anyone quickly break down the very short list of ways we can do a guaranteed-order running total update to a table?
There's a treasure load of methods posted here that could be used in many other situations... I'll definitely be testing them in other areas, especially the one that does the calculation in XML; that was sweet.
February 15, 2008 at 1:38 am
Well it certainly started with a great article, handling a fairly common issue 😎
And it resulted in a solution that worked at lightning speed and performed
the actual update in a single statement run. :smooooth:
And despite many efforts, re-confirmed the fact that there is no order in a set-based approach !
But the challenge has been set ! ➡ Provide an order-based solution that beats lightning :w00t:
Johan
Learn to play, play to learn !
Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:
- How to post Performance Problems
- How to post data/code to get the best help
- How to prevent a sore throat after hours of presenting ppt
press F1 for solution, press shift+F1 for urgent solution 😀
Need a bit of Powershell? How about this
Who am I ? Sometimes this is me but most of the time this is me
February 15, 2008 at 1:56 am
ALZDBA (2/15/2008)
Well it certainly started with a great article, handling a fairly common issue 😎 And it resulted in a solution that worked at lightning speed and performed
the actual update in a single statement run. :smooooth:
And despite many efforts, re-confirmed the fact that there is no order in a set-based approach !
But the challenge has been set ! ➡ Provide an order-based solution that beats lightning :w00t:
:hehe: I know some one can figure it out. Not me this time. :crying:
February 15, 2008 at 5:50 am
>>Well...There's got to be a way to use the CTE and ROW_NUMBER() and guarantee an ordered update...
Sorry, but there doesn't have to be. Just because we really wish something to be true, and logically KNOW something to be true doesn't make it so. :hehe:
Review this thread and you will see a post I made long ago that mentions Itzik Ben-Gan's crusade (joined by many others) requesting Microsoft to fully implement the OVER clause that WOULD provide a true non-triangular-join set-based solution to this problem.
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
February 15, 2008 at 7:45 am
ALZDBA (2/15/2008)
Well it certainly started with a great article, handling a fairly common issue 😎 And it resulted in a solution that worked at lightning speed and performed
the actual update in a single statement run. :smooooth:
And despite many efforts, re-confirmed the fact that there is no order in a set-based approach !
But the challenge has been set ! ➡ Provide an order-based solution that beats lightning :w00t:
Great summary, Johan! And, thanks for the kudo!
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2008 at 8:57 am
TheSQLGuru (2/15/2008)
>>Well...There's got to be a way to use the CTE and ROW_NUMBER() and guarantee an ordered update...
Sorry, but there doesn't have to be. Just because we really wish something to be true, and logically KNOW something to be true, doesn't make it so. :hehe:
Review this thread and you will see a post I made long ago that mentions Itzik Ben-Gan's crusade (joined by many others) requesting Microsoft to fully implement the OVER clause that WOULD provide a true non-triangular-join set-based solution to this problem.
I surrender....for now. 😀
February 15, 2008 at 2:36 pm
OK, some fun on a slow Friday afternoon: this can indeed be implemented so that it is very, very quick without breaking any of the rules/guarantees provided by SQL.
The key is to recognize that CTEs defined as recursive queries are in fact a set-oriented tail recursion operator for SQL. This is an extremely powerful concept, and thus CTEs can be employed to do all sorts of wonderful things in a fast, set-oriented manner.
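To show the shape before the real example, here's a throwaway anchor-plus-recursive-member query (nothing to do with the test table, just the pattern):
WITH Numbers (n) AS
(
    SELECT 1           -- anchor member: the starting row
    UNION ALL
    SELECT n + 1       -- recursive member: applied to the previous step's rows
      FROM Numbers
     WHERE n < 10
)
SELECT n FROM Numbers; -- returns 1 through 10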
Here's my code:
The source table create and data load; this takes 2:04 (mins:secs) on my machine (pretty much lifted from the original article):
CREATE TABLE dbo.JBMTest
( RowNum INT IDENTITY (1,1) NOT NULL,
AccountID INT not NULL,
Amount MONEY not NULL,
Date DATETIME not NULL
--RunBal MONEY NULL,
--GrpBal MONEY NULL,
--RunCnt INT NULL,
--GrpCnt INT NULL
)
CREATE NONCLUSTERED INDEX IX_JBMTest_AccountID_Date --not clustered for "Merry-go-Round" test
ON dbo.JBMTest (AccountID, Date)
-- 2mins, 4secs on my machine:
--===== Build the table 100 rows at a time to "mix things up"
DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN --===== Add 100 rows to the test table
INSERT INTO dbo.JBMTest
(AccountID, Amount, Date)
SELECT TOP 100
AccountID = ABS(CHECKSUM(NEWID()))%50000+1
, Amount = CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY)
, Date = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Increment the counter
SET @Counter = @Counter + 100
END
Now the tricky CTE; we first build a table variable that sucks the whole source table out and applies a clustered index, then do the CTE query that adds the running totals:
declare @sequential as table
(
AccountID INT not NULL
, seq int not null
, Date DATETIME not NULL
, Amount MONEY not NULL
, unique clustered (AccountID, seq)
);
insert into @sequential
select
AccountID
, row_number() over(partition by AccountID order by Date) as seq
, Date
, Amount
from JBMTest;
with
summed(AccountID, seq, Date, Amount, Running) as
(
select
sq.AccountID
, sq.seq
, sq.Date
, sq.Amount
, sq.Amount as Running
from @sequential as sq
where sq.seq = 1
UNION ALL
select
sq.AccountID
, sq.seq
, sq.Date
, sq.Amount
, sq.Amount + pri.Running as Running
from @sequential as sq
join summed as pri on
sq.AccountID = pri.AccountID
and sq.seq = pri.seq + 1
where sq.seq > 1
)
select * from summed;
The above query -- INCLUDING generating the table variable -- takes 29 seconds on my machine, and is eligible for full parallelism, etc. (Note also that I did not specify any table/index hints.) If we add a final sort (order by accountID, seq) (most likely in practice, if used for reporting), the time balloons to all of 43 seconds.....
Reviewing the query plans shows that each row is visited exactly once; the recursive CTE is truly a tail-recursion operator in SQL!
Out of curiosity, if we were to grab the current balance using a sum...over(partition) construct:
select distinct(AccountID), sum(Amount) over(partition by AccountID) as cAmount
from JBMTest
that takes 2 seconds on my machine.
-frank
February 15, 2008 at 3:35 pm
Sir Slicendice (2/15/2008)
OK, some fun on a slow Friday afternoon: this can indeed be implemented so that it is very, very quick without breaking any of the rules/guarantees provided by SQL. The key is to recognize that CTEs defined as recursive queries are in fact a set-oriented tail recursion operator for SQL. This is an extremely powerful concept, and thus CTEs can be employed to do all sorts of wonderful things in a fast, set-oriented manner.
---Clipped for Size---
-frank
So I says to myself, bah, table variable...you can do it without a table variable!
... for an interesting exercise, please run this...or better yet, don't run this unless you want to lock your machine for a bit...
WITH Base
AS
(
select
ROW_NUMBER() OVER(PARTITION BY sq.AccountID ORDER BY sq.Date) as seq
,AccountID
,Date
,Amount
,Amount as Running
FROM dbo.JBMTest as sq
),
Summed (AccountID, seq, Date, Amount, Running)
AS
(
select
sq.AccountID
,sq.seq
,sq.Date
,sq.Amount
,sq.Amount as Running
FROM Base as sq
where sq.seq = 1
UNION ALL
select
sq.AccountID
, sq.seq
, sq.Date
, sq.Amount
, sq.Amount + pri.Running as Running
from Base as sq
join summed as pri on
sq.AccountID = pri.AccountID
and sq.seq = pri.seq + 1
where sq.seq > 1)
select * from summed;
Lesson Learned:
Don't do this with a CTE referencing a CTE to do a Recursive Tail-End Query based on a sequence generated by row_number in the first CTE...Yeah...That's not a good thing.
Frank, Awesome example! We are getting soooo close. =)
I tried to get the recursive thing to work yesterday but couldn't do it based on the data in the table. I didn't think about shoving it into a table variable and generating the sequence.
Course...if you do this on 10 million rows, I'm sure no one else wants to shove it into a table variable either. =)
February 15, 2008 at 3:56 pm
Yup, that's what I tried first, too. Seems the optimizer is not smart enough to recognize that row number partitioned by AccountID, combined with AccountID, is a primary key - forcing the issue with a table variable with a clustered index fixes that...
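If the table variable bothers anyone, a temp table with a clustered primary key should do the same job (I haven't timed this variant, so take it as a sketch; the column names mirror my earlier example):
CREATE TABLE #sequential
( AccountID INT      NOT NULL
, seq       INT      NOT NULL
, Date      DATETIME NOT NULL
, Amount    MONEY    NOT NULL
, PRIMARY KEY CLUSTERED (AccountID, seq)  -- gives the recursive join something to seek on
);
INSERT INTO #sequential (AccountID, seq, Date, Amount)
SELECT AccountID
     , ROW_NUMBER() OVER (PARTITION BY AccountID ORDER BY Date)
     , Date
     , Amount
  FROM dbo.JBMTest;
-- the recursive "summed" CTE from the earlier post can then read from #sequential
-- instead of @sequential.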
February 15, 2008 at 5:38 pm
Thanks for joining the thread, Frank... just curious... you posted the following... and said it takes 2 seconds on your machine...
Out of curiosity, if we were to grab the current balance using a sum...over(partition) construct:
select distinct(AccountID), sum(Amount) over(partition by AccountID) as cAmount from JBMTest
that takes 2 seconds on my machine.
What do you get for an output on the message tab if you run the following?
SET STATISTICS TIME ON
--===== Frank's code for current balance of all accounts
select distinct(AccountID), sum(Amount) over(partition by AccountID) as cAmount from JBMTest
--===== Standard Group By for current balance of all accounts
SELECT AccountID, SUM(Amount) AS cAmount
FROM dbo.JBMTest
GROUP BY AccountID
SET STATISTICS TIME OFF
Here's what I get...
SQL Server Execution Times:
   CPU time = 10906 ms, elapsed time = 15019 ms.
SQL Server Execution Times:
   CPU time = 985 ms, elapsed time = 5709 ms.
I'm thinking that I wouldn't ever use SUM() OVER just to do a simple aggregate. If Microsoft ever gets SUM() OVER to work with an ORDER BY (i.e. SUM(Amount) OVER (PARTITION BY AccountID ORDER BY Date)), then we'll be cooking with gas and such "parlor tricks" as what I've shown will no longer be necessary. 😛
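Just so we're all wishing for the same thing, the syntax would presumably look something like this (NOT supported in SQL Server 2005, so it's strictly a sketch of the wish and won't run on anything we're testing here):
-- Wished-for windowed running total:
SELECT AccountID, Date, Amount,
       SUM(Amount) OVER (PARTITION BY AccountID
                         ORDER BY Date
                         ROWS UNBOUNDED PRECEDING) AS RunningBalance
  FROM dbo.JBMTest;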
Reviewing the query plans shows that each row is visited exactly once; the recursive CTE is truly a tail-recursion operator in SQL!
I've just gotta disagree, Frank... Recursion sure doesn't look set-based in any actual execution plan I've ever seen, including the one for your code. Take a look at the single-row loops in the actual execution plan... I've gotta say that means recursion actually qualifies for the name of "Hidden RBAR". Here's the code to show what I mean... also, timings are available on the message tab if you actually run the code... the code is in 3 sections... (1) the table build itself, (2) the update method, including a copy to a temp table just in case we can't actually update the source table, and (3) your method with the recursive CTE...
... the update method, even when hampered by having to copy everything to a temp table, still runs in about a third of the time that the recursive CTE code takes...
SET STATISTICS TIME OFF
--===== Create and populate a 1,000,000 row test table 100,000 rows at a time to
-- simulate inserting multiple sets of rows.
-- Column "RowNum" is an IDENTITY and has a range of 1 to 1,000,000 unique numbers
-- Column "AccountID" has a range of 1 to 50,000 non-unique numbers
-- Column "Amount has a range of 0.0000 to +/- 99.9900 non-unique numbers
-- Column "Date" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
--===== If the test table already exists, drop it
SET NOCOUNT ON
IF OBJECT_ID('dbo.JBMTest','U') IS NOT NULL
DROP TABLE dbo.JBMTest
GO
--===== Create the test table
CREATE TABLE dbo.JBMTest
(
RowNum INT IDENTITY (1,1) NOT NULL,
AccountID INT NULL,
Amount MONEY NULL,
Date DATETIME NULL,
RunBal MONEY NULL,
GrpBal MONEY NULL,
RunCnt INT NULL,
GrpCnt INT NULL
)
CREATE NONCLUSTERED INDEX IX_JBMTest_AccountID_Date --not clustered for "Merry-go-Round" test
ON dbo.JBMTest (AccountID, Date)
--===== Build the table 100,000 rows at a time to "mix things up"
DECLARE @Counter INT
SET @Counter = 0
WHILE @Counter < 1000000
BEGIN
--===== Add 100,000 rows to the test table
INSERT INTO dbo.JBMTest
(AccountID, Amount, Date)
SELECT TOP 100000
AccountID = ABS(CHECKSUM(NEWID()))%50000+1,
Amount = CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),
Date = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)
FROM Master.dbo.SysColumns t1
CROSS JOIN Master.dbo.SysColumns t2
--===== Increment the counter
SET @Counter = @Counter + 100000
END
GO
------------------------------------------------------------------------------------------
PRINT REPLICATE ('=',70)
PRINT 'The UPDATE method... includes temp table creation...'
IF OBJECT_ID('tempdb..#RunningEverything') IS NOT NULL
   DROP TABLE #RunningEverything
SET STATISTICS TIME ON
CREATE TABLE #RunningEverything
(
RowNum INT,
AccountID INT NOT NULL,
Date DATETIME NOT NULL,
Amount MONEY NOT NULL,
RunBal MONEY,
GrpBal MONEY,
RunCnt INT ,
GrpCnt INT,
PRIMARY KEY CLUSTERED (AccountID,DATE,Amount,RowNum))
INSERT INTO #RunningEverything
(RowNum,AccountID,Date,Amount)
SELECT CAST(RowNum AS INT), --Cast is to overcome identity in jbmtest
AccountID,Date,Amount
FROM dbo.jbmTest
--===== Declare the variables for the "Code Basis"
DECLARE @PrevRunBal MONEY --Overall running total
SET @PrevRunBal = 0
DECLARE @PrevGrpBal MONEY --Running total resets when account changes
SET @PrevGrpBal = 0
DECLARE @PrevRunCnt INT --Overall running count (ordinal rank)
SET @PrevRunCnt = 0
DECLARE @PrevGrpCnt INT --Running count resets when account changes
SET @PrevGrpCnt = 0
DECLARE @PrevAcctID INT --The "anchor" and "account change detector"
SET @PrevAcctID = 0
--===== Solve 2 types of Running Total and 2 types of Running Count problems
-- using a single update based on a Clustered Index at VERY high speeds.
UPDATE #RunningEverything
SET --===== Running Total
@PrevRunBal = RunBal = @PrevRunBal + Amount,
--===== Grouped Running Total (Reset when account changes)
@PrevGrpBal = GrpBal = CASE
WHEN AccountID = @PrevAcctID
THEN @PrevGrpBal + Amount
ELSE Amount -- Restarts total at "0 + current amount"
END,
--===== Running Count (Ordinal Rank)
@PrevRunCnt = RunCnt = @PrevRunCnt + 1,
--===== Grouped Running Total (Ordinal Rank, Reset when account changes)
@PrevGrpCnt = GrpCnt = CASE
WHEN AccountID = @PrevAcctID
THEN @PrevGrpCnt + 1
ELSE 1 -- Restarts count at "1"
END,
--===== "Anchor" and provides for "account change detection"
@PrevAcctID = AccountID
FROM #RunningEverything WITH (TABLOCKX)
--===== Display the results in the correct order
SELECT *
FROM #RunningEverything
ORDER BY AccountID, Date
SET STATISTICS TIME OFF
--------------------------------------------------------------------------------------------
GO
PRINT REPLICATE ('=',70)
PRINT 'The Recursive method... includes table variable creation...'
-------------------------------------------------------------------------------------------------------
SET STATISTICS TIME ON
declare @sequential as table
(
AccountID INT not NULL
, seq int not null
, Date DATETIME not NULL
, Amount MONEY not NULL
, unique clustered (AccountID, seq)
);
insert into @sequential
select
AccountID
, row_number() over(partition by AccountID order by Date) as seq
, Date
, Amount
from JBMTest;
with
summed(AccountID, seq, Date, Amount, Running) as
(
select
sq.AccountID
, sq.seq
, sq.Date
, sq.Amount
, sq.Amount as Running
from @sequential as sq
where sq.seq = 1
UNION ALL
select
sq.AccountID
, sq.seq
, sq.Date
, sq.Amount
, sq.Amount + pri.Running as Running
from @sequential as sq
join summed as pri on
sq.AccountID = pri.AccountID
and sq.seq = pri.seq + 1
where sq.seq > 1
)
select * from summed ORDER BY AccountID,Date;
SET STATISTICS TIME OFF
I don't believe I'll be using recursion for running balances, but I sure do appreciate the time you spent to write and submit code for testing! Thanks, Frank! :w00t:
For those that, for some reason, cannot run the code above, here's the output from the message tab of the code...
=====================================================================
The UPDATE method... includes temp table creation...
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 51 ms.
SQL Server Execution Times:
CPU time = 17079 ms, elapsed time = 22142 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
CPU time = 8125 ms, elapsed time = 9912 ms.
SQL Server Execution Times:
CPU time = 1515 ms, elapsed time = 38744 ms.
=====================================================================
The Recursive method... includes table variable creation...
SQL Server Execution Times:
CPU time = 17344 ms, elapsed time = 22548 ms.
SQL Server Execution Times:
CPU time = 55109 ms, elapsed time = 95902 ms.
--Jeff Moden
Change is inevitable... Change for the better is not.
February 15, 2008 at 5:43 pm
You know... with very little effort, Frank's query could easily be adapted for a self-referencing table to do a rolling update in a guaranteed order for each child under a parent.
You would do it a level at a time instead of blasting through the whole table, but it should be very quick... and handy if you have tables that are self-referencing.
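Something along these lines, maybe (just a sketch; dbo.Accounts, ParentID, and Lvl are made-up names, not anything from the article):
-- Walk a self-referencing (parent/child) table a level at a time with a
-- recursive CTE, carrying a level number a per-level update could be driven from.
WITH Tree (AccountID, ParentID, Amount, Lvl) AS
(
    SELECT AccountID, ParentID, Amount, 0               -- anchor: top-level parents
      FROM dbo.Accounts
     WHERE ParentID IS NULL
    UNION ALL
    SELECT c.AccountID, c.ParentID, c.Amount, t.Lvl + 1 -- each pass adds one level of children
      FROM dbo.Accounts AS c
      JOIN Tree AS t
        ON c.ParentID = t.AccountID
)
SELECT AccountID, ParentID, Amount, Lvl
  FROM Tree
 ORDER BY Lvl, ParentID, AccountID;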
February 15, 2008 at 6:09 pm
Christopher Ford (2/15/2008)
So I says to myself, bah, table variable...you can do it without a table variable!
... for an interesting exercise, please run this...or better yet, don't run this unless you want to lock your machine for a bit...
Sorry, Chris... after 12 minutes, it had only done 50,000 rows... I cancelled the query.
Heh... I don't mean to be too much of a smarta$$ here, but they called it "Recursion" so it would show up in the dictionary right after "RBAR" and "Real Slow". :laugh: I don't believe anything recursive will beat the direct update method although some pretty cool code is bubbling to the surface in the attempts. Thanks guys. 🙂
Anyone wanna try a CLR? 😀
--Jeff Moden
Change is inevitable... Change for the better is not.