Solving the "Running Total" & "Ordinal Rank" Problems in SS 2k/2k5

  • Jeff Moden (2/15/2008)


    Christopher Ford (2/15/2008)


    So I says to myself, bah, table variable...you can do it without a table variable!

    ... for an interesting exercise, please run this...or better yet, don't run this unless you want to lock your machine for a bit...

    Sorry, Chris... after 12 minutes, it had only done 50,000 rows... I cancelled the query.

    Heh... I don't mean to be too much of a smarta$$ here, but they called it "Recursion" so it would show up in the dictionary right after "RBAR" and "Real Slow". :laugh: I don't believe anything recursive will beat the direct update method although some pretty cool code is bubbling to the surface in the attempts. Thanks guys. 🙂

    Anyone wanna try a CLR? 😀

    Actually - Adam (Machanic) figured that the CLR might be a good solution, since (a) it guarantees order and (b) it doesn't necessarily require RBAR (you can make the changes in memory and push them to disk). I've been kicking it around in the old peashooter up top - not quite there yet.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Heh... I knew it would be YOU, Matt... after all the testing you did for/with me on the "other" CLRs, I should have just asked "Hey Matt... you wanna try writing a CLR for this?" 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I suppose you're going to tell me I'm getting to be predictable now.... :hehe:

    There's the obvious approach - emulating a SQL-style cursor. But the point is - .NET is so much smarter about loops (as in - process and store the results in memory, then start streaming them back) that it would be a shame not to give that a whirl and see if we can't give the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Boy, I go off to a CodeCamp dinner and come back and BAM - 15 more entries on this thread! 🙂

    2 points

    1) SELECT ... OVER(ORDER BY ...) does guarantee the ordering within the OVER clause, but it does NOT, IIRC, guarantee the order of the SELECT output.

    2) use of a table variable voids the use of parallelism, which can significantly reduce performance for large-dataset processing.
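
    A quick sketch of point 1 (column names borrowed from the test tables in this thread, purely illustrative): the OVER clause orders the ROW_NUMBER() calculation, but only an outer ORDER BY pins down the order of the rows you actually get back.

    SELECT AccountID, Date, Amount,
           ROW_NUMBER() OVER (ORDER BY AccountID, Date) AS seq
      FROM dbo.JBMTestMerry
     ORDER BY seq --without this outer ORDER BY, the output order is NOT guaranteed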

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    1) SELECT ... OVER(ORDER BY ...) does guarantee the ordering within the OVER clause, but it does NOT, IIRC, guarantee the order of the SELECT output.

    If you're referring to the CTE approach I posted, the order returned is irrelevant; the ordering is captured and preserved in the seq column in the table var. This is completely legit SQL, fully supported, no special circumstances required. The recursive CTE expression processes the table var a 'level' at a time to build the result table for the enclosing query.
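
    Something like this, in other words (a minimal sketch with illustrative names, not my exact posted code) - the INSERT ... SELECT ... ORDER BY guarantees that the IDENTITY values are assigned in the ORDER BY order, so seq captures the ordering no matter how rows come back later:

    DECLARE @work TABLE
    (
        seq       INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, --the captured ordering
        AccountID INT,
        Amount    MONEY
    )

    INSERT INTO @work (AccountID, Amount)
    SELECT AccountID, Amount
      FROM dbo.JBMTestMerry
     ORDER BY AccountID, Date --identity assignment follows this ORDER BY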

    2) use of a table variable voids the use of parallelism, which can significantly reduce performance for large-dataset processing.

    You are of course correct; I shouldn't have mentioned the parallel bit. However, this restriction may at some point be lifted (no language reasons why it couldn't be, it's just an engine implementation detail). Also, with the way SQL defines handling of table vars, it is entirely possible that further enhancements of the query engine could result in the var only being populated on demand, at first use -- and in this sort of query, a (better) optimizer could thus completely eliminate the table variable....

    -frank


    The End.

  • Jeff,

    Sorry, I should have posted the group by time as well; it's about the same as yours, 800 msec or so. I agree that using the partitioned sum as a group by is a loss; I posted it because I thought it was an interesting coding construct (combining distinct() with sum()) and interesting in that the optimizer did not detect the group by emulation.

    As far as recursion, I will try to look at the situation a little closer; I am interested in the differences between the plans for the table var approach (fast) versus the all-in-one CTE approach (slow triangular join). I only glanced at the plans, but I did note that the table var approach only processes the same number of rows as the base table (each row processed once), while the all-CTE approach processes the full triangle. (Row counts in the plan steps indicate this.) I'm going to go way out on a limb (as I haven't looked at this enough to really say this), but I suspect that the plan may not really be showing enough detail to identify the full sophistication of the table var approach. I do plan on looking into this more.

    The thing to note is that while this is a recursive query in the CTE, it is tail recursive, which can (in theory, and seemingly in practice here) be fully optimized into a loop. The code does look like a triangular join, and if you look at it closely you may even wonder why the results are not the full triangular result set. However, this is the magic of the recursive CTE: a recursive CTE is a totally new SQL construct, and it cannot be emulated with the obvious while loop around the same query! In this specific case, we see that when the optimizer has access to an appropriate index, it will indeed turn the triangular join into a loop, and thus be very, very fast.

    The only unfortunate aspect to me is that you need the table variable here, with the clustered index, in order to get the optimizer to make the right choice; it would be vastly preferable if this could be done as two CTE tables (the table var replaced with a non-recursive query) in the WITH cte block (in a single statement query). But that's a relatively minor point here.... To me, this is a huge win in that we are limiting ourselves to fully supported language constructs and we don't have to modify source tables to make this fast: further improvements to the optimizer and query engine will only make this approach even faster, they won't break this....
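
    To make the shape of this concrete, here's a minimal sketch of the approach (illustrative names; simplified, not my exact posted code): the ordering is captured into seq in the table var with a clustered primary key, and the recursive CTE walks seq one level at a time, restarting the total at each account break.

    DECLARE @src TABLE
    (
        seq       INT PRIMARY KEY CLUSTERED, --the captured ordering drives the recursion
        AccountID INT,
        Amount    MONEY
    )

    INSERT INTO @src (seq, AccountID, Amount)
    SELECT ROW_NUMBER() OVER (ORDER BY AccountID, Date), AccountID, Amount
      FROM dbo.JBMTestMerry

    ;WITH RunTot (seq, AccountID, Amount, RunBal) AS
    (
        SELECT seq, AccountID, Amount, Amount
          FROM @src
         WHERE seq = 1
        UNION ALL
        SELECT s.seq, s.AccountID, s.Amount,
               CASE WHEN s.AccountID = r.AccountID
                    THEN r.RunBal + s.Amount
                    ELSE s.Amount --restart at the account break
               END
          FROM @src s
         INNER JOIN RunTot r ON s.seq = r.seq + 1
    )
    SELECT seq, AccountID, Amount, RunBal
      FROM RunTot
    OPTION (MAXRECURSION 0)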

    Of course, as with everything in this area, it is possible that something changes somewhere and the optimizer will suddenly decide to not optimize the recursive join into a loop; that's always a possibility, but seems very unlikely here. However, the result set cannot change; our only risk with this approach is that the optimizer may cause the query to run a lot slower. And while the speed difference would be huge (ginormous even), I think it is highly significant that we do not need to worry about the result set changing. Speed problems we generally live to fix another day; accuracy errors we may not survive....

    -frank


    The End.

  • TheSQLGuru (2/15/2008)


    2) use of a table variable voids the use of parallelism, which can significantly reduce performance for large-dataset processing.

    Kevin, do you, by any chance, have a Microsoft article that states that? Seems like I've been in a huge number of "arguments" about whether to use Temp Tables or Table Variables and, as part of the "It Depends" answer, I wouldn't mind having that particular piece of information as a verifiable source.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Frank,

    Heh... understood on things possibly changing...

    It's a bit funny... for years, people jumped on me for using the occasional undocumented feature (xp_DirTree to be specific) because "Microsoft could change those at any time without notice". Then, on 2k, they changed the required privs for a documented feature called sp_MakeWebTask that blew a lot of people's code out of the water. 2k had some great documented features like the {f4} key. That very nice feature was removed in 2k5. The list goes on but you get the idea. Dunno what the hell they're going to mess up in 2k8 😀

    The really ironic part of all this is that the "undocumented feature" that I used way back when, still hasn't changed. Go figure, huh? 😛

    Anyway, thanks for all of your feedback on this thread.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Matt Miller (2/15/2008)


    I suppose you're going to tell me I'm getting to be predictable now.... :hehe:

    There's the obvious approach - emulating a SQL-style cursor. But the point is - .NET is so much smarter about loops (as in - process and store the results in memory, then start streaming them back) that it would be a shame not to give that a whirl and see if we can't give the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.

    Predictable? I guess you could call it that... I was thinking more like "reliable" 🙂

    I agree... I don't believe that anything, including the best-written CLR, will ever be able to beat the update method, but I do think it can be written to be as fast as a firehose cursor :w00t:... heh, of course, as soon as I say that, someone will find a way (probably you). 😀

    I'm not sure, if I were you, that I'd waste the time writing the CLR for any reason other than to have an information point where one could say, "No, we already tried that and it wasn't as fast". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/16/2008)


    TheSQLGuru (2/15/2008)


    2) use of a table variable voids the use of parallelism, which can significantly reduce performance for large-dataset processing.

    Kevin, do you, by any chance, have a Microsoft article that states that? Seems like I've been in a huge number of "arguments" about whether to use Temp Tables or Table Variables and, as part of the "It Depends" answer, I wouldn't mind having that particular piece of information as a verifiable source.

    2005 BOL: ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/tsqlref9/html/1ef0b60e-a64c-4e97-847b-67930e3973ef.htm

    Important:

    Queries that modify table variables do not generate parallel query execution plans. Performance can be affected when very large table variables, or table variables in complex queries, are modified. In these situations, consider using temporary tables instead. For more information, see CREATE TABLE (Transact-SQL). Queries that read table variables without modifying them can still be parallelized.

    So the INSERT to populate it can't be parallelized, but SELECT access can be.

    I have moved several clients away from extensive use of table variables. They had either large datasets or poorly distributed data values (or both), which can lead to exceedingly poor query plans: either no parallelism on the populate, or missing statistics leading to suboptimal plans.
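
    For illustration (names are made up), the usual alternative looks like this - a temp table's populating SELECT can get a parallel plan, and indexing it gives the optimizer real statistics to work with, neither of which you get with a table variable:

    CREATE TABLE #work
    (
        AccountID INT,
        Date      DATETIME,
        Amount    MONEY
    )

    INSERT INTO #work (AccountID, Date, Amount) --this populate CAN get a parallel plan
    SELECT AccountID, Date, Amount
      FROM dbo.JBMTestMerry

    CREATE CLUSTERED INDEX cix_work ON #work (AccountID, Date) --and statistics come with it

    DROP TABLE #work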

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Off Topic

    Hey Matt - I got your remote "Hello" at the Raleigh Code Camp today. Some guy came up to me and said "Hey, you're TheSQLGuru (had that on my name tag 😀 ). Matt says to tell you Hello". Boy, did you miss a good event too. We had a televised luncheon (for dnrTV), lots of swag, great topics (SIX tracks!), and an unbelievable 4 Microsoft RDs and 12 MVPs among the presenters.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Heh... dang... thanks Kevin! I should'a looked in BOL...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • TheSQLGuru (2/16/2008)


    Off Topic

    Hey Matt - I got your remote "Hello" at the Raleigh Code Camp today. Some guy came up to me and said "Hey, you're TheSQLGuru (had that on my name tag 😀 ). Matt says to tell you Hello". Boy, did you miss a good event too. We had a televised luncheon (for dnrTV), lots of swag, great topics (SIX tracks!), and an unbelievable 4 Microsoft RDs and 12 MVPs among the presenters.

    Bahahaahahahaa! That's my recruiter, who I have been reintroducing to .NET and SQL Server. That guy has a pair of steel-tipped ones - he can just walk up to anyone and start up a conversation. He had asked whose presentations I thought he should stop in on, so I told him the ones I wanted to see (a couple of yours, Andy's (on SSIS), Miguel's (on what the future of contracting looks like), and Diane's (on LINQ)). Your internet handle came up....

    It was a REALLY hard one to pass up....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Jeff Moden (2/16/2008)


    Matt Miller (2/15/2008)


    I suppose you're going to tell me I'm getting to be predictable now.... :hehe:

    There's the obvious approach - emulating a SQL-style cursor. But the point is - .NET is so much smarter about loops (as in - process and store the results in memory, then start streaming them back) that it would be a shame not to give that a whirl and see if we can't give the old cursor method a boot in the pants. I still don't think there's a shot in hell it will beat the update method, but an honorable second... I'm not counting that out yet.

    Predictable? I guess you could call it that... I was thinking more like "reliable" 🙂

    I agree... I don't believe that anything, including the best-written CLR, will ever be able to beat the update method, but I do think it can be written to be as fast as a firehose cursor :w00t:... heh, of course, as soon as I say that, someone will find a way (probably you). 😀

    I'm not sure, if I were you, that I'd waste the time writing the CLR for any reason other than to have an information point where one could say, "No, we already tried that and it wasn't as fast". 😉

    Well - it might just be egg on the face time....for us both. I just built a running totals CLR Table-valued function.

    What if I told you it executes in 13 seconds? No temp table needed either - it only requires a covering non-clustered index... Meaning - if you factor in building a temp table, etc... it might actually be FASTER than the update process. Of course - this all happens in memory, so I'm thinking it's murder on memory.

    I knew it could loop its a$$ off, but wow. I didn't have enough faith in it...

    I need to go dig out your solution and set these up against each other. Once I do - I will post the CLR code and the results.

    It's totally not reusable right now (meaning - the column names and table name are hardcoded, so I can't just pass it parameters and have it just give me the stuff), but - nothing says that it can't be improved on.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Okay - results time....:)

    First - the test scenario setup. Since the "update" method requires a new table and a clustered index to do its thing, it's necessary to throw in the creation of the temp table and clustered index. In the same way - since the CLR function requires an index on the "base" table, the time required to build this will be counted "against" the CLR function.

    The base table is built per the "Merry-go-Round" scenario and is used for both tests.

    --===== Create and populate a 1,000,000 row test table 100 rows at a time to
    --      simulate inserting multiple sets of rows.
    --      Column "RowNum" is an IDENTITY and has a range of 1 to 1,000,000 unique numbers
    --      Column "AccountID" has a range of 1 to 50,000 non-unique numbers
    --      Column "Amount" has a range of 0.0000 to +/- 99.9900 non-unique numbers
    --      Column "Date" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    --===== If the test table already exists, drop it
    SET NOCOUNT ON

    IF OBJECT_ID('dbo.JBMTestMerry','U') IS NOT NULL
        DROP TABLE dbo.JBMTestMerry
    GO

    --===== Create the test table
    CREATE TABLE dbo.JBMTestMerry
    (
        RowNum    INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
        AccountID INT NULL,
        Amount    MONEY NULL,
        Date      DATETIME NULL,
        RunBal    MONEY NULL,
        GrpBal    MONEY NULL,
        RunCnt    INT NULL,
        GrpCnt    INT NULL
    )

    --===== Add the "sorting index" to the table
    CREATE NONCLUSTERED INDEX IX_JBMTestMerry_AccountID_Date --not clustered for the "Merry-go-Round" test
        ON dbo.JBMTestMerry (AccountID, Date)

    --===== Build the table 100 rows at a time to "mix things up"
    DECLARE @Counter INT
    SET @Counter = 0
    WHILE @Counter < 1000000
    BEGIN
        --===== Add 100 rows to the test table
        INSERT INTO dbo.JBMTestMerry (AccountID, Amount, Date)
        SELECT TOP 100
               AccountID = ABS(CHECKSUM(NEWID()))%50000+1,
               Amount    = CAST(CHECKSUM(NEWID())%10000 /100.0 AS MONEY),
               Date      = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME)
          FROM Master.dbo.SysColumns t1
         CROSS JOIN Master.dbo.SysColumns t2

        --===== Increment the counter
        SET @Counter = @Counter + 100
    END

    The CLR function:

    Imports System
    Imports System.Collections
    Imports System.Data
    Imports System.Data.SqlClient
    Imports System.Data.SqlTypes
    Imports Microsoft.SqlServer.Server
    Imports Microsoft.VisualBasic
    Imports System.Runtime.InteropServices

    Partial Public Class UserDefinedFunctions

        'Holds one row of the running-total result set
        Private Class jbmrunbal
            Private _rowID As SqlInt32
            Private _accountid As SqlInt32
            Private _date As DateTime
            Private _amt As SqlMoney
            Private _rb As SqlMoney

            Public Property RowID() As SqlInt32
                Get
                    Return _rowID
                End Get
                Set(ByVal value As SqlInt32)
                    _rowID = value
                End Set
            End Property

            Public Property AccountID() As SqlInt32
                Get
                    Return _accountid
                End Get
                Set(ByVal value As SqlInt32)
                    _accountid = value
                End Set
            End Property

            Public Property DateCol() As DateTime
                Get
                    Return _date
                End Get
                Set(ByVal value As DateTime)
                    _date = value
                End Set
            End Property

            Public Property amount() As SqlMoney
                Get
                    Return _amt
                End Get
                Set(ByVal value As SqlMoney)
                    _amt = value
                End Set
            End Property

            Public Property running() As SqlMoney
                Get
                    Return _rb
                End Get
                Set(ByVal value As SqlMoney)
                    _rb = value
                End Set
            End Property
        End Class

        <Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName:="FillRowRunTot", DataAccess:=DataAccessKind.Read, IsDeterministic:=True, IsPrecise:=True, TableDefinition:="RowNum int, AccountID int, date datetime, amount money, runbal money")> _
        Public Shared Function RunningTotal() As IEnumerable
            Dim cmd As SqlCommand
            Dim rdr As SqlDataReader
            Dim jm As New Collection
            Dim jbmr As jbmrunbal
            Dim prevacct As SqlInt32 = 0
            Dim runbal As SqlMoney = 0

            Using conn As New SqlConnection("context connection=true")
                conn.Open()
                cmd = New SqlCommand("SELECT rownum, accountid, date, amount FROM jbmtestmerry ORDER BY accountid, date", conn)
                rdr = cmd.ExecuteReader()
                While (rdr.Read())
                    jbmr = New jbmrunbal
                    jbmr.RowID = rdr.GetInt32(0)
                    jbmr.AccountID = rdr.GetInt32(1)
                    jbmr.DateCol = rdr.GetDateTime(2)
                    jbmr.amount = rdr.GetSqlMoney(3)
                    'Restart the running balance whenever the account changes
                    If prevacct = jbmr.AccountID Then
                        runbal = runbal + jbmr.amount
                    Else
                        runbal = jbmr.amount
                    End If
                    jbmr.running = runbal
                    prevacct = jbmr.AccountID
                    jm.Add(jbmr)
                End While
            End Using

            Return jm
        End Function

        'Called once per collection item to stream the buffered rows back to the engine
        Public Shared Sub FillRowRunTot(ByVal obj As Object, <Out()> ByRef RowID As SqlInt32, <Out()> ByRef AccountID As SqlInt32, <Out()> ByRef DateCol As SqlDateTime, <Out()> ByRef Amount As SqlMoney, <Out()> ByRef Runbal As SqlMoney)
            Dim j As jbmrunbal = CType(obj, jbmrunbal)
            With j
                RowID = .RowID
                AccountID = .AccountID
                DateCol = .DateCol
                Amount = .amount
                Runbal = .running
            End With
        End Sub

    End Class

    So - this leads to the test script:

    drop table jbmtest

    declare @g datetime
    --start the clock
    select @g = getdate()

    select * into JBMTest from jbmtestMerry

    create unique clustered index uci_jbmtest on jbmtest (accountID, date)

    select 'Update method (build table + index)', datediff(ms, @g, getdate())
    go

    --===== Declare the variables for the "Code Basis"
    declare @g datetime
    select @g = getdate()

    DECLARE @PrevRunBal MONEY
    SET @PrevRunBal = 0
    DECLARE @PrevAcctID INT
    SET @PrevAcctID = 0

    --===== Apply what we know to the "Code Basis"
    UPDATE dbo.JBMTest
       SET @PrevRunBal = RunBal = CASE
                                     WHEN AccountID = @PrevAcctID
                                     THEN @PrevRunBal + Amount
                                     ELSE Amount --restarts the total at "0 + current amount"
                                  END,
           @PrevAcctID = AccountID
      FROM dbo.JBMTest WITH (INDEX(uci_jbmtest), TABLOCKX)

    select 'Update method (update)', datediff(ms, @g, getdate())

    --===== Display the results in the correct order
    SELECT TOP 100 *
      FROM dbo.JBMTest
     ORDER BY AccountID, Date
    go

    --dbcc freeproccache
    go

    drop table memtest
    go

    declare @g datetime
    --start the clock
    select @g = getdate()

    create index ix_jbmtestmerry on jbmtestmerry (accountid, date) include (amount)

    select * into memtest from
    (select * from dbo.RunningTotal()) t

    drop index jbmtestmerry.ix_jbmtestmerry --drop the temporary index so as to not "leave anything behind"

    select 'CLR method', datediff(ms, @g, getdate())

    --Verification - makes sure that they match
    SELECT TOP 100 *
      FROM dbo.memtest
     ORDER BY AccountID, Date

    Just for a point of comparison - I also threw in a version of Adam Machanic's "Cursor version" to boot:

    declare @g datetime
    select @g = getdate()

    create index ix_jbmtestmerry on jbmtestmerry (accountid, date) include (amount)

    DECLARE RunningTotalCursor
    CURSOR LOCAL FAST_FORWARD FOR
    SELECT AccountID, date, Amount
      FROM JBMTestMerry
     ORDER BY AccountID, date

    OPEN RunningTotalCursor

    DECLARE @prevaccount INT
    SET @prevaccount = 0
    DECLARE @AccountID INT
    DECLARE @date DATETIME
    DECLARE @Amount MONEY
    DECLARE @RunningTotal MONEY
    SET @RunningTotal = 0

    DECLARE @Results TABLE
    (
        AccountID INT NOT NULL,
        date DATETIME,
        Amount MONEY,
        RunningTotal MONEY,
        PRIMARY KEY (accountid, date)
    )

    FETCH NEXT FROM RunningTotalCursor
    INTO @AccountID, @date, @Amount

    WHILE @@FETCH_STATUS = 0
    BEGIN
        --restart the total at each account break
        SET @RunningTotal = CASE WHEN @prevaccount = @AccountID THEN @RunningTotal ELSE 0 END + @Amount
        SET @prevaccount = @AccountID --remember the current account for the next row's comparison

        INSERT @Results
        VALUES (@AccountID, @date, @Amount, @RunningTotal)

        FETCH NEXT FROM RunningTotalCursor
        INTO @AccountID, @date, @Amount
    END

    CLOSE RunningTotalCursor
    DEALLOCATE RunningTotalCursor

    drop index jbmtestmerry.ix_jbmtestmerry --drop the temporary index so as to not "leave anything behind"

    select 'Cursor method', datediff(ms, @g, getdate())

    SELECT TOP 100 *
      FROM @Results
     ORDER BY AccountID, date

    And now - with no further delay.... THE RESULTS (picked what looks to be the "middle of the road" results):

    Update method: 6936 ms (create temp table and index) + 4060 ms (update) = 10996 ms

    CLR: 15032 ms (add 1267 ms if you want a clustered index on the final temp table).

    Adam Machanic's Cursor: 50196 ms

    So - the "update" method is still safe in its first place during most runs. But CLR does seem to give it a run for its money, and does seem to guarantee Order, etc... It's actually close enough that once out of 20 runs, the CLR method actually did beat the update method once.

    Both blows the doors off of the Cursor method.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
