Solving the Running Total and Ordinal Rank Problems (Rewritten)

  • Matt,

    Where did I say that a running total MUST be deterministic? (Your capitalization)

    It seems good practice to me that the 'order by' part of a ranking function should produce a deterministic order. I think I qualified my previous remarks well enough to cover cases where such ordering might not be required. Even so, I think I would still prefer a method which reliably assigns the same running totals to each row over one that might be different on subsequent runs. You may feel differently, and that's fine.

    The discussion originally centred around Hugo's posted method, which quite sensibly does use a deterministic order by. The criticism was that the method would fail if the order by produced duplicates. The ROW_NUMBER ranking function was proposed as being a simple fix in that hypothetical situation. That's the bit I disagree with.

    I would (slightly) prefer ROW_NUMBER over RANK myself - but always with a reliable order.

    Paul

  • Alexander Kuznetsov-291390 (11/16/2009)


    Matt Miller (#4) (11/16/2009)[

    Not that I care to get in the middle of a good brawl, but where is the presumption that a running total MUST be deterministic come from?

    I am not sure that a running total MUST be deterministic, but I have simply never seen otherwise. I worked quite a lot with cases like inventory (how much stuff do we have), or insurance (how much money counts towards the deductible), etc. In every case I have ever worked with, running totals were always deterministic.

    Can you give us a real life example when we care about running totals, and they can be non-deterministic?

    Sure - we used to get uploads of data from small computer store sales by day. The data was pretty basic, essentially store #, date (no time component), and type of sale and amount of the transaction (the rest was kept in the local store's POS and somehow couldn't be uploaded without messing with the programming, etc.). For some reason they enjoyed getting this in "running total form".

    In this case - you end up with 20-30 items in the same buckets, and the order WITHIN the subgroup was, well, not important.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Paul White (11/16/2009)


    Matt,

    Where did I say that a running total MUST be deterministic? (Your capitalization)

    It seems good practice to me that the 'order by' part of a ranking function should produce a deterministic order. I think I qualified my previous remarks well enough to cover cases where such ordering might not be required. Even so, I think I would still prefer a method which reliably assigns the same running totals to each row over one that might be different on subsequent runs. You may feel differently, and that's fine.

    The discussion originally centred around Hugo's posted method, which quite sensibly does use a deterministic order by. The criticism was that the method would fail if the order by produced duplicates. The ROW_NUMBER ranking function was proposed as being a simple fix in that hypothetical situation. That's the bit I disagree with.

    I would (slightly) prefer ROW_NUMBER over RANK myself - but always with a reliable order.

    Paul

    I understand. And I came off a bit more agressive than I wanted to (sorry about that).

    I think there's a disconnect. The previous poster wasn't seeing the ROW_NUMBER as a fix to the deterministic part at all: it's a fix to the "math" you would get if you happened to have less than ideal ordering. The running total itself wouldn't be right within a given run, never mind reproducible.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/17/2009)


    I understand. And I came off a bit more agressive than I wanted to (sorry about that).

    No worries 🙂

    Matt Miller (#4) (11/17/2009)


    I think there's a disconnect. The previous poster wasn't seeing the ROW_NUMBER as a fix to the deterministic part at all: it's a fix to the "math" you would get if you happened to have less than ideal ordering. The running total itself wouldn't be right within a given run, never mind reproducible.

    Oh I agree - but where would be the fun in being all reasonable about it? 😉

  • Jeff Moden (11/10/2009)


    Alexander Kuznetsov-291390 (11/10/2009)


    These are incredibly smart techniques, but after playing with such approaches I decided that I don't want to put it into production - they are just a little bit too complex and they do not look robust to me. In my book this is one of those cases where denormalization allows for much faster and significantly simpler solutions:

    BTW, Adam Machanic recommends CLR cursors in such cases...

    First, thanks for taking the time to stop by and post your thoughts, Alexander. I really appreciate it.

    I'm not sure why you find the "Quirky Update" a bit too complex nor why you don't think they look robust, but that's OK. Your good method will nicely support ongoing running balance updates on insertion of new data, albeit, in a RBAR manner. I'll add some performance testing using your good method to my list of things to do.

    To be sure, your method is very clever... I like it. I just think it will be a bit slow because of it's RBAR nature.

    And, yes, I know Adam recommends using a cursor for this... so do a lot of other good folks. If you don't trust the "Quirky Update" you can do as I suggested in the article... use verification code to verify it worked correctly or do like Adam and others suggest.... use a cursor and tolerate the performance and resource usage hit on large batches. 😉

    Hey Jeff!

    Truly awesome solution once again! The rewrite is even better than the original.

    Just wanted to pass this little redirection to you. ( sorry - couldn't pass up on the "CLR cursor".) The solution Adam M pushes as a "CLR cursor" really doesn't act or feel at all like a SQL cursor.

    Just for giggles - I took a stab at one:

    Imports System

    Imports System.Data

    Imports System.Data.SqlClient

    Imports System.Data.SqlTypes

    Imports Microsoft.SqlServer.Server

    Partial Public Class StoredProcedures

    <Microsoft.SqlServer.Server.SqlProcedure()> _

    Public Shared Sub RunningTotal()

    Dim currentacct As Long = 0

    Dim fun As Decimal = 0

    'use actual connection strings if you only want to display it/return as a recordset - it's faster

    Dim conn As SqlConnection = New SqlConnection("Data Source=(local);Initial Catalog=TempDB;Integrated Security=yes")

    'use the context connection if you want to dump to a table.

    'Dim conn As SqlConnection = New SqlConnection("context connection=yes")

    conn.Open()

    Dim cmd As SqlCommand = New SqlCommand("set fmtonly off;SET NOCOUNT ON;SELECT AccountID, Date, TransactionDetailID, Amount FROM TransDet with (nolock) ORDER BY AccountID, Date, TransactionDetailID", conn)

    Dim rdr As SqlDataReader = cmd.ExecuteReader()

    'set up the record layout

    Dim col1 As SqlMetaData = New SqlMetaData("AccountID", SqlDbType.Int)

    Dim col2 As SqlMetaData = New SqlMetaData("Date", SqlDbType.DateTime)

    Dim col3 As SqlMetaData = New SqlMetaData("TransactionDetailID", SqlDbType.Int)

    Dim col4 As SqlMetaData = New SqlMetaData("amount", SqlDbType.Decimal, 15, 4)

    Dim col5 As SqlMetaData = New SqlMetaData("ART", SqlDbType.Decimal, 15, 4)

    Dim rec As SqlDataRecord = New SqlDataRecord(New SqlMetaData() {col1, col2, col3, col4, col5})

    SqlContext.Pipe.SendResultsStart(rec) 'start the recordset

    While (rdr.Read())

    rec.SetInt32(0, CInt(rdr.Item(0)))

    rec.SetDateTime(1, CDate(rdr.Item(1)))

    rec.SetInt32(2, CInt(rdr.Item(2)))

    rec.SetDecimal(3, Math.Round(CDec(rdr.Item(3)), 4))

    If CLng(rdr.Item(0)) = currentacct Then

    fun = CDec(rdr.Item(3)) + fun

    Else

    fun = CDec(rdr.Item(3))

    End If

    rec.SetDecimal(4, Math.Round(fun, 4))

    currentacct = CLng(rdr.Item(0))

    SqlContext.Pipe.SendResultsRow(rec)

    End While

    SqlContext.Pipe.SendResultsEnd() 'let go of the recordset

    conn.Close()

    End Sub

    Yes - it's CLR, and processing happens one row at a time, however:

    - because it's streaming data through, it looks, feels and acts like a recordset.

    - it doesn't have to materialize in memory, so its memory footprint isn't bad.

    - it also doesn't require a clustered index (so it can be run from the original table)

    - it doesn't require a working table (though it can certainly dump its results to a table as it goes)

    And here's the kicker - the version I am posting creates the recordset in 14.5 sec ( versus the pseudo cursor's 3 second update). So - it's definitely not touching the pseudo-cursor when updating a table in-place, but it's close when you include building/rebuilding the table that holds the running total.

    I'm still using the pseudo-cursor method most of the time, but the CLR cursor can ALMOST make one change their mind about the infamous "c" word...:)

    Thanks again for a wildly performant solution.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Paul White (11/17/2009)


    Matt Miller (#4) (11/17/2009)


    I understand. And I came off a bit more agressive than I wanted to (sorry about that).

    No worries 🙂

    Matt Miller (#4) (11/17/2009)


    I think there's a disconnect. The previous poster wasn't seeing the ROW_NUMBER as a fix to the deterministic part at all: it's a fix to the "math" you would get if you happened to have less than ideal ordering. The running total itself wouldn't be right within a given run, never mind reproducible.

    Oh I agree - but where would be the fun in being all reasonable about it? 😉

    No fun whatsoever. Also, you are of questionable ancestry.

    But Matt is right. In the event that members of a particular sequence cannot or need not be reliably ordered, ROW_NUMBER ensures the integrity of the running total value, whereas RANK breaks the function. Somewhere I developed an inconvenient preference that my systems fail gracefully rather than catastrophically, thus my suggested fix.

    [Edit: 11/18/09] I thought of something else that might address all issues. Apply both RANK and ROW_NUMBER in the WHERE loop as separate columns. Use ROW_NUMBER to ensure the accuracy of the calculations, but test the value from ROW_NUMBER against the value obtained from RANK to flag any potential ordering issues.

    From SQL2005 and above, I wonder how a persisted computed column would perform. The function that populates the column would only need to find the most recent running total value and apply the transaction amount. In the event that no previous transactions exist, the transaction amount would serve as the running total value. Inserting a low number of rows should perform quite well. The downside is that inserts at the beginning of the series would cause a cascading re-calculation for the table. It seems to me that a good portion of the applications for this type of functionality might disallow back-dated transactions, so even if the performance sucks across the entire table, this might not be an issue.

  • Matt Miller (#4) (11/17/2009)


    Hey Jeff!

    Truly awesome solution once again! The rewrite is even better than the original.

    ...

    Just wanted to pass this little redirection to you. ( sorry - couldn't pass up on the "CLR cursor".) The solution Adam M pushes as a "CLR cursor" really doesn't act or feel at all like a SQL cursor.

    Ah... thanks for the clarification. I was still reminiscing about Adam's first article where he used a T-SQL explicit cursor to correctly beat the pants off a Triangular Join.

    And thanks for the feedback, Matt, and I don't just mean the great compliment. I take quite a slam from almost every direction for pushing an "undocumented" feature I happen to believe in and use... it's nice to get some feedback that isn't seething with fire and brimstone. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Martin Butcher-415798 (11/13/2009)


    Jeff Moden (11/12/2009)


    Ok, Martin... I'm confused... I see that all of your code examples update the "RunningBalance" column... none of the code updates the "RunningTotal" column yet your complaint is that the "RunningTotal" column goes astray according to your thoughts. What is it that updates the "RunningTotal" column?

    And, by the way, it's the "RunningTotal" column that went astray... not the "RunningBalance" column that (according to your code), that the "Quirky Update" correctly calculated. 😉

    Jeff I am not sure what you mean here?

    This is the code I posted and am complaining about. No where does it mention runningbalance.

    The running balance column already existed in my table and if its wrong there are going to be some very unhappy customers out there.

    SET NOCOUNT ON

    --===== Declare the working variables

    DECLARE @PrevAccountID INT

    DECLARE @AccountRunningTotal DECIMAL(14, 3)

    --

    --===== Update the running total for this row using the "Quirky Update" -- and a "Pseudo-cursor"

    UPDATE

    dbo.Ledger

    SET

    @AccountRunningTotal = RunningTotal =

    CASE WHEN AccountID = @PrevAccountID THEN CASE WHEN @AccountRunningTotal > 0

    THEN @AccountRunningTotal + (Credit - Debit)

    ELSE @AccountRunningTotal - (Debit - Credit) END

    ELSE (Credit - Debit)

    END,

    @PrevAccountID = AccountID

    FROM

    dbo.Ledger WITH (TABLOCKX)

    WHERE

    LedgerType = 'Account'

    OPTION

    (MAXDOP 1)

    Martin... I haven't forgotten you on this problem... I just got a little overwhelmed with a bunch of other stuff including some duties at work (not to mention the 2:15 - 2:30 round trip). It's apparent that you have something that works or you wouldn't have the column to compare to and thought you wouldn't mind waiting just a bit more... I'm pretty sure I know what's wrong but, being the data troll that I am, I wanted to show you what's wrong and how to fix it using demonstrable code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • "Martin... I haven't forgotten you on this problem... I just got a little overwhelmed with a bunch of other stuff including some duties at work (not to mention the 2:15 - 2:30 round trip). It's apparent that you have something that works or you wouldn't have the column to compare to and thought you wouldn't mind waiting just a bit more... I'm pretty sure I know what's wrong but, being the data troll that I am, I wanted to show you what's wrong and how to fix it using demonstrable code. "

    --Jeff Moden

    No Problem Jeff

    The topic certainly seems to be diving off in all directions!

    Martin

  • Condorman (11/17/2009)


    From SQL2005 and above, I wonder how a persisted computed column would perform. The function that populates the column would only need to find the most recent running total value and apply the transaction amount. In the event that no previous transactions exist, the transaction amount would serve as the running total value. Inserting a low number of rows should perform quite well. The downside is that inserts at the beginning of the series would cause a cascading re-calculation for the table. It seems to me that a good portion of the applications for this type of functionality might disallow back-dated transactions, so even if the performance sucks across the entire table, this might not be an issue.

    It wouldn't work at all. A computed column based on a function that does data access cannot be persisted. An indexed view wouldn't work either, for the same reason.

  • Condorman (11/17/2009)


    In the event that members of a particular sequence cannot or need not be reliably ordered, ROW_NUMBER ensures the integrity of the running total value, whereas RANK breaks the function. Somewhere I developed an inconvenient preference that my systems fail gracefully rather than catastrophically, thus my suggested fix.

    I'd rather the design prevented the situation arising in the first place 🙂

    So I guess I'm with Alexander Kuznetsov on this one.

  • Paul White (11/18/2009)


    It wouldn't work at all. A computed column based on a function that does data access cannot be persisted. An indexed view wouldn't work either, for the same reason.

    Bah! That's bloody inconvenient...

    Oh well. I still say ROW_NUMBER ensures minimal down-time during recovery. The readers are, of course, directed to choose the method which best fits their needs.

    In any case, it's been a pleasure sparring with you.

  • Same here 🙂

  • Great stuff, Jeff (as always). I hope I have chance to compare some of the alternate methods presented in the thread too...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Condorman (11/14/2009)


    Really interesting discussion guys. I do have one complaint about Hugo's alternative method, though.

    Here in the relevant bit from Hugo's code.

    INSERT INTO @Results(AccountID, Date, TransactionDetailID, Amount,

    RunningTotal, Rnk)

    SELECT AccountID, Date, TransactionDetailID, Amount, Amount,

    RANK() OVER (PARTITION BY AccountID

    ORDER BY Date,

    TransactionDetailID)

    FROM dbo.TransactionDetail;

    [/code]

    The insert into the temp table uses the Rank() function, which the while loop compares against an incrementing counter. Assuming that we can guarantee a unique combination of Date-TransactionDetailID within each AccountID partition, this works as expected. Indeed, the data is such in this proof-of-concept example that there are no problems.

    When this is translated to a real world situation, though, I would strongly encourage ROW_NUMBER() over RANK() for the following reason:

    Remarks

    If two or more rows tie for a rank, each tied rows receives the same rank. For example, if the two top salespeople have the same SalesYTD value, they are both ranked one. The salesperson with the next highest SalesYTD is ranked number three, because there are two rows that are ranked higher. Therefore, the RANK function does not always return consecutive integers.

    http://msdn.microsoft.com/en-us/library/ms176102.aspx

    Hugo's method depends on unique, consecutive integers, and it will fail in the event that an implementation uses a non-unique order by clause. ROW_NUMBER() solves this problem with a minimum of fuss.

    Hi Condorman,

    Thanks for your comment. And my apologies for the much delayed reply. Somehow my subscription to this discussion has disappeared so I was not notified of new replies anymore. I only saw how much discussion has been added since my last appearance here when I returned here for a completely different reason.

    As to the content of your objection, I'd like to point out that I have already taken care of the possible problem caused by duplicates. Please check my code. The order for the requested running totals would be based on ascending Date within each AccountID. And yet, my code uses:

    RANK() OVER (PARTITION BY AccountID

    ORDER BY Date,

    TransactionDetailID)

    Note the emphasis on TransactionDetailID. From the perspective of the business needs for this report, there is no reason at all to include this column in the ORDER BY part. I added it precisely for the reason you indicate - because the (AccountID / Date) combination can have duplicates that would result in non-unique rank numbers. Adding TrasactionDetailID, which is the primary key of the table, guarantees me that there never can be duplicates.

    So yes - in a real-world situation there can be duplicates within the grouping and ordering requirements for the running totals. My code solves this by adding the primary key of the table to the list of ordering columns.

    If you have no primary key in your real-world production table, then this would indeed not work - but in that case, you have bigger concerns than not being able to use a running total method that is both fast and fully documented!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Viewing 15 posts - 91 through 105 (of 307 total)

You must be logged in to reply to this topic. Login to reply