Solving the "Running Total" & "Ordinal Rank" Problems in SS 2k/2k5

  • yeah, but we aren't talking about _small_ performance gains here... more like bringing a 6+ hour process that nails the server down to under a minute... with negligible load.

    I would also point out that the quote about shared scans seemed to indicate it applied to full table scans. Given that this would be a clustered index scan, does the point still apply?

  • Yes - a clustered index scan IS literally a full table scan.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on Google's mail service

  • Sir Slicendice (1/31/2008)


    No amount of testing and adding hints can ever make up for using functionality that is explicitly unsupported. What happened to defensive programming? Are we really willing to throw that all away for a speed improvement? Getting occasionally wrong results back really quickly isn't really a win....

    I'm just curious - what is *specifically unsupported* in what is being offered? No - really.

    As to the performance gains, well - I still remember spending weekends babysitting a set of workstations that were slowly grinding through the processing of our year-end reporting, just in case one of them "died" and needed restarting. And then - one year, databases and aggregate functions come along, and *poof* - report is done in 2 hours.

    If I can do that same kind of performance gain, and I can test that the results are reproducible, predictable, and correct, and take a process that runs for 12 hours and finish it in 25 minutes, why wouldn't I?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?

  • FYI I've ONCE come across a cursor using MS's FOR UPDATE enhancement which was WAY faster than any other update method I tried (all 'best practices')... can't remember the details, but I'm pretty certain triangular-join RBAR operations were involved.

    But the procedure was a configuration s/proc, i.e. only used now & again... and for sporadic/admin-type operations it can be prudent.

    So is a data warehouse nightly build sporadic enough? Ahhhh....well....


    Regards
    Andy Davies

  • Matt Miller (1/31/2008)


    Sir Slicendice (1/31/2008)


    No amount of testing and adding hints can ever make up for using functionality that is explicitly unsupported. What happened to defensive programming? Are we really willing to throw that all away for a speed improvement? Getting occasionally wrong results back really quickly isn't really a win....

    I'm just curious - what is *specifically unsupported* in what is being offered? No - really.

    The ONLY way to guarantee order is with ORDER BY. Microsoft explicitly does not guarantee row order when using a clustered index. Most (99.99+%) of the time, the rows will come back in the order you expect. But there are some small number of cases where they will not (others here can explain these cases better than I can), and in those cases, this procedure will generate incorrect data.

    That's a show-stopper for me.
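    For comparison, the fully supported way to get a running total in 2000/2005, where the order comes from the predicate rather than from scan order, is the classic triangular join. A minimal sketch follows (the table and column names are invented, and it assumes TransactionDate is unique) - just be warned that its cost grows roughly with the square of the row count:

    --===== Order-guaranteed running total via correlated subquery (slow on big tables)
    SELECT t.TransactionDate,
           t.Amount,
           RunningTotal = (SELECT SUM(t2.Amount)
                             FROM dbo.TransactionDetail t2
                            WHERE t2.TransactionDate <= t.TransactionDate)
      FROM dbo.TransactionDetail t
     ORDER BY t.TransactionDate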

  • Andy (1/31/2008)


    FYI I've ONCE come across a cursor using MS's FOR UPDATE enhancement which was WAY faster than any other update method I tried (all 'best practices')... can't remember the details, but I'm pretty certain triangular-join RBAR operations were involved.

    But the procedure was a configuration s/proc, i.e. only used now & again... and for sporadic/admin-type operations it can be prudent.

    So is a data warehouse nightly build sporadic enough? Ahhhh....well....

    The problem is the 'predictable' part: SQL is a set-oriented language that says row order is undefined (other than in situations where ORDER BY applies); therefore, you are relying on an explicitly unsupported implementation artifact that is quite unpredictable at larger scale. It's an interesting detail, and it gives clues to how the engine is implemented, but to rely on this sort of thing goes against all principles of defensive programming. Didn't we learn anything from Y2K and all the word-size evolutions?

    The speedup is irrelevant if your results can be wrong - and note that if you want to use some sort of probabilistic argument to claim your results will be right to some acceptable tolerance, you would need knowledge of implementation details that you don't have....


    The End.

  • I will derail this conversation and give it another spin.

    Whether or not something is supported by Microsoft does not make it safer.

    It is all about probability and bugs.

    What is the likelihood of being in a situation where the author's article sample will yield wrong results?

    OK, I have a couple of examples where MS-supported stuff is plain wrong and will have to wait a long, long time before any (private) fix. They are completely unrelated to this article, but still: this article's sample may fail in a few cases, while some supported features will fail much more frequently. So to be safe, you have to test, test, test, in every possible configuration and whenever you apply a patch - which is not your job if you are a developer; it is MS's job. I would simply expect that if MS takes something to RTM and supports it, you should reasonably be able to rely on it (after your own testing - check first, trust after) and be AWARE of MS patches.

    Example 1: Excel SP3 breaks existing queries that use a multi-select page filter; I am waiting for a hotfix.

    Example 2: the SSIS data provider "Native SQL Client" does not work reliably in every case, and there is no acknowledgment of the issue so far; see this example (not unique, there are more) for details:

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2755462&SiteID=1

    To conclude, what is the statistical likelihood of getting wrong results when using the "unsupported" code provided by Jeff, as opposed to the likelihood of getting wrong results by using any of the "supported" MS features?

    I would say it all comes down to the simple fact that you have to test it for your particular application and environment, and keep an eye on it. One would assume that sticking to supported code would decrease your chances of getting hit - OK, but when you are hit, with 400 users, on a supported feature that takes 6 months to fix, that statistic fades away.

    You can also run into trouble with supported features just by using 64-bit rather than 32-bit. You just have to know whether the feature, supported or not, works in your environment.

    A last example is queries against system tables, which are not supported and work in one version but not in another. I would say, if you have to use these, just be aware that, supported or not, they may break at the next patch or hardware upgrade. It is still less work to fix this than to learn a whole new programming scheme involving calling APIs and using SOAP or web services, where no BOL entries or examples exist.
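    For instance, the row-count query everyone used to run against the system tables versus its 2005 replacement - a sketch only, with an invented table name:

    --===== SQL 2000 era: legacy system table (layout can change between versions)
    SELECT rows
      FROM sysindexes
     WHERE id = OBJECT_ID('dbo.MyTable')
       AND indid IN (0, 1)

    --===== SQL 2005: the supported catalog/DMV replacement
    SELECT SUM(row_count) AS TotalRows
      FROM sys.dm_db_partition_stats
     WHERE object_id = OBJECT_ID('dbo.MyTable')
       AND index_id IN (0, 1)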

    Sorry for the digression, I could not resist adding this grain of salt to the discussion :hehe:

    BI Guy

  • Christian Buettner (1/31/2008)


    Hello Jeff,

    very interesting article.

    Somehow it throws all my beliefs into the trash bin:

    1. The update method with variables is not explicitly supported by Microsoft

    2. The clustered index does not guarantee ordered results

    3. Order By in derived tables does not guarantee ordered results.

    Your "evidence" does not really convince me. It could be just a lucky day (or a lot of them) ;).

    If anyone has some "official" documentation on these 3 topics, that would be great!

    Btw: Did anyone try the "Order By" Solution?

    It returns wrong results for me, but maybe I did not copy the code correctly (somehow the linefeeds disappear with copy/paste).

    First, I really thank all of you folks that took the time to respond to this article whether you agree or not. That type of response is worth much more than the article itself. Thank you one and all.

    I quoted this single reply because it seems to embody the concerns in most of the posts... (Thanks for the post, Christian... not picking on you... your post just seems to be one of the shorter ones that still gets all of the points across).

    So far as Item 1 goes, a large number of people have already cited that the UPDATE @variable = columnname = expression/formula syntax is, in fact, both supported and documented as such in Books Online under "UPDATE". No... it certainly isn't ANSI code... and it wasn't meant to be. This is, after all, an article on SQL Server.
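    For anyone who hasn't seen that form before, here's a minimal sketch of the syntax in question (the table, columns, and index name are made up just for illustration):

    --===== Sketch of the documented SET @variable = column = expression form.
    --      Assumes a dbo.SomeTable(Amount INT, RunningTotal INT) with a
    --      clustered index named IX_SomeTable.
    DECLARE @RunningTotal INT
    SET @RunningTotal = 0

    UPDATE dbo.SomeTable
       SET @RunningTotal = RunningTotal = Amount + @RunningTotal
      FROM dbo.SomeTable WITH (INDEX(IX_SomeTable), TABLOCKX)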

    So far as Item 2 goes, the clustered index does, in fact, guarantee the order of the update without an ORDER BY if the correct clustered index hint is used, and that is also documented in Books Online in several areas. In fact, a clustered index guarantees the physical order of the data... please refer to "clustered indexes, architecture" where it states, and I quote...

    "Clustered tables are tables that have a clustered index.

    The data rows are stored in order based on the clustered index key. The index is implemented as a B-tree index structure that supports fast retrieval of the rows based on their clustered index key values. The pages in each level of the index, including the data pages in the leaf level, are linked in a doubly-linked list, but navigation from one level to another is done using key values."

    Also...

    "SQL Server also supports up to 249 nonclustered indexes on each table or indexed view. The nonclustered indexes have a B-tree index structure similar to the one in clustered indexes. The difference is that nonclustered indexes have no effect on the order of the data rows. Clustered tables and indexed views keep their data rows in order based on the clustered index key. The collection of data pages for a heap is not affected if nonclustered indexes are defined for the table. The data pages remain in a heap unless a clustered index is defined."

    I'm thinking that you're safe on the update as I've posted it... and consider the test code... it built a million-row table 100 rows at a time using highly randomized data. Even the worst users in the world would have a tougher time of fragging a clustered index that much between index defrags. Yet the code still works. Sure, feel free to draw your own conclusions, but unless you have some definitive code to prove the contrary, I wouldn't be so quick to toss away such a powerful tool based on the speculation of others.

    For item 3... again, show me an instance where it doesn't work on an update that does a running total as I've depicted. Also, keep in mind that I actually don't recommend using that method to do the running total... I actually recommend using the clustered index method even if it means copying the data to a temp table just to get the right clustered index.
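    In sketch form, that recommendation looks something like this (the source table and column names are invented for illustration):

    --===== Copy the data to a temp table just to get the right clustered index
    SELECT TransactionDate,
           Amount,
           RunningTotal = CAST(0 AS MONEY)
      INTO #Work
      FROM dbo.TransactionDetail

    CREATE CLUSTERED INDEX IX_Work ON #Work (TransactionDate)

    --===== Do the running total in clustered index order
    DECLARE @RunningTotal MONEY
    SET @RunningTotal = 0

    UPDATE #Work
       SET @RunningTotal = RunningTotal = Amount + @RunningTotal
      FROM #Work WITH (INDEX(IX_Work), TABLOCKX)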

    So far as spawning parallelism goes, I've executed this code on both a 4-processor and an 8-processor box during quiet times... I can't make it spawn a parallel update as it's written in the article. I see much speculation on what it might do... but I can't make it spawn parallelism, and I think it may never spawn parallelism simply because of the hints used to override the optimizer.

    If you know how to force the parallelism that is being speculated on, please post the method so we can check it out for ourselves.
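    In the meantime, anyone who wants to rule parallelism out entirely can add a MAXDOP hint as a belt-and-braces measure - a sketch only, continuing the hypothetical #Work example above:

    --===== Same update with parallelism explicitly disabled
    --      (continues the previous sketch: #Work and @RunningTotal already exist)
    SET @RunningTotal = 0

    UPDATE #Work
       SET @RunningTotal = RunningTotal = Amount + @RunningTotal
      FROM #Work WITH (INDEX(IX_Work), TABLOCKX)
    OPTION (MAXDOP 1)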

    And, that should be the general theme for all those that say the methods posted in the article may not work... the article went to great lengths to demonstrate where things can go wrong and how to make it right through the use of demonstrable code. If you know something won't work, please post the data generation and demonstrable method code that proves it won't work. That way, we all learn something and I don't have to pretend I'm from Missouri 😉

    Again, whether you're pro or con about the methods in the article, thanks for your thoughts and your interest in the subject. :w00t:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ALZDBA (1/31/2008)


    Very good article, nice test code. :w00t::cool:

    There may be some testing needed regarding parallelism; maybe for now just add an OPTION (MAXDOP 1).

    The only remarks I would add to the conclusion are:

    - Keep in mind that by design there is no order in a set-based approach.

    - For now it works, and it is the fastest proprietary solution to the problem.

    Remember the views with an ORDER BY clause that suddenly no longer externalized the order in SQL 2005, but worked so well in SQL 2000.

    Thank you for sharing the knowledge.:smooooth:

    Always a pleasure, Johan... Thank you for the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ian Yates (1/31/2008)


    I have to agree with the naysayers unfortunately. It's great at the moment and really does seem to work (I haven't seen it fail yet either). I was always suspicious and slightly annoyed when people used the select top 100 percent with an order by in a view - dirty hack in my opinion. This could go the same way. Having said that, perhaps we could get someone from the MS Query Engine team to elaborate on

    * The update syntax of @variable = col = newValue

    * What happens if you have multiple of these in the same update, e.g.

    @variable = col = col + @variable

    col2 = col + @variable

    Does col2 effectively get col + col + @variable or just col + @variable?

    * Forced ordering.

    I'm going to run some quick tests on my sql 2k5 installation to see if there are any quirks (I'd be very surprised if I found something that was overlooked by the likes of Jeff though 😛 )

    Thanks for the mighty fine compliment, Ian. Thank you...

    Yes, Col2 gets the value of col + col + @variable in the case you cited. As always, here's the code to prove it...

    --===== Create and populate the test table
    CREATE TABLE #SomeTable
            (
            RowNum INT IDENTITY(1,1),
            Col    INT,
            Col2   INT,
            CONSTRAINT PK_SomeTable PRIMARY KEY CLUSTERED (RowNum)
            )

    INSERT INTO #SomeTable (Col)
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3

    --===== Display the original content of the table
    SELECT * FROM #SomeTable

    --===== Do the cascaded variable update that Ian pointed out
    DECLARE @Variable INT
    SET @Variable = 10

    UPDATE #SomeTable
       SET @Variable = Col = Col + @Variable,
           Col2 = Col + @Variable
      FROM #SomeTable WITH (INDEX(PK_SomeTable), TABLOCKX)

    --===== Display the results
    SELECT * FROM #SomeTable

    DROP TABLE #SomeTable

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    There's no disagreement that SQL Server stores rows for tables that have a clustered index in index order (more particularly, the data for the table is effectively stored in the index). However, this information is provided in the spirit of guiding decisions that can have a very large impact on performance -- it is a discussion of the storage implementation of the data.

    However, the engine has no obligation to scan the table in index order, as there are no language constructs which impose an ordering on the execution order of a scan. The fact that it happens to work when you test it in one or a dozen different configurations is quite irrelevant; you will always be rolling the dice with each query execution, as you have no knowledge of the internals of how/why the engine is choosing to process the rows in your desired order.

    Think of this like various programming tricks through the years; e.g.: writing to a null pointer in C. We were never supposed to do this; the results of *(0) = x; were always undefined on all machines. But on a lot of machines you could get away with this, and it was (generally) faster than checking if the pointer was null, especially if you were doing a lot of these. Then one day, it stopped working and generated a machine check....

    It is interesting to think of the difference between a defined feature that is broken versus an undefined feature that works; that goes to show that defined behavior and testing are both necessary but not sufficient to prove the absence of defects. Users of financial reporting, however, are unlikely to be comfortable with some sort of probabilistic argument for how reliable your report is; OTOH, if your users are high-energy physics people, they might enjoy that little diversion, but then they will want a confidence interval which you won't be able to supply....

    -frank

    [Apologies for quoting the wrong post earlier.]


    The End.

  • However, the engine has no obligation to scan the table in index order, as there are no language constructs which impose an ordering on the execution order of a scan.

    Sure it does, Frank... it's a clustered index, and scanning the index inherently means it's also scanning the table (I believe someone else also said that on this thread). And, yeah, the WITH INDEX hint forces the optimizer to use that clustered index... That's also documented quite well in Books Online. It's not a kludge and it will always work.

    But, yes, with most ideas that are way outside the box (heh, or, off the wall, if you'd rather), it's difficult to trust the new idea, just as you said. That's why I also included some handy verification code so that you can test the result in production code. Even a banker is going to like that. And, the combination of the update and the verification code are still going to be faster than the cursor method (or any other method, for that matter).
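    For anyone rolling their own check, the idea is simply to recompute the totals by a method whose order is beyond dispute and compare. This isn't the article's actual verification code, just the shape of the idea, reusing the hypothetical #Work table sketched earlier (and assuming TransactionDate is unique); zero rows back means the two methods agree:

    --===== Cross-check the fast update against the order-guaranteed triangular join
    SELECT COUNT(*) AS BadRows
      FROM #Work w
     WHERE w.RunningTotal <>
           (SELECT SUM(w2.Amount)
              FROM #Work w2
             WHERE w2.TransactionDate <= w.TransactionDate)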

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • James Goodwin (1/31/2008)


    Did you test using the Clustered Index with the ORDER BY subquery? It seems to me that the ORDER BY would enforce the order, but since you are using a Clustered Index it wouldn't cause much overhead unless you were going to get bit by a disordered resultset anyway.

    --

    JimFive

    No... I didn't test for that and the article got a wee bit long anyway. Was really trying to show that either an Order By (regardless of what the Clustered Index was) would work or that a Clustered Index on the correct columns would work.

    Thanks for the feedback and the suggestion, Jim. I'll give it a try and post the results...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Philippe Cand (1/31/2008)


    I will derail this conversation and give it another spin.

    Whether or not something is supported by Microsoft does not make it safer.

    It is all about probability and bugs.

    What is the likelihood of being in a situation where the author's article sample will yield wrong results?

    OK, I have a couple of examples where MS-supported stuff is plain wrong and will have to wait a long, long time before any (private) fix. They are completely unrelated to this article, but still: this article's sample may fail in a few cases, while some supported features will fail much more frequently. So to be safe, you have to test, test, test, in every possible configuration and whenever you apply a patch - which is not your job if you are a developer; it is MS's job. I would simply expect that if MS takes something to RTM and supports it, you should reasonably be able to rely on it (after your own testing - check first, trust after) and be AWARE of MS patches.

    Example 1: Excel SP3 breaks existing queries that use a multi-select page filter; I am waiting for a hotfix.

    Example 2: the SSIS data provider "Native SQL Client" does not work reliably in every case, and there is no acknowledgment of the issue so far; see this example (not unique, there are more) for details:

    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2755462&SiteID=1

    To conclude, what is the statistical likelihood of getting wrong results when using the "unsupported" code provided by Jeff, as opposed to the likelihood of getting wrong results by using any of the "supported" MS features?

    I would say it all comes down to the simple fact that you have to test it for your particular application and environment, and keep an eye on it. One would assume that sticking to supported code would decrease your chances of getting hit - OK, but when you are hit, with 400 users, on a supported feature that takes 6 months to fix, that statistic fades away.

    You can also run into trouble with supported features just by using 64-bit rather than 32-bit. You just have to know whether the feature, supported or not, works in your environment.

    A last example is queries against system tables, which are not supported and work in one version but not in another. I would say, if you have to use these, just be aware that, supported or not, they may break at the next patch or hardware upgrade. It is still less work to fix this than to learn a whole new programming scheme involving calling APIs and using SOAP or web services, where no BOL entries or examples exist.

    Sorry for the digression, I could not resist adding this grain of salt to the discussion :hehe:

    Actually, Philippe, that's a great diversion and a position that I support...

    Folks, there's absolutely no guarantee that MS will keep a particular method, function, stored proc, or functionality the same. I'll give you one of the best examples of something that I absolutely loved that they took away in 2k5... Query Analyzer and its wonderful {f4} key! Both were supported by MS to the max... but it's gone now.

    Sure, sure... I agree... you can't go nuts and expect your code to survive even the next Service Pack... but everything you use is subject to change by MS, supported or not, with or without warning. Just look at what service pack 4 did to a lot of 2k users for performance...

    Still, I haven't used anything that's undocumented. The action of the UPDATE is documented in BOL. The action of Clustered Indexes is well documented in BOL. The action of WITH INDEX is well documented in BOL. Everything that was used in the article is, in fact, well documented in BOL.

    And, if you still don't trust it, I even provided validation code... the combination of the update and the validation code still runs faster than any other method.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RichardB (1/31/2008)


    Pure Awesomeness!

    Run over 1.4 million records on a 6-level 'hierarchy' in about 10 s - compared to the 'set' solution that takes 1.5 hrs...

    :exclamationmark:

    Oh yes, and validation against the previous values indicated no differences. Mindblowing.

    Thanks for possibly the single most impressive bit of code I have seen on here! :smooooth:

    Rich

    Man... thank you for the feedback, Richard! There's nothing more satisfying than to see someone compare something like this against existing methods. Thanks again!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
