Starting with "WITH"

  • ChrisM@home (11/8/2011)


    The Dixie Flatline (11/7/2011)


    I agree with Chris.

    Because I inadvertantly started (this very entertaining and informative discussion) by making a "suckish" comment, I feel compelled to admit that I have in fact used recursive CTEs in inline table valued functions to drive "loop" logic that feeds a small list of parameters to a query that results in tens of thousands of rows. This divide and conquer strategy worked well, but the RBAR was limited to less than a dozen rows, each of which was fed to a query which produced thousands of rows.

    Used correctly recursive CTEs are powerful tools and have their place in the toolbox.

    Used incorrectly, they can and will suck. Novice users need to be aware of this, that's all.

    You should have been a polititian, Bob. "I only tried it once, as a freshman. It wasn't lit and I didn't inhale":-D

    Maybe he is endeavoring in politics. He has been pretty absent from the forums for the past year 😉

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Has it really been a year? Oh wow....

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (11/8/2011)


    Has it really been a year? Oh wow....

    At a minimum.

    How are things going anyway?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Not too badly, Jason. I'm enjoying every breath I take. I remember that you had moved back to Utah. How are things going for you?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Things are real good right now. Fresh snow, cold weather and keeping busy with work and kids.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (11/7/2011)


    SQL Kiwi (11/7/2011)


    I would happily refer to rCTEs (and nested loops plans in general) as RBR rather than RBAR; it's the 'A' that I have issues with 🙂

    Mind you, everything's RBR when you get right down to the nuts and bolts of it...

    Yeah - when you start looking at it that way. 😀

    I agree... most everything at the machine language level of SQL Server is RBR because "file handlers" just can't do otherwise.

    I say "most everything" because there are exceptions, like rCTEs, that I'll continue to classify as RBAR. As described in Craig Freeman's article, rCTEs are nothing less than a push stack that touches each row in the stack at least 4 times... once to read it or calculate it, once to push it onto the stack, once to read it/use it , and once to delete it. The performance problems associated with counting-rCTEs compared to other methods of counting should be proof enough that they are, in fact, a form of "Hidden RBAR". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/8/2011)


    I say "most everything" because there are exceptions, like rCTEs, that I'll continue to classify as RBAR.

    That seems inconsistent to me. I prefer something closer to Chris' definition of RBAR: an explicitly procedural and looping coding style that results in agonizingly slow performance, while being unnecessarily inefficient. You only have to look at the examples Chris posted to see that recursive CTEs can be extremely efficient, and arguably the 'best' solution. Like any tool, recursive CTEs can be used wrongly, but that does not justify a blanket classification of RBAR.

    As described in Craig Freeman's article, rCTEs are nothing less than a push stack that touches each row in the stack at least 4 times... once to read it or calculate it, once to push it onto the stack, once to read it/use it , and once to delete it. The performance problems associated with counting-rCTEs compared to other methods of counting should be proof enough that they are, in fact, a form of "Hidden RBAR". 😉

    By that logic, you would classify anything that uses a common sub-expression spool as RBAR. For example, many of the windowed aggregates (and, arguably, the new window spool in SQL Server 2012) would fall into your RBAR category. Consider this query (2005 and above):

    USE AdventureWorks;

    SELECT

    INV.Bin,

    INV.Quantity,

    INV.ProductID,

    INV.Quantity,

    SubAgg.total_in_bin

    FROM Production.ProductInventory AS INV

    INNER JOIN

    (

    -- Total quantity in each bin

    SELECT

    INV2.Shelf,

    INV2.Bin,

    total_in_bin = SUM(INV2.Quantity)

    FROM Production.ProductInventory INV2

    GROUP BY

    INV2.Shelf,

    INV2.Bin

    ) AS SubAgg ON

    SubAgg.Shelf = INV.Shelf

    AND SubAgg.Bin = INV.Bin

    WHERE

    INV.Shelf >= 'A'

    AND INV.Shelf <= 'C'

    ORDER BY

    INV.Shelf,

    INV.Bin;

    This is a very efficient query, and entirely set-based, and yet uses spools in the same way you find to be RBAR in the recursive CTE case!

  • If we went by Chris' definition of RBAR, then Triangular joins would be considered to be set based code... and they're not because they "touch" most rows more than once... even though you have shown that they can be made quite fast with some clever indexing. That still doesn't change the fact that they're hidden RBAR.

    rCTE's are another form of "Hidden RBAR" because of the "push stack" style of spool they use. The rows are inserted, selected, and deleted.

    For the spools you have in your query, I wonder how fast the system would work on it without the spools? After all, most spools are also a form of Hidden RBAR because they are first written to and then read. Sure, you're query is fast... it would be faster without the spools.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • So, anything that touches rows more than once is RBAR? I think we have to use a reductio ad absurdum argument to convince you that this is a silly definition, because if Paul's query doesn't convince you, and neither do Chris's, it's obvious that using examples where touching some rows more than once works extremely well won't convince you. I don't like arguing that way, because it usually looks aggressive or snide, but I don't see any other way of getting you to admit that on that definition there is no way you can maintain that RBAR is always undesirable or that there is always a conflict between set-based and RBAR.

    Using your definition:

    A sort is always RBAR (except when we do it by using an index on the sort key; even if we have enough RAM to hold one slot reference for every possible value of the sort key so that we can do a counting sort we have to touch each row twice, and as for slower sorts like quicksort, or merge sort, or any other known sort ... well, definitely RBAR).

    So a query using ORDER BY must be RBAR unless there is an existing index on the sort key, and the query plan is forced to use a seek+scan on that index no matter how inefficient that is, since that touches each row only once even if the query doesn't need to touch most of those rows at all because use of a different index would have eliminated most of them. That's a consequence os a general rule against touching rows more than once, which means that touching 100,000,000 rows once each is better than touching 5 rows three times each.

    But wait: how do we get that index? Creating an index involves sorting, unless there are no rows at the time the index is created. So creating an index on a non-empty table is clearly RBAR, and to be avoided.

    Equally, getting an aggregate (whether partitioned or not) is of course RBAR because it touches its output rows many times - unfortunately there is no getting round this and still having aggregates, so to avoid being RBAR we must never write a query that uses an aggregate.

    Even simple equi-joins are RBAR, since a row in the left hand table may match more than one row in the right hand table, and vice-versa, and that will involve touching those rows more than once; an equijoin is RBAR unless it's an equijoin on equality of candidate keys! A natural join is RBAR if it's the natural join of two tables that recreates the original table that we had before splitting it in two to get from 4NF to 5NF! If we can't get the joined information out without going RBAR, we'd better not do the split. Same for a natural join that is needed because we normalised from 1NF to 2NF. Oh dear, on this definition of RBAR we'd better abandon 2NF, 3NF, EKNF, BCNF, 4NF, 5NF, 6NF and DKNF because doing any normalisation at all will force us to write queries that are inherently RBAR. Our databases will have far far far far far more rows in them that if we normalised, but at least our queries will more often touch them only once.

    I don't want to go on with this nonsense, because surely even with as little of it as that it must be patently obvious that it is nonsense. But it's all an inevitable consequence of (A) the assertion that anything that touches a row more than once is RBAR and is not set-based combined with (B) the assertion that anything RBAR is bad and undesirable. That means that one of those two assertions is absurd, and I'm pretty sure that with a sensible concept of what RBAR is it won't be (B).

    Tom

  • L' Eomot Inversé (11/12/2011)


    That's a consequence os a general rule against touching rows more than once, which means that touching 100,000,000 rows once each is better than touching 5 rows three times each.

    Gosh, Tom. I never suggested anything so absurd or silly. Touching 100 million rows to process 5 would be RBAR on steroids. Touching 5 rows three times each is also a form of RBAR if there's a solution that would allow you to more efficiently do it with a single touch. If you are doing three totally different things to the 5 rows and those things cannot be effectively combined, then it might not be RBAR. It might simply be applying 3 different requirements to the same rows.

    Shifting gears a bit, consider that Paul found a marvelous way to greatly increase the performance of a Triangular Join to aggregate running totals with proper indexing and code... so much so that it appeared to be a viable replacement for the Quirky Update. Does that make Triangular Joins any less RBAR? Is performance the only measure as to whether something is RBAR or not? My answer is "No" to both especially if there's an equally performant bit of code that uses fewer resources.

    Is all RBAR bad? Must all RBAR be avoided? No... Not if there's not a more efficient manner to process the data. The "lasagne" code necessary to convert an Adjacency List to a Hierarchical Path is a good example of that and is actually an example of RBAR and set-based code combined (think of your 5 row example here). Each row must be handled at least twice which forms the RBAR segment of the code but multiple rows are processed in sets known as "levels". To the best of my knowledge, there's no other way to do such a thing. But, for sure, the RBAR of a Triangular Join should be avoided because there are better ways to do the same thing even if some of those ways are other less expensive forms of RBAR.

    But let’s temporarily forget about our differences for what the definition of RBAR actually is for now. Let’s peel one potato at a time. Returning to the subject at hand... are recursive CTEs RBAR? Yes. Even the execution plan for a counting-rCTE shows single row processing. Run the following counting-rCTE with the Actual Execution plan turned on and see. Look at the actual number of "rebinds" in the "Filter" block of the Actual Execution plan.

    WITH

    cteCounter AS

    ( --=== Counter rCTE counts from 0 to 11

    SELECT 0 AS N --This provides the starting point (anchor) of zero

    UNION ALL

    SELECT N + 1 --This is the recursive part

    FROM cteCounter

    WHERE N < 1000

    ) --=== Add the counter value to a start date and you get multiple dates

    SELECT StartOfMonth = DATEADD(mm,N,'2011')

    FROM cteCounter

    OPTION (MAXRECURSION 0)

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It rather depends on whether you write the dumbest possible recursive solution or not:

    WITH rCTE AS

    (

    SELECT 1 AS N

    UNION ALL

    SELECT N

    FROM rCTE

    CROSS JOIN (

    VALUES

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1)) AS V (v)

    )

    SELECT TOP (1000)

    StartOfMonth =

    DATEADD(MONTH,

    ROW_NUMBER() OVER (ORDER BY @@SPID)-1,

    CONVERT(DATE, '20110101'))

    FROM rCTE

    ORDER BY ROW_NUMBER() OVER (ORDER BY @@SPID)

    OPTION (MAXRECURSION 0);

  • Jeff Moden (11/12/2011)


    Is all RBAR bad? Must all RBAR be avoided? No... Not if there's not a more efficient manner to process the data.

    It all does come down to efficiency. There are efficient and inefficient examples of set-based, iterative, and recursive solutions of course. This is why I am uncomfortable with the term 'RBAR': it is too often taken to mean that all iterative or recursive solutions are less efficient than a set-based solution. Why else does the 'A' stand for agonizing?

    The triangular join is a good example. It's hard to argue that a naive triangular join is anything other than set-based; it's just inefficient in many cases. There's nothing 'magic' about the indexing I use to prevent some examples of a triangular join from touching the whole triangle of rows per iteration - it's just an efficient implementation.

    So, overall, I think it is much more useful to talk about efficiency in general than to try to shoe-horn somewhat fuzzy concepts like 'RBAR' and 'Hidden RBAR' into the mix. Efficiency, it seems to me, comes down to doing the least work possible to return the required results.

  • Jeff Moden (11/12/2011)


    But let’s temporarily forget about our differences for what the definition of RBAR actually is for now. Let’s peel one potato at a time. Returning to the subject at hand... are recursive CTEs RBAR? Yes. Even the execution plan for a counting-rCTE shows single row processing.

    The trouble is Jeff, that you are jumping from "counting recursive CTEs are RBAR", which no-one (as far as I know) has disagreed with (because certainly in the current MS implementation they are RBAR), to "all recursive CTEs are RBAR". That's as bad as jumping from "triangular joins are RBAR" to "all joins are RBAR".

    If you look back to the beginning of this discussion, you'll see (if you don't remember) that I suggested that a recursive CTE in any statement that could be written cleanly as a single SQL statement which didn't use a recursive CTE (not just a counting CTE) should be considered undesirable and avoided. But a recursive CTE used in the case where something can't be done any other way except by using a cursor isn't undesirable, it's a means of avoiding an absolutely row by row solution. That isn't based on just the performance difference. It's based on the fact that as a mathematician I consider recursive definitions as the normal method of definition in set theory, so they are certainly set based, and as an engineer I know that if properly implemented they would not have to operate one row at a time.

    If rCTEs had been around in the 1980s we would have had a massively parallel implementation of them in our flagship data engine at ICL; by 1990 we, and other members of the European Declarative System project, were suggesting they should be included (as part of a more general recursive query feature) in the next SQL standard; we achieved good massively parallel implementation of logic languages where the fundamental objects are predicates expressed as recursively defined sets of tuples (which is of course exactly what recursive CTEs are) so we knew we could make recursive queries and recursive definitions of rowsets fly without doing things row by row.

    edit: get [/quote] in the right place!

    Tom

  • SQL Kiwi (11/13/2011)


    It rather depends on whether you write the dumbest possible recursive solution or not:

    OK, going to 1000 yours has 67% of the logical (row) reads that the naive one has and only half the CPU usage. Neither has any physical reads. But it takes 25% more elapsed time. I get that consistently on my laptop.

    I wonder why?

    Tom

  • L' Eomot Inversé (11/13/2011)


    OK, going to 1000 yours has 67% of the logical (row) reads that the naive one has and only half the CPU usage. Neither has any physical reads. But it takes 25% more elapsed time. I get that consistently on my laptop. I wonder why?

    Not sure - the CROSS JOIN rCTE is consistently faster for me, though the difference becomes more pronounced as N gets larger. FYI this is the script I used:

    SET STATISTICS IO, TIME ON;

    WITH

    cteCounter AS

    ( --=== Counter rCTE counts from 0 to 11

    SELECT 0 AS N --This provides the starting point (anchor) of zero

    UNION ALL

    SELECT N + 1 --This is the recursive part

    FROM cteCounter

    WHERE N < 10000

    ) --=== Add the counter value to a start date and you get multiple dates

    SELECT StartOfMonth = DATEADD(mm,N,'2011')

    INTO #a

    FROM cteCounter

    OPTION (MAXRECURSION 0);

    WITH rCTE AS

    (

    SELECT 1 AS N

    UNION ALL

    SELECT N

    FROM rCTE

    CROSS JOIN (

    VALUES

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1),

    (1),(1),(1),(1),(1)) AS V (v)

    )

    SELECT TOP (10000)

    StartOfMonth =

    DATEADD(MONTH,

    ROW_NUMBER() OVER (ORDER BY @@SPID)-1,

    CONVERT(DATE, '20110101'))

    INTO #b

    FROM rCTE

    ORDER BY ROW_NUMBER() OVER (ORDER BY @@SPID)

    OPTION (MAXRECURSION 0);

    DROP TABLE #a, #b

    SET STATISTICS IO, TIME OFF

    Table 'Worktable'. Scan count 2, logical reads 60007, physical reads 0

    CPU time = 125 ms, elapsed time = 128 ms.

    Table 'Worktable'. Scan count 2, logical reads 40569, physical reads 0

    CPU time = 78 ms, elapsed time = 99 ms.

    For clarity though, I just wanted to show a recursive plan without the Filter Jeff was objecting to, and where the number of executions on the lower level was much reduced. I'm not advocating rCTEs for this sort of task.

    edit: Make sure you run this with Actual Execution Plan off, without any traces running, that sort of thing. Also, one thing I forgot to mention - the original CTE doesn't like having an ORDER BY N added (the plan needs an explicit sort). I added one to my code because I wanted to be sure of the presentation ordering (without relying on undocumented behaviour) and it comes for free 🙂

Viewing 15 posts - 31 through 45 (of 55 total)

You must be logged in to reply to this topic. Login to reply