Eliminating Cursors

  • Outstanding, Kevin! Thanks for posting the code.

Yep, Matt was right... one test is worth a thousand words... let the "test-off" begin. To make it easier for everyone to test the same thing, I'll rewrite your good examples and use them against some code that will generate a million rows in just about a minute... that way, everyone can play.

    I'll also throw in some set-based code to show that you should NEVER settle for a cursor or a While loop... it's too damned easy to do it set based.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Wasn't sure if you wanted the indexes at all. Here they are

CREATE CLUSTERED INDEX [SITE_C_1] ON [dbo].[SITE]([Site_Addr_St_Name], [Site_Addr_St_Num], [Site_Addr_St_Suff], [Site_Addr_St_Post_Dir]) ON [FILEGROUP4]
GO
CREATE UNIQUE INDEX [SITE_UPKNC] ON [dbo].[SITE]([Site_Id])
GO
CREATE INDEX [SITE_NC_2] ON [dbo].[SITE]([Zoning_Dsg], [Site_Id])
GO
CREATE INDEX [SITE_NC_3] ON [dbo].[SITE]([Site_Id], [Zoning_Dsg])
GO

Haven't read enough on here to be absolutely sure what you are referring to in terms of sets. (As you can see, I am fairly new to posting here. I probably know what they are; they're just not registering in my brain right now.)

Well, welcome aboard, Kevin! "Set-based" is nothing more than processing sets of information instead of processing "Row By Agonizing Row" (otherwise known as "RBAR"... see my tag line below). Of course, there is some code that looks set based (no declared loop of any kind) that can actually be worse than a cursor. I refer to that stuff as hidden RBAR. Take a look at the following to see what I mean...

    http://www.sqlservercentral.com/articles/T-SQL/61539/
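Just to give a taste of what that article means by "hidden RBAR", here's a minimal sketch of a triangular-join running total (the dbo.Account table here is just a made-up example, not from the thread). There's no declared loop anywhere, but the correlated subquery rescans the table for every row, so the work grows roughly as the square of the row count...

```sql
--===== Looks set based... no cursor, no WHILE... but it's hidden RBAR.
     -- For each of the N rows returned, the correlated subquery reads
     -- up to N rows again... a "triangular" join... roughly N*N/2 reads.
 SELECT a.AccountID,
        a.Amount,
        RunningTotal = (SELECT SUM(b.Amount)
                          FROM dbo.Account b
                         WHERE b.AccountID <= a.AccountID)
   FROM dbo.Account a
  ORDER BY a.AccountID
```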

Anyway, your spirit is in the right place with you posting code and timings and all. I'll be back with some code of my own...

    --Jeff Moden



  • I apologize for jumping into the middle of this thread, but I must be missing something. What's wrong with

    "Select sum(SF) as TheAnswer from Site"?

    Or is the whole point to benchmark a cursor loop against a while loop?

One note about benchmarking: even if you have a dedicated server machine, be careful about comparing benchmark results based on differences in GetDate() values (subtracting start time from finish time). I call that "wall clock" benchmarking. It's OK for measuring big changes, but not super accurate if you're splitting hairs, and totally inaccurate if anything else is going on on the server.

In my opinion, it is much more accurate to start a clean login (new spid) for each test case, run the tests, and then use SSMS's Activity Monitor or EM's Process Info to compare counts for CPU cycles, I/O, and perhaps memory used for each test.
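For what it's worth, an easy middle ground is to let the server itself report per-statement CPU and I/O instead of doing GetDate() math (the table and column here are just placeholders for whatever you're testing):

```sql
--===== Have SQL Server report CPU time, elapsed time, and reads for
     -- each statement... no "wall clock" subtraction required.
    SET STATISTICS TIME ON
    SET STATISTICS IO ON

 SELECT SUM(Site_Frontage)      --<<< the statement under test
   FROM dbo.Site

    SET STATISTICS TIME OFF
    SET STATISTICS IO OFF
```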

    Andy

  • Oops, I meant

    "Select sum(site_frontage) as TheAnswer from Site"

  • Kevin Rathgeber (6/13/2008)


    ...(I know I could use sum(), I was just running a test on a large dataset)....

    OK, never mind. Looks like I might have been preaching to the choir! :w00t:

  • Oh definitely, the far preferred method would be to use the Select sum(fieldname) from table. The point of this whole thing is far greater than that.

The code above is just a very basic example of something you "could" do in a loop or cursor, not necessarily something you should do. You can remove the line of code where I am summing the frontage value, as that is just a placeholder for some other piece of code you would typically run in a loop or cursor.

    Edit: I see you beat me to it. I too kind of jumped into the middle of this.

  • Sure, I just can't think of any code I'd run in a loop, cursor or otherwise! 😀

    I'm serious about my comment on "Wall Clock" benchmarking, though. I don't know how many times a novice developer has asked me if an index was missing because their RBAR-from-hell code took 10 times longer to run today than yesterday. I had to implement a home-grown cpu monitoring utility on our development server so I could show them that it took 10 times longer today because the guy sitting in the cube next to them was running their own RBAR-from-hell code at the same time.

    Of course I then sit them down and show them how the same thing can be accomplished 100 times faster than yesterday by using a set-based approach. Then they use that approach forever after. (I also believe in the Easter Bunny.) 😉

  • Thanks Jeff. That was a good overview on the RBAR concept. Lucky for me I have tried to avoid that as much as possible. To tell you the truth it just seemed logical.

My favorite is when other developers I work with get a glazed-over look when I mention a Cartesian product to them.

    Have I avoided it in every case.....god no. I'm only human. I know there is some code out there where I have RBAR implemented.

  • I wanted to get some clearer results from what I had before. I am now on SQL 2005 Express. No one on the server but me. I am running each test 20 times and taking the average.

    I also removed the sum function so all that is left is the skeleton code for the cursor vs. the while loop.

I ran the scenario several times over, and the while loop always won out if I specified the entire recordset, but if I specified Top X then the cursor always won out. The Top X was bugging me, so I figured let's do this real-world, where you normally don't do a Top X. So I started specifying ranges of IDs using a BETWEEN clause of differing sizes.

Here are my results against the same recordset I used earlier in the day (65,528 rows):

Records       Cursor        While
 4,966       211.35 ms    175.25 ms
 9,926       394.85 ms    326.15 ms
19,892       829.85 ms    694.37 ms
33,527      1291.75 ms   1182.40 ms
65,528 (All) 2499.80 ms  2342.05 ms

    So using more real world scenarios, the while always seems to win out though not by much.

    Jeff, does this seem right to you?

Ok... lemme demonstrate a couple of things... first, we need some data that everyone can use, and we need lots of it because these tests run so darned fast. Instead of using TOP in all the code examples, change the TOP in the following data-generator code to match the size of the test you want... yep, I've included some extra columns because this is really a copy of my standard data generator and I thought I'd share it, as well... usually takes something less than 90 seconds to build a million rows. Details are where they belong... in comments in the code! 😉

DROP TABLE jbmtest
--===== Create and populate a 1,000,000 row test table.
     -- Column "RowNum" has a range of 1 to 100,000 unique numbers
     -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers
     -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings
     -- Column "SomeMoney" has a range of 0.0000 to 99.9999 non-unique numbers
     -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times
     -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'
     -- for all rows.
     -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)
     -- Jeff Moden
 SELECT TOP 1000000 --<<< LOOK! Change this number to change the size of the test
        SomeID       = IDENTITY(INT,1,1),
        SomeInt      = ABS(CHECKSUM(NEWID()))%50000+1,
        SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)
                     + CHAR(ABS(CHECKSUM(NEWID()))%26+65),
        SomeCSV      = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),
        SomeMoney    = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),
        SomeDate     = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),
        SomeHex12    = RIGHT(NEWID(),12)
   INTO dbo.JBMTest
   FROM Master.dbo.SysColumns t1,
        Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

--===== A table is not properly formed unless a Primary Key has been assigned
     -- Takes about 1 second to execute.
  ALTER TABLE dbo.JBMTest
    ADD PRIMARY KEY CLUSTERED (SomeID)

In case you didn't notice, we just eliminated a cursor or a while loop by using a constrained cross-join! Try building the same thing using a WHILE loop and see how long building a million rows of data takes! 😉

Ok... just like before, we're just going to do a sum on a column. Yes, SUM() would be the way to do it but, just like before, we're trying to give the code something very simple to do. Here's the cursor code... in the spirit of all that is "holy" in SQL, all the fat has been trimmed out of the code... no extra columns are passed in.

--===== Common
DECLARE @TimeStart DATETIME
--===== Cursor
    SET @TimeStart = GETDATE()
DECLARE @SomeMoney    MONEY,
        @SumSomeMoney MONEY
DECLARE curTest CURSOR
    FOR
 SELECT TOP 100000
        SomeMoney
   FROM dbo.JbmTest
   OPEN curTest
  FETCH NEXT FROM curTest
   INTO @SomeMoney
 SELECT @SumSomeMoney = 0
  WHILE @@FETCH_STATUS = 0
  BEGIN
        SELECT @SumSomeMoney = @SumSomeMoney + ISNULL(@SomeMoney,0)
         FETCH NEXT FROM curTest
          INTO @SomeMoney
    END
  CLOSE curTest
DEALLOCATE curTest
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

If you run that over and over for a million rows, you'll get an average close to this (total will vary because of the random nature of the numbers)...

Total                 Duration
--------------------- --------------------------
49955839.1400         00:00:36:267

    Let's do the same thing with a Temp table and While loop following the same rules...

--===== Common
DECLARE @TimeStart DATETIME
    SET NOCOUNT ON
--===== While Loop
    SET @TimeStart = GETDATE()
DECLARE @SumSomeMoney MONEY,
        @MyCount      INT,
        @Counter      INT
 SELECT IDENTITY(INT,1,1) AS RowNum,
        SomeMoney
   INTO #MyHead
   FROM dbo.JbmTest
 SELECT @MyCount = @@ROWCOUNT
  ALTER TABLE #MyHead
    ADD PRIMARY KEY CLUSTERED (RowNum) WITH FILLFACTOR = 100
 SELECT @SumSomeMoney = 0,
        @Counter = 1
  WHILE @Counter <= @MyCount
  BEGIN
        SELECT @SumSomeMoney = @SumSomeMoney + ISNULL(SomeMoney,0)
          FROM #MyHead
         WHERE RowNum = @Counter
        SELECT @Counter = @Counter + 1
    END
   DROP TABLE #MyHead
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

Total                 Duration
--------------------- --------------------------
49955839.1400         00:00:30:453

Oh my goodness! The Temp Table and While Loop method ran a good 6 seconds faster! (Don't even think of running it without that primary key unless you also like watching grass grow...) So THAT's why everyone says a WHILE loop is faster than a CURSOR...

Well, it's just not true, folks. Run the following cursor code... the ONLY thing I've changed is the FAST_FORWARD option on the cursor declaration...

--===== Common
DECLARE @TimeStart DATETIME
    SET NOCOUNT ON
--===== Cursor
    SET @TimeStart = GETDATE()
DECLARE @SomeMoney    MONEY,
        @SumSomeMoney MONEY
DECLARE curTest CURSOR FAST_FORWARD --"Firehose" cursor
    FOR
 SELECT SomeMoney
   FROM dbo.JbmTest
   OPEN curTest
  FETCH NEXT FROM curTest
   INTO @SomeMoney
 SELECT @SumSomeMoney = 0
  WHILE @@FETCH_STATUS = 0
  BEGIN
        SELECT @SumSomeMoney = @SumSomeMoney + ISNULL(@SomeMoney,0)
         FETCH NEXT FROM curTest
          INTO @SomeMoney
    END
  CLOSE curTest
DEALLOCATE curTest
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

Total                 Duration
--------------------- --------------------------
49955839.1400         00:00:31:810

Well, wadaya know... almost as fast as the While loop, and a lot less effort.

Here are the timings I got for all 3 methods across a range of row counts... it was done on a single-user server with its own hard disks... nothing is shared...

Rows     Cursor   While Loop  FH Cursor
10        0.000   0 - 0.017    0.000
100       0.000   0.017        0.000
1000      0.033   0.063        0.033
10000     0.390   0.343        0.297
100000    3.563   3.327        3.203
1000000  36.267  30.453       31.810

Notice that both cursor types are faster up to the 1000-row mark and that the firehose cursor is faster than the While loop all the way up past 100,000 rows...

In other words, from these simple tests, which measure the really basic performance of the 3 methods, converting cursors into While loops is a bit like attending a charity orgy to support virginity.

    Oh, yeah... let's do the set based equivalent and see what happens...

Rows     Cursor   While Loop  FH Cursor  Set Based
10        0.000   0 - 0.017    0.000      0.000
100       0.000   0.017        0.000      0.000
1000      0.033   0.063        0.033      0.000
10000     0.390   0.343        0.297      0.000
100000    3.563   3.327        3.203      0.063
1000000  36.267  30.453       31.810      0.673

I'm thinking we've pretty much destroyed the myth that either the cursor or the temp table with a While loop is faster than the other. They both suck almost equally badly, especially when the number of rows starts to get up there a bit.

Oh, yeah... what's the set-based solution that I used? Why, the basis of all intelligent SQL programming... we don't normally use a variable as the target of the code, but here it is...

--===== Set Based "Loop"
DECLARE @TimeStart DATETIME
    SET @TimeStart = GETDATE()
DECLARE @SomeMoney    MONEY,
        @SumSomeMoney MONEY
 SELECT @SumSomeMoney = 0
 SELECT @SumSomeMoney = @SumSomeMoney + ISNULL(SomeMoney,0)
   FROM dbo.jbmTest
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

Between the "Set Based 'Loop'", the constrained cross-join, and the Tally table, you can get around the need for just about any form of RBAR you can think of. Even a simple UPDATE can be used to take the place of many forms of RBAR. Take the supposedly necessarily procedural code for doing a running total... if you think that the only way to do one is with a cursor or while loop, or you've only seen the supposedly set-based but performance-challenged methods that use a triangular join, then you haven't seen the following article...

    http://www.sqlservercentral.com/articles/Advanced+Querying/61716/

    Now THAT's how you eliminate a cursor! 😀
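And, to connect the dots on the Tally table mentioned above, here's a minimal sketch of the idea (assuming a dbo.Tally table of sequential numbers has already been built... the constrained cross-join above is one way to build one):

```sql
--===== Instead of WHILE @Counter <= 10000 ... generate the whole set at once.
     -- Assumes dbo.Tally(N) contains the sequential numbers 1, 2, 3, ...
 SELECT t.N,
        SomeDate = DATEADD(dd, t.N-1, '2000-01-01') --e.g. one row per day
   FROM dbo.Tally t
  WHERE t.N <= 10000
```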

    --Jeff Moden



  • Kevin Rathgeber (6/13/2008)


    So using more real world scenarios, the while always seems to win out though not by much.

    Jeff, does this seem right to you?

    I'd pretty much agree depending on the machine you're using. As you just saw in the testing I did, they both take turns winning and one never wins over the other by very much.

    There are rare exceptions, but, for the most part, RBAR is a virtual guarantee of poor performance.

    --Jeff Moden



  • Thanks Jeff for that excellent in depth explanation.

As I explained to you privately, my own personal brain fart made me miss an important piece of working with sets. Now that I understand that blunder, it's a no-brainer when I think of it.

    Thanks again.

  • Thanks for the feedback, Kevin... and thanks for the code you posted. There are only a few people who will take the time to actually do performance tests never mind posting the results and the code they used. Very happy to have another one of the "good guys" join the forum!

    --Jeff Moden



  • You guys REALLY need to listen up when TheSQLGuru speaks!! 😀

    1) The first post I did on this thread way back requested someone test the result with the fastest cursor definition possible: DECLARE curTest CURSOR FORWARD_ONLY READ_ONLY STATIC LOCAL. No one did that, despite Jeff running quite the suite of tests using FAST_FORWARD - which is now proven to NOT be the fastest cursor method available. See here: http://sqlblog.com/blogs/hugo_kornelis/archive/2007/11/21/curious-cursor-optimization-options.aspx.

    On my laptop the above declaration for the cursor came in at 14.6 seconds, the fast_forward cursor was 17.0 sec avg with the temp table/while loop construct taking 14.8 sec avg. So my cursor still beats the while loop, even at 1M rows.

    2) it is just plain silly in this case to put the data into a temp table to iterate the while loop over. Just use SomeID as the driver value. This code ran in 10.9 sec avg, or almost 1/3 faster:

DECLARE @TimeStart DATETIME
    SET NOCOUNT ON
--===== While Loop
    SET @TimeStart = GETDATE()
DECLARE @SumSomeMoney MONEY,
        @MyCount      INT,
        @Counter      INT
 SELECT @MyCount = COUNT(*) FROM dbo.JbmTest
 SELECT @SumSomeMoney = 0,
        @Counter = 1
  WHILE @Counter <= @MyCount
  BEGIN
        SELECT @SumSomeMoney = @SumSomeMoney + ISNULL(SomeMoney,0)
          FROM dbo.JbmTest
         WHERE SomeID = @Counter
        SELECT @Counter = @Counter + 1
    END
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

    3) I also tried a TOP 1 implementation which I got to run in 11.5 sec avg - still a good bit faster than the cursors but slower than direct access. Note that I didn't debug this just slapped it together so not sure if it works right. Gotta run take care of baby!

DECLARE @TimeStart DATETIME
    SET NOCOUNT ON
--===== While Loop
    SET @TimeStart = GETDATE()
DECLARE @SumSomeMoney MONEY,
        @currentmoney MONEY,
        @currentrow   INT
 SELECT @SumSomeMoney = 0
 SELECT TOP 1 @currentrow = SomeID,
        @currentmoney = SomeMoney
   FROM dbo.JbmTest
  ORDER BY SomeID
  WHILE @@ROWCOUNT > 0
  BEGIN
        SET @SumSomeMoney = @SumSomeMoney + ISNULL(@currentmoney,0)
        SELECT TOP 1 @currentrow = SomeID,
               @currentmoney = SomeMoney
          FROM dbo.JbmTest
         WHERE SomeID > @currentrow
         ORDER BY SomeID
    END
 SELECT @SumSomeMoney AS Total,
        CONVERT(VARCHAR(26),GETDATE()-@TimeStart,114) AS Duration

    4) And if you REALLY want row-by-row stuff to be fastest, you need to step outside TSQL altogether. CLR RBAR blows the doors off of native cursors. http://sqlblog.com/blogs/hugo_kornelis/archive/2007/11/28/poor-men-see-sharp-ndash-more-cursor-optimization.aspx

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

