There Must Be 15 Ways To Lose Your Cursors… Part 2: Just Put It in a S

  • min.li (4/28/2009)


    Thanks Samuel and Peter.

    In the real environment where this function is to be used, the 'TestData' table is not really large. Or to be exact, the TestData table is large, but this function only works on a subset of the TestData table each time in sequence; the size of the subset is defined by a passed-in parameter 'PageSize', which will never be over 1000. In this context, I think the cursor solution is still suitable, as scaling is not such an important factor to consider.

    That may be true today, but it may not be true in the future.

    Can you guarantee that the table will never get a lot larger than it is today?

    Can you guarantee that you will never have to operate on more than 1000 rows at a time?

    I think not. You are making a mistake. Scalability is ALWAYS important when you are talking about databases.

    And, I have to say that the cursor solution is a lot easier for me to read and understand, as the solutions that you two have offered are quite complex to me at the moment; I will not be able to work them out by myself. So I still think a cursor is appropriate in such circumstances.

    Basically you are saying that it is appropriate to use a cursor because you don't understand the other (better) solutions.

    That's like saying you're always going to ride a bike because you don't know how to drive a car.

    Don't you think it would make more sense to work on understanding those solutions?

  • min.li (4/28/2009)


    Thanks Samuel and Peter.

    In the real environment where this function is to be used, the 'TestData' table is not really large. Or to be exact, the TestData table is large, but this function only works on a subset of the TestData table each time in sequence; the size of the subset is defined by a passed-in parameter 'PageSize', which will never be over 1000. In this context, I think the cursor solution is still suitable, as scaling is not such an important factor to consider. And, I have to say that the cursor solution is a lot easier for me to read and understand, as the solutions that you two have offered are quite complex to me at the moment; I will not be able to work them out by myself. So I still think a cursor is appropriate in such circumstances.

    min.li,

    You are likely to cause a stir here at SQL Central by your reasoning as to why to stick to a cursor based approach. Not that it is our business to tell you what to do, but the very purpose of this article series is to show people that there are better alternatives to using cursors for (nearly?) everything. The reason cursors remain out there as much as they are, while not the best tool for any job by far, is a lack of SQL knowledge.

    The most obvious solution is to fix this lack of SQL knowledge, hence articles like these. While I understand your reasoning I do feel you are giving up too quickly. The code I provided obviously contains some new constructs that are alien to you now. But the best course of action is to spend some time understanding them and apply them in everyday work when applicable. You will find that a lot of difficult problems become less difficult as your toolset increases.

    My strong sentiment is that leaving cursors out there in the field is exactly what causes new people to quickly adopt them even when they are not the right tool for the job. It’s a self-reinforcing process that is counterproductive and thus must be broken. Articles like these have this in mind, helping and educating as many people as possible so they can see better ways of solving SQL problems. Success here depends on the willingness of the people we try to help to drop their resistance and learn the better solutions. I hope you will try to learn why the code works well, so you can use it on your next problem instead of falling back on the familiar cursor construct.

    Personally I believe the best people over here only became that by learning all the time and adopting new constructs when proven that these work better than what they knew before. What I am certain of is that none achieved their level by giving up or sticking to proven flawed constructs. SSC is an ideal place to learn because quite a number of experienced people here spend a large amount of time analyzing and solving problems presented to us in order to help others and to raise the collective level of SQL knowledge.

  • CROSS APPLY isn't much more than a fancy correlated sub-query with all of the same advantages and disadvantages...

    Thanks Jeff and Lynn. I knew that CROSS APPLY wouldn't be better than re-writing everything. I just wondered if it would be better than a WHILE loop. And since you can't call stored procs within a table valued function, then the idea is moot anyway.

    ASIDE: If not for these kinds of scenarios, it makes me wonder why they came up with the CROSS APPLY at all. Other than the occasional query that is using DMVs, I haven't seen any cases where doing CROSS APPLY makes sense. And perhaps it makes sense in queries using DMVs because MS set it up that way. This is off topic, so I won't speculate any more.

  • JJ B (4/28/2009)


    CROSS APPLY isn't much more than a fancy correlated sub-query with all of the same advantages and disadvantages...

    Thanks Jeff and Lynn. I knew that CROSS APPLY wouldn't be better than re-writing everything. I just wondered if it would be better than a WHILE loop. And since you can't call stored procs within a table valued function, then the idea is moot anyway.

    ASIDE: If not for these kinds of scenarios, it makes me wonder why they came up with the CROSS APPLY at all. Other than the occasional query that is using DMVs, I haven't seen any cases where doing CROSS APPLY makes sense. And perhaps it makes sense in queries using DMVs because MS set it up that way. This is off topic, so I won't speculate any more.

    Well, where to start... 🙂

    For one, they are the ONLY means by which a table valued function can be incorporated in a FROM clause when using dynamic parameters, i.e. invocations that depend on values found in records of tables that are already present/joined in the FROM clause.

    This adds the option to encapsulate complex logic in an inline table valued function and hide this complexity, execute it just one time per record while using the result multiple times inside the SELECT or WHERE part, or even short-circuit the result set to deal with a smaller set in subsequent joins (a sketch follows below).

    I love cross apply, really 🙂
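
    As a rough illustration of Peter's point (and only an illustration: the table and function names below are invented for this example, not taken from the article or the thread), an inline table valued function combined with CROSS APPLY might look something like this, with the function receiving a value from each row of the outer table:

        -- Hypothetical inline table valued function encapsulating per-customer logic.
        CREATE FUNCTION dbo.fn_RecentOrders (@CustomerID int, @TopN int)
        RETURNS TABLE
        AS
        RETURN
        (
            SELECT TOP (@TopN)
                   o.OrderID,
                   o.OrderDate,
                   o.TotalAmount
              FROM dbo.Orders o
             WHERE o.CustomerID = @CustomerID
             ORDER BY o.OrderDate DESC
        );
        GO

        -- CROSS APPLY feeds c.CustomerID from each outer row into the function,
        -- something a plain JOIN to the function cannot do.
        SELECT c.CustomerID,
               c.CustomerName,
               r.OrderID,
               r.OrderDate,
               r.TotalAmount
          FROM dbo.Customers c
         CROSS APPLY dbo.fn_RecentOrders(c.CustomerID, 3) AS r;

    The complex logic lives once in the function, runs once per outer row, and its result can be referenced as many times as needed in the outer query.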

  • My previous post was actually trying to ask if cursors still have their use in certain circumstances. And in my particular case, it does weigh performance over scaling; the current page size is 100 and it is very unlikely to go beyond 1000 due to other factors.

    It is not that I do not want to learn what I am not familiar with, or I would not be here. I just want to ask if a cursor can be considered in certain circumstances or should be gotten rid of completely. And now your answer seems to me a definite YES.

  • Andy DBA (4/27/2009)


    greatheep (4/27/2009)


    So, the gauntlet is down - any suggestions on how to achieve the same benefits using set processing?

    In summary, we want:

    - If there is an error in a row, none of the various updates/inserts that are done should be kept - all should be rolled back.

    - Any rows which have no error should process completely, committing changes to the database.

    - We should be able to tell after the fact which rows processed completely, and which rows have an error, as well as the specific error each row experienced (they could be different for each row in error)

    I agree with other replies, without specifics it's hard to provide an exact answer, but I think a general approach can be suggested.

    It sounds like your staging table already has an error column. For the following general approach, you may want to either add a column for each possible error condition or create an error table with a many-to-one relationship to the staging rows. Which one you choose depends on the number of possible errors, what your reporting requirements are, etc.

    The overview for a set based solution is quite simple (especially because I don't know your specific requirements ;-)) and you probably already have many error checking code snippets in your cursor loop that you can re-use. Basically you run a series of set based updates on the entire staging table (or related error table) to flag the rows that match each error condition. Then you process all rows that don't have error flags set and report on the ones that do (a sketch of this approach follows after this post).

    Obviously this will not help for cases where you have concurrency issues because you can't screen for that. One approach for concurrency problems would be to break the set of good rows into "chunks" and lump all inserts and the update which marks the rows as "complete" into the same transaction. If anything "errors out", the whole transaction will be rolled back. I'm sure you'll get lots of differing opinions on that approach, though. :w00t:

    Hopefully set based updates won't present as big a target for blocking locks, but depending on your server and other concurrent processes, 1.5 million rows at a time might be a big bite to chew.

    I have a procedure that does similar processing. We have a staging table; the procedure reads the rows into a cursor and then loops through a row at a time and performs a series of validations against the data. If it finds an error, it writes out to an error log, skips the rest of the processing and continues to the next record.

    When I initially created the procedure, I considered doing it with a series of set based queries. However, intuitively it seemed less productive to me. I have a hard time believing that doing dozens of queries against large tables is less resource intensive than doing one query and looping through. I'm fairly new at SQL and haven't actually tested one approach against the other, so maybe I'm wrong.

    Another approach that I considered taking was doing the validation in a single query, using a WHERE clause that would contain all the various combinations of invalid data and a complex CASE statement that would return the correct error. However, I believe that if I did it this way I would end up with a query so complicated it would be very difficult for someone else to figure out what it was attempting to do.

    I also considered a combination of the two approaches, do several moderately complex queries that each handle three or four different validations.

    In the end, I decided that the cursor approach was probably both more efficient and more readable/maintainable. I could just be showing my inexperience here, and I'm definitely open to being shown the error of my ways.
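
    As a rough sketch of the flag-then-process approach Andy DBA describes above (all table, column and message names here are invented for the example, not taken from anyone's actual schema), each validation rule becomes one set based UPDATE, and the clean rows are then processed in a single statement:

        -- Hypothetical staging table with an ErrorMsg column; a separate error
        -- table with a many-to-one relationship would work just as well.
        UPDATE s
           SET ErrorMsg = 'Missing customer'
          FROM dbo.Staging s
         WHERE s.ErrorMsg IS NULL
           AND NOT EXISTS (SELECT 1 FROM dbo.Customers c
                            WHERE c.CustomerID = s.CustomerID);

        UPDATE s
           SET ErrorMsg = 'Negative quantity'
          FROM dbo.Staging s
         WHERE s.ErrorMsg IS NULL
           AND s.Quantity < 0;

        -- ...one set based UPDATE per remaining validation rule...

        -- Process every row that passed all checks in a single statement.
        INSERT INTO dbo.Destination (CustomerID, Quantity, Amount)
        SELECT s.CustomerID, s.Quantity, s.Amount
          FROM dbo.Staging s
         WHERE s.ErrorMsg IS NULL;

        -- Rows with ErrorMsg still populated are the ones to report on.
        SELECT s.*
          FROM dbo.Staging s
         WHERE s.ErrorMsg IS NOT NULL;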

  • What is counterintuitive to people new to SQL Server is that each statement in a batch carries quite a large overhead. It’s an overhead that is reasonably small in most procedural scripting languages and as good as absent in a fully compiled one. This makes it quite natural for people with a procedural language background to misjudge the speed of procedural code in SQL.

    Obviously cursors using the DECLARE CURSOR construct are by their very nature procedural, and thus each step of the processing incurs this penalty. On top of this there is the resulting case of iteratively accumulating a final or intermediate result set, record by record, something referred to as RBAR (Row By Agonizing Row, © by Jeff Moden). Once you realize that each added record is basically a complete new set operation by itself, you start to see how obvious it is that procedural looping code in a database is severely flawed.

    In a database you “interface” with tables, and doing operations on those record by record while needing the same resources again and again for each insert quickly adds up. Thus individual inserts of 10 records into a 100-record table use resources (locks, indexes) 10 times as frequently as a single insert of those 10 records would. The problem becomes bigger as the table to operate on increases in size or when a table is indexed more heavily.

    What might seem acceptable when testing with 4 people on one server will totally lock down when used on a production server where the load is 10-fold or more. Unfortunately at this point better hardware cannot solve the problem for you, only lessen the damage a (very) little. Add into the mix that software is frequently used more intensively than originally anticipated, due to added reports, more features or organizational changes, and you see how volatile a situation you get when most of your database code is written procedurally.
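
    To make that overhead concrete, here is a minimal sketch (dbo.Target and its columns are hypothetical, purely for illustration) contrasting the row-by-row pattern with a single set based statement; every pass through the loop is a complete statement with its own plan lookup, locking, index maintenance and logging:

        -- RBAR: one INSERT statement, and one round of overhead, per row.
        DECLARE @i int;
        SET @i = 1;
        WHILE @i <= 1000
        BEGIN
            INSERT INTO dbo.Target (N, NSquared)
            VALUES (@i, @i * @i);
            SET @i = @i + 1;
        END;

        -- Set based: the same 1000 rows in a single statement, paying the
        -- per-statement overhead once.
        WITH Numbers AS
        (
            SELECT TOP (1000)
                   ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
              FROM sys.all_columns a
             CROSS JOIN sys.all_columns b
        )
        INSERT INTO dbo.Target (N, NSquared)
        SELECT N, N * N
          FROM Numbers;

    Both versions load the same 1000 rows, but the loop pays the per-statement cost 1000 times while the single INSERT pays it once.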

  • I see several broken gauntlets on the ground. As I asked for before, give us requirements, DDL for the staging tables and destination tables, processing rules for the data, sample data in a consumable format for both the staging table and the destination tables (I assume that one of the errors that can occur is that the data already exists in the destination table(s)), and your current cursor based code.

    Edit: We also need the expected results based on the sample data.

    Put all of this in a separate thread, then post a link in this thread to your new thread.

    Give us a chance to demonstrate that cursors aren't necessarily required. If we are wrong, we'll publicly admit that a cursor was a better option in your case if that turns out to be true.

    Personally, without seeing all the above, I am confident we will provide you with a scalable set-based solution that will blow the doors off your cursor based solutions.

  • peter (4/28/2009)


    Once you realize that each added record is basically a complete new set operation by itself, you start to see how obvious it is that procedural looping code in a database is severely flawed.

    In a database you “interface” with tables, and doing operations on those record by record while needing the same resources again and again for each insert quickly adds up. Thus individual inserts of 10 records into a 100-record table use resources (locks, indexes) 10 times as frequently as a single insert of those 10 records would. The problem becomes bigger as the table to operate on increases in size or when a table is indexed more heavily.

    I didn't realize that each row was actually doing a new set operation. In that case, I can definitely see how doing it with a series of set operations would be much more efficient.

    In my particular case, the insert operations aren't as big of an issue. Even though I am processing a large table, there normally aren't more than a few errors each time, so I'm not taking a huge hit on this. Of course, it would be better to do them in one operation.

  • Lynn Pettis (4/28/2009)


    I see several broken gauntlets on the ground.

    Peter's explanation has already convinced me that set operations would be better. In my case, it wasn't a matter of me not knowing how to do it with sets; I just didn't think it would be as efficient. I didn't go through the trouble of including exact examples because I wanted to leave it generic. I am sure a lot of people are doing similar processing: validating data from an outside source.

    Some of my validations are actually done by stored procedures that are designed to work on individual rows but since they are my stored procedures they can easily be changed to use sets.

  • tpepin (4/28/2009)


    Lynn Pettis (4/28/2009)


    I see several broken gauntlets on the ground.

    Peter's explanation has already convinced me that set operations would be better. In my case, it wasn't a matter of me not knowing how to do it with sets; I just didn't think it would be as efficient. I didn't go through the trouble of including exact examples because I wanted to leave it generic. I am sure a lot of people are doing similar processing: validating data from an outside source.

    Some of my validations are actually done by stored procedures that are designed to work on individual rows but since they are my stored procedures they can easily be changed to use sets.

    Yours isn't the only gauntlet I am talking about. There are several others out there. Based on the above, I'm thinking you have seen the light based on Peter's explanation and are working on it yourself (any problems, though, please don't hesitate to post a new thread and ask for help!). That is commendable, and I think you will find it well worth the effort.

  • Stored procedures aren't bad when they are used for guiding a procedure of a series of actions; it is basically what they are for. What is bad is procedural LOOPING code (WHILE etc.) that claims locks and/or repeatedly does similar accesses within a loop.

    This naturally includes cursors.

    A stored procedure like everything else in SQL should be written to minimize resource use and the amount of time it needs to lock said resources. A stored procedure full of set based code, say a series of set based operations leading to the desired result, is perfectly fine.

  • min.li (4/28/2009)


    My previous post was actually trying to ask if cursors still have their use in certain circumstances. And in my particular case, it does weigh performance over scaling; the current page size is 100 and it is very unlikely to go beyond 1000 due to other factors.

    It is not that I do not want to learn what I am not familiar with, or I would not be here. I just want to ask if a cursor can be considered in certain circumstances or should be gotten rid of completely. And now your answer seems to me a definite YES.

    Actually, the people here are telling you NO.

    This is NOT a case where a cursor should be used.

    There may be a few cases where cursors can be considered, but yours is not one of them.

    You will get better performance AND scalability with either of the other solutions offered.

    If you insist on using a cursor, then it sounds to the people who are trying to help you that you don't want to learn.

  • peter (4/28/2009)


    Stored procedures aren't bad when they are used for guiding a procedure of a series of actions; it is basically what they are for. What is bad is procedural LOOPING code (WHILE etc.) that claims locks and/or repeatedly does similar accesses within a loop.

    This naturally includes cursors.

    A stored procedure like everything else in SQL should be written to minimize resource use and the amount of time it needs to lock said resources. A stored procedure full of set based code, say a series of set based operations leading to the desired result, is perfectly fine.

    In my particular case, the stored procedures are all set based. It is just that I am passing a single key value to them that I use to join to other tables. I am mostly using them for code reusability and to hide the detail from the calling procedure to make them more readable.
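
    For what it's worth, here is a sketch of that kind of conversion (procedure, table and column names are invented for the example, not taken from the poster's code): a validation routine that takes one key and is called once per row can usually be rewritten so that the join supplies the keys and the whole staging set is validated in one statement:

        -- Per-row style: called once for every staging row, typically from a loop.
        CREATE PROCEDURE dbo.usp_ValidateRow
            @StagingID int
        AS
        BEGIN
            UPDATE s
               SET ErrorMsg = 'Unknown product'
              FROM dbo.Staging s
             WHERE s.StagingID = @StagingID
               AND s.ErrorMsg IS NULL
               AND NOT EXISTS (SELECT 1 FROM dbo.Products p
                                WHERE p.ProductID = s.ProductID);
        END;
        GO

        -- Set based rewrite: the same check applied to every staging row in one
        -- statement; the key parameter disappears because the join supplies it.
        CREATE PROCEDURE dbo.usp_ValidateAll
        AS
        BEGIN
            UPDATE s
               SET ErrorMsg = 'Unknown product'
              FROM dbo.Staging s
             WHERE s.ErrorMsg IS NULL
               AND NOT EXISTS (SELECT 1 FROM dbo.Products p
                                WHERE p.ProductID = s.ProductID);
        END;
        GO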

  • tpepin (4/28/2009)


    In the end, I decided that the cursor approach was probably both more efficient and more readable/maintainable. I could just be showing my inexperience here, and I'm definitely open to being shown the error of my ways.

    Lynn and Peter have already addressed efficiency issues, but I think tpepin's concerns about readability and maintainability are important.

    However, with good formatting (and proper use of stored procedures), you can still have readability and maintainability on quite complicated set based operations. I've seen well-commented WHERE clauses that might take up 10 or 20 lines, but it is very easy to pinpoint and maintain each condition being checked. I've also seen fairly simple queries that were almost impossible to read without breaking them apart first in an editor.

    I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines.

    Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.

