There Must Be 15 Ways To Lose Your Cursors… Part 2: Just Put It in a S

  • Andy DBA (4/28/2009)


    I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines.

    Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.

    Your points about formatting and commenting are well noted. They will both do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.

    Normally you are taking a complex task and breaking it into manageable steps. This usually leads to some inefficiency. It's the programmer's job to find the correct balance.

    My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.

  • In the kind of data cleaning that you're talking about, I generally break it up into separate tests. Just make each test work on the whole table, or the applicable subset of the table, all at once, instead of one row at a time.

    For example, let's say you have a test to make sure that the data in the DateEntered field is actually a valid date. You can run that test one row at a time, or you can do something like:

    Insert into dbo.ErrorRecords (RecordID, ErrorCode)

    select RecordID, 1 as ErrorCode

    from dbo.MyTable

    where IsDate(DateEntered) = 0;

    That single-pass operation is going to be a lot more efficient than checking row-by-row.

    I think that's all we're talking about here. Right?

    The advantage of splitting each test off into its own proc is that the master proc doesn't then have to store an execution plan complex enough to handle all the cases in all the tests. Splitting it up, and having one master that calls one sub-proc per test, can result in significant improvements in efficiency and speed because of that factor.
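
    Just to sketch the shape of it (all of these proc, table, and column names are made up for illustration):

    -- One sub-proc per test; each one is a single set-based statement.
    create procedure dbo.Validate_DateEntered
    as
    insert into dbo.ErrorRecords (RecordID, ErrorCode)
    select RecordID, 1 as ErrorCode
    from dbo.MyTable
    where IsDate(DateEntered) = 0;
    go

    create procedure dbo.Validate_CustomerID
    as
    insert into dbo.ErrorRecords (RecordID, ErrorCode)
    select mt.RecordID, 2 as ErrorCode
    from dbo.MyTable mt
    left join dbo.Customers c on c.CustomerID = mt.CustomerID
    where c.CustomerID is null;
    go

    -- The master just calls one sub-proc per test, so each call gets its
    -- own small, simple execution plan.
    create procedure dbo.ValidateMyTable
    as
    exec dbo.Validate_DateEntered;
    exec dbo.Validate_CustomerID;
    go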

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • tpepin (4/28/2009)


    Andy DBA (4/28/2009)


    I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines.

    Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.

    Your points about formatting and commenting are well noted. They will both do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.

    Normally you are taking a complex task and breaking it into manageable steps. This usually leads to some inefficiency. It's the programmer's job to find the correct balance.

    My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.

    But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.

    At this point you might say, "but I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1000-fold? Your cursor solution won't scale with the increase in volume.

    You do need to plan for that which is not foreseen. The requirement that will never change will change.

  • Lynn Pettis (4/28/2009)


    tpepin (4/28/2009)


    Andy DBA (4/28/2009)


    I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines.

    Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.

    Your points about formatting and commenting are well noted. They will both do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.

    Normally you are taking a complex task and breaking it into manageable steps. This usually leads to some inefficiency. It's the programmer's job to find the correct balance.

    My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.

    But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.

    At this point you might say, "but I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1000-fold? Your cursor solution won't scale with the increase in volume.

    You do need to plan for that which is not foreseen. The requirement that will never change will change.

    You have all convinced me that the cursor method is definitely not the way to do it.

    The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.

  • tpepin (4/28/2009)


    Lynn Pettis (4/28/2009)


    tpepin (4/28/2009)


    Andy DBA (4/28/2009)


    I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines.

    Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.

    Your points about formatting and commenting are well noted. They will both do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.

    Normally you are taking a complex task and breaking it into manageable steps. This usually leads to some inefficiency. It's the programmer's job to find the correct balance.

    My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.

    But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.

    At this point you might say, "but I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1000-fold? Your cursor solution won't scale with the increase in volume.

    You do need to plan for that which is not foreseen. The requirement that will never change will change.

    You have all convinced me that the cursor method is definitely not the way to do it.

    The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.

    A series of set operations will almost certainly end up being better.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I'll have to agree with Gus on this, but it would help to see the current code to really make a determination.

  • This was what I intended for the problem on page 8 which happens to look much the same as Samuel Vella's solution.

    SET ROWCOUNT 5000;

    SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;

    SET ROWCOUNT 0;

    DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));

    INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END

    FROM #Tally ORDER BY id;

    DROP TABLE #Tally;

    WITH EndData (name, beginrow, endrow, sort ) AS

    (SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)

    FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1

    WHERE data.name <> data2.name OR data2.name IS NULL

    ),

    BeginData (name, beginrow, endrow, sort ) AS

    (SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)

    FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1

    WHERE data.name <> data2.name OR data2.name IS NULL

    )

    SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort

    ORDER BY BeginData.beginrow;

    When testing, it did not scale too well on our server:

    Rows    CpuMs    LogRds    Elapsed
     100        0       883         10
     500      125      3613        120
    1000      484      7054        442
    2000     1748     22339       1718
    3000     3791     46001       4001
    5000    10577    117645      10651
    Etcetera...

    Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?


    Dutch Anti-RBAR League

  • Seems to me a series of procs called by a master proc would be better: that way, if/when the data changes, you only have to add a new proc to test for a new error, or remove a proc from the master proc if that test is no longer needed. That saves having to rewrite the whole thing if your data changes just slightly.

    -- Kit

  • It's just a way to populate a table/temp table/etc... with some values.

    http://www.sqlservercentral.com/articles/TSQL/62867/


    Dutch Anti-RBAR League

  • gserdijn (4/28/2009)


    This was what I intended for the problem on page 8 which happens to look much the same as Samuel Vella's solution.

    SET ROWCOUNT 5000;

    SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;

    SET ROWCOUNT 0;

    DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));

    INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END

    FROM #Tally ORDER BY id;

    DROP TABLE #Tally;

    WITH EndData (name, beginrow, endrow, sort ) AS

    (SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)

    FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1

    WHERE data.name <> data2.name OR data2.name IS NULL

    ),

    BeginData (name, beginrow, endrow, sort ) AS

    (SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)

    FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1

    WHERE data.name <> data2.name OR data2.name IS NULL

    )

    SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort

    ORDER BY BeginData.beginrow;

    When testing, it did not scale too well on our server:

    Rows    CpuMs    LogRds    Elapsed
     100        0       883         10
     500      125      3613        120
    1000      484      7054        442
    2000     1748     22339       1718
    3000     3791     46001       4001
    5000    10577    117645      10651
    Etcetera...

    Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?

    Table variables are not a good choice when you have a large number of records. The query optimizer assumes that the number of records in a table variable is always 1, so you won't get an efficient query plan if you have a large number of records. Based on the query above, you probably should have used a properly indexed temporary table instead.
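
    As a sketch of what that swap might look like against the test script above (only the table variable changes; the rest of the query stays the same):

    -- A temp table instead of the @data table variable; it gets real
    -- statistics, so the optimizer can see how many rows it actually holds.
    CREATE TABLE #data (row INT IDENTITY PRIMARY KEY, name CHAR(2));

    INSERT INTO #data (name)
    SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
    FROM #Tally
    ORDER BY id;

    -- ...then reference #data everywhere the CTEs referenced @data.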

  • There's a point I'd like to raise.

    When you are looking at something like a cursor and seeing how well it performs over a small number of rows, it's easy to think "well, that seems to be the best approach then."

    There's an inherent assumption here: that your code will be the only code performing at the time. So how do cursors perform when you have 10 users running the same code at the same time? Or 100?

    Just like many new database programmers use cursors (or equivalents) because they are familiar with procedural logic, so do they also assume that a test on an effectively single user system is enough.

    When you are developing database code, you need to ask some questions of the code:

    1. What is the lifetime of this code? If it's a throwaway solution, then who cares. But if it's not, chances are your code will be around for years. And if so, it's likely to be amended and reused.

    2. What is the scalability of the code, in terms of data growth, parallelism, and concurrency?

    Against both of these questions, cursors fare poorly. So it's not enough to say "oh, I tried my cursor code with 100 rows, and it works fine, thank you." What is the impact of introducing that code into a production system? Now, 1 year from now, 10 years from now?

    That's why it's a good idea to learn the techniques that scale well. Get in the habit of using them. Practice and refine. Cursors are like comfort blankets... at some point they are best left behind.


  • Hey, you leave my blankey out of it... 😉

    😎 Kate The Great :w00t:
    If you don't have time to do it right the first time, where will you find time to do it again?

  • Heaps of thanks,

    I have a cursor inside a trigger, and I've always thought "cursor" comes from the word "curse".

    I had a stored proc written with the C word, which I converted; after reading your third part I'm going to deal with my cursor-in-a-trigger problem.

    Once again, heaps of thanks; your place in heaven is guaranteed.
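
    For reference, the shape I'm aiming for is something like this (table and column names are invented here, just to show the idea): instead of walking the affected rows with a cursor, the trigger joins against the inserted pseudo-table once.

    CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        -- One set-based statement covers every row of the insert,
        -- whether the triggering statement affected 1 row or 10,000.
        INSERT INTO dbo.OrderAudit (OrderID, AuditDate)
        SELECT i.OrderID, GETDATE()
        FROM inserted AS i;
    END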

  • Hi Barry,

    Been following your article and it really is nice...

    I have a case, though, where I am not able to remove the cursor (or the iteration approach, for that matter). Can you suggest something?

    I'll give a dummy scenario...

    SCHEMA INFORMATION:

    We have the following tables in a database (say MasterDatabase):

    1. a user table containing data about the users...

    2. a project table listing the various projects

    3. a link table, say user_projects, linking the users to projects on a time basis (i.e. from 'date' to 'date')

    Since the data is huge, the users' performance data is regularly backed up into separate databases, say "PerformanceHistoryDB1", "PerformanceHistoryDB2", "PerformanceHistoryDB3" and so on...

    When this backup occurs, the user_projects table is updated to store the corresponding database reference in a column, say BackupDBID, for each row, so we know where the corresponding performance data can be found (for a user on a particular project within a time frame).

    All these history databases have the same schema. In particular, there is a table Performancedata within each history database that stores the rating and is mapped to the primary key 'user_projectsId' of the user_projects table in MasterDatabase...

    THE PROBLEM IS:

    We need to get the corresponding ratings of all the users on a particular project within a time frame.

    Please note that each user's data might be in a separate history database's Performancedata table.

    Currently we have tried two or three approaches.

    Approach 1:

    Use a cursor to fetch the DBID for each row and then get the corresponding data from the Performancedata table in that particular DB in each iteration.

    Approach 2:

    Use separate queries, one for each database, and then return the union of these result sets. The problem with this approach is that the number of history databases keeps growing, and each new database means adding new handling, which turns into a maintenance headache.
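
    For concreteness, approach 2 looks roughly like this (only two history databases shown, and the filter columns projectId, fromDate, and toDate are made-up names):

    DECLARE @projectId int, @from datetime, @to datetime;
    SET @projectId = 1; SET @from = '20090101'; SET @to = '20090401';

    SELECT up.user_projectsId, pd.rating
    FROM MasterDatabase.dbo.user_projects up
    JOIN PerformanceHistoryDB1.dbo.Performancedata pd
      ON pd.user_projectsId = up.user_projectsId
    WHERE up.projectId = @projectId
      AND up.fromDate >= @from AND up.toDate <= @to
    UNION ALL
    SELECT up.user_projectsId, pd.rating
    FROM MasterDatabase.dbo.user_projects up
    JOIN PerformanceHistoryDB2.dbo.Performancedata pd
      ON pd.user_projectsId = up.user_projectsId
    WHERE up.projectId = @projectId
      AND up.fromDate >= @from AND up.toDate <= @to
    -- ...and another UNION ALL block has to be added by hand every time
    -- a new history database is created.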

    Approach 3:

    Use a left join to the Performancedata table of each database and a CASE statement to check for NULL results and show the result from the database that actually returns data.

    Obviously the problem here would be performance/efficiency, plus the same problem as approach 2.

    I really am in a dilemma.

    Please help.

    Thanks

    Sharad

  • Since I was convinced yesterday that cursors should be eliminated whenever possible, I have another cursor which I would be happy to get rid of if I can.

    -- create the test table

    IF OBJECT_ID('dbo.TestData') IS NOT NULL

    DROP TABLE dbo.TestData

    IF OBJECT_ID('dbo.FactRL') IS NOT NULL

    DROP TABLE dbo.FactRL

    IF OBJECT_ID('dbo.FactSL') IS NOT NULL

    DROP TABLE dbo.FactSL

    CREATE TABLE dbo.TestData

    (Name varchar(4) not null,

    RowNumber int not null,

    DataID int not null,

    CONSTRAINT [PK_TestData] PRIMARY KEY CLUSTERED (

    [RowNumber] ASC

    ))

    CREATE TABLE dbo.FactSL

    ( DataID int not null,

    DataText varchar(100) not null

    CONSTRAINT [PK_FactSL] PRIMARY KEY CLUSTERED (

    [DataID] ASC))

    CREATE TABLE dbo.FactRL

    ( DataID int not null,

    DataTime datetime,

    DataText varchar(100) not null

    CONSTRAINT [PK_FactRL] PRIMARY KEY CLUSTERED (

    [DataID] ASC))

    --populate test data use tally table as suggested, also RL and SL's DataID can be identical

    SET ROWCOUNT 100;

    SELECT IDENTITY(int,101,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;

    SET ROWCOUNT 0;

    INSERT INTO dbo.TestData(Name, RowNumber, DataID) SELECT CASE WHEN tally.id/10%2 = 0 THEN 'RL' ELSE 'SL' END, tally.id - 100, tally.id FROM #Tally tally;

    INSERT INTO dbo.FactRL(DataID, DataTime, DataText) SELECT tally.id, GETDATE(), 'RL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally

    INSERT INTO dbo.FactSL(DataID, DataText) SELECT tally.id, 'SL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally

    DROP TABLE #Tally;

    --prepare the used function solution provided by peter

    create function dbo.TestNumberRange_nocursor1() returns table

    as

    return

    (

    with

    lowerEdgesQ( SegmentID, Name, RowNumber ) as

    (

    select

    row_number() over ( order by d.RowNumber asc )

    , d.Name

    , d.RowNumber

    from

    dbo.TestData as d

    where

    -- detects lower edge of each name

    not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber - 1 )

    )

    , upperEdgesQ( SegmentID, Name, RowNumber ) as

    (

    select

    row_number() over ( order by d.RowNumber asc )

    , d.Name

    , d.RowNumber

    from

    dbo.TestData as d

    where

    -- detects upper edge of each name

    not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber + 1 )

    )

    select

    lowerEdgesQ.Name

    , lowerEdgesQ.RowNumber as StartNum

    , upperEdgesQ.RowNumber as EndNum

    from

    lowerEdgesQ

    inner join upperEdgesQ on upperEdgesQ.SegmentID = lowerEdgesQ.SegmentID

    )

    ;

    -- THE CURSOR: I thought of using a CASE statement, but it seems to me it cannot be used with the join

    declare @DataFact table

    (

    RowNumber int not null,

    DataID int not null,

    DataTime datetime null,

    NameFrom varchar(4) not null,

    DataText varchar(100) null

    )

    declare @DataRange table

    (

    TableFrom varchar(4) not null,

    StartNum int not null,

    EndNum int not null

    )

    declare @tableFrom varchar(4),

    @startRowNo int,

    @endRowNo int

    insert into @DataRange (TableFrom,

    StartNum ,

    EndNum)

    select * from dbo.TestNumberRange_nocursor1()

    declare datarange_cursor cursor

    for

    select TableFrom, StartNum, EndNum from @DataRange

    open datarange_cursor

    fetch datarange_cursor

    into @tableFrom, @startRowNo, @endRowNo

    while (@@fetch_status = 0)

    begin

    if @tableFrom = 'RL'

    insert into @DataFact (RowNumber,

    DataID, DataTime, NameFrom, DataText)

    select

    td.RowNumber,

    td.DataID,

    rl.DataTime,

    'RL' as NameFrom,

    rl.DataText

    from TestData td

    JOIN FactRL rl ON td.DataID = rl.DataID

    where td.RowNumber between @startRowNo and @endRowNo

    else if @tableFrom = 'SL'

    insert into @DataFact (RowNumber,

    DataID, DataTime, NameFrom, DataText)

    select

    td.RowNumber,

    td.DataID,

    NULL as DataTime,

    'SL' as NameFrom,

    sl.DataText

    from TestData td

    JOIN FactSL sl ON td.DataID = sl.DataID

    where td.RowNumber between @startRowNo and @endRowNo

    fetch datarange_cursor into @tableFrom, @startRowNo, @endRowNo

    end

    close datarange_cursor

    deallocate datarange_cursor

    select * from @DataFact order by RowNumber
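
    Thinking about it some more while typing this: since every row inside one of those ranges shares the same Name, maybe the whole loop collapses to the two inserts below, and the @DataRange table and ranges function aren't needed for this step at all? Just a sketch, not tested against the output above; it would replace the cursor section.

    -- set-based replacement for the cursor block above (sketch)
    insert into @DataFact (RowNumber, DataID, DataTime, NameFrom, DataText)
    select td.RowNumber, td.DataID, rl.DataTime, 'RL' as NameFrom, rl.DataText
    from dbo.TestData td
    join dbo.FactRL rl on td.DataID = rl.DataID
    where td.Name = 'RL'
    union all
    select td.RowNumber, td.DataID, NULL as DataTime, 'SL' as NameFrom, sl.DataText
    from dbo.TestData td
    join dbo.FactSL sl on td.DataID = sl.DataID
    where td.Name = 'SL'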

