There Must Be 15 Ways To Lose Your Cursors… Part 2: Just Put It in a Set

  • min.li,

Without writing a whole pile of SQL, here are a few tips:

    1. union

UNION can be used to combine datasets

    e.g.

    select *

    from TableA TA

    inner join TableB TB

    on TA.ID = TB.ID

    where TA.Column = 'RL'

    UNION ALL

    select *

    from TableA TA

    inner join TableC TC

    on TA.ID = TC.ID

    where TA.Column = 'SL'

    UNION ALL

    select *

    from TableA TA

    inner join TableD TD

    on TA.ID = TD.ID

    where TA.Column = 'TL'

    2. better use of the join criteria

    Select TA.id, isnull(TB.Column, TC.column)

    from TableA TA

    left outer join TableB TB

    on TA.ID = TB.ID

    and TA.column = 'RL'

    left outer join TableC TC

    on TA.ID = TC.ID

    and TA.column = 'SL'

    3. Derived Tables (can also be done with a CTE)

    Select *

    from TableA TA

    inner join (select 'RL' as dtype, ID, Column from TableB

    union all

    Select 'SL' as dtype, ID, Column from TableC) DT

    on TA.ID = DT.ID

    and TA.Column = DT.dtype
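    For completeness, the same derived table written as a CTE, using the same placeholder tables ([Column] is bracketed below because COLUMN is a reserved word):

    with DT ( dtype, ID, [Column] ) as
    (
        select 'RL' as dtype, ID, [Column] from TableB
        union all
        select 'SL' as dtype, ID, [Column] from TableC
    )
    select *
    from TableA TA
    inner join DT
    on TA.ID = DT.ID
    and TA.[Column] = DT.dtype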

  • This query leads to an identical resultset, but I have no idea what you are trying to accomplish.

    SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, COALESCE(FactRL.datatext,factsl.datatext) DataText

    FROM TestData

    LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID

    LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID

    ORDER BY TestData.RowNumber


    Dutch Anti-RBAR League

  • And, of course, populating the tables can be done without a LOOP.

    SET ROWCOUNT 10000;

    SELECT IDENTITY(INT,101,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;

    SET ROWCOUNT 0;

    INSERT INTO dbo.TestData(Name, RowNumber, DataID) SELECT CASE WHEN tally.id/10%2 = 0 THEN 'RL' ELSE 'SL' END, tally.id - 100, tally.id FROM #Tally tally;

    INSERT INTO dbo.FactRL(DataID, DataTime, DataText) SELECT tally.id, GETDATE(), 'RL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally WHERE tally.id/10%2 = 0;

    INSERT INTO dbo.FactSL(DataID, DataText) SELECT tally.id, 'SL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally WHERE tally.id/10%2 <> 0;

    DROP TABLE #Tally;
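    On 2005 and up the same numbers table can also be built without SET ROWCOUNT; this is just another way of doing it:

    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + 100 AS id
    INTO   #Tally
    FROM   sys.all_columns a CROSS JOIN sys.all_columns b; -- same 101-10100 range as above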

    Tally ho!:-D


    Dutch Anti-RBAR League

  • SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, COALESCE(FactRL.datatext,factsl.datatext) DataText

    FROM TestData

    LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID

    LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID

    ORDER BY TestData.RowNumber

    I think there are a few issues with this solution:

    In a real scenario: firstly, the DataID values could be the same in the FactRL and FactSL tables; I just quickly populated the test data, which happens to have distinct values in the two tables. Secondly, there are actually more than two FactXX tables involved; I only listed two for simplicity. Thirdly, more fields are merged from the different tables than illustrated here, and the field names are not guaranteed to be the same across the source tables (the data types are the same, though).

  • Regarding the duplicate DataIDs: Well, then we can use a UNION.

    SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, FactRL.DataText

    FROM TestData

    JOIN FactRL ON TestData.DataID = FactRL.DataID

    UNION

    SELECT TestData.RowNumber, TestData.DataID, NULL ,TestData.Name NameFrom, factsl.datatext DataText

    FROM TestData

    JOIN FactSL ON TestData.DataID = FactSL.DataID

    /* UNION the other tables here in the same way, unless their number varies */

    ORDER BY TestData.RowNumber

    And add some extra UNIONS in case of TL, UL, ...


    Dutch Anti-RBAR League

  • gserdijn (4/29/2009)


    Regarding the duplicate DataIDs: Well, then we can use a UNION.

    SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, FactRL.DataText

    FROM TestData

    JOIN FactRL ON TestData.DataID = FactRL.DataID

    UNION

    SELECT TestData.RowNumber, TestData.DataID, NULL ,TestData.Name NameFrom, factsl.datatext DataText

    FROM TestData

    JOIN FactSL ON TestData.DataID = FactSL.DataID

    /* UNION the other tables here in the same way, unless their number varies */

    ORDER BY TestData.RowNumber

    And add some extra UNIONS in case of TL, UL, ...

    If you test it with my newly edited test-data population script, you will see that this solution is not correct. The join has to be based on the value of TestData.Name. The number of records in the returned resultset should be 100 instead of 200.

  • Again, I don't see anything that needs a cursor; let me have a go at it!

  • Back...

    First I modified one of the source tables after looking at the queries.

    create nonclustered index ix_TestData_DataID on dbo.TestData ( DataID );

    I also added one more result column to my previous function as it might come in handy at some point.

    alter function dbo.TestNumberRange_nocursor1() returns table

    as

    return

    (

    with

    lowerEdgesQ( SegmentID, Name, RowNumber ) as

    (

    select

    row_number() over ( order by d.RowNumber asc )

    , d.Name

    , d.RowNumber

    from

    dbo.TestData as d

    where

    -- detects lower edge of each name

    not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber - 1 )

    )

    , upperEdgesQ( SegmentID, Name, RowNumber ) as

    (

    select

    row_number() over ( order by d.RowNumber asc )

    , d.Name

    , d.RowNumber

    from

    dbo.TestData as d

    where

    -- detects upper edge of each name

    not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber + 1 )

    )

    select

    lowerEdgesQ.SegmentID as SegmentID

    , lowerEdgesQ.Name as Name

    , lowerEdgesQ.RowNumber as StartNum

    , upperEdgesQ.RowNumber as EndNum

    from

    lowerEdgesQ

    inner join upperEdgesQ on upperEdgesQ.SegmentID = lowerEdgesQ.SegmentID

    )

    ;

    My conversion of the to-be-improved code that min.li provided:

    ;

    with

    segmentQ( Name, StartNum, EndNum ) as

    (

    select

    tn.Name

    , tn.StartNum

    , tn.EndNum

    from

    dbo.TestNumberRange_nocursor1() as tn

    /* Add this WHERE clause if you do not want to process all segments every time:

    where

    tn.SegmentID between 1 and 5

    */

    )

    /* RL data */

    , RLdataQ( NameFrom, RowNumber, DataID, DataTime, DataText ) as

    (

    select

    s.Name

    , t.RowNumber

    , t.DataID

    , f.DataTime

    , f.DataText

    from

    segmentQ as s

    inner join dbo.TestData as t on t.RowNumber between s.StartNum and s.EndNum

    inner join dbo.FactRL as f on f.DataID = t.DataID

    where

    s.Name = 'RL'

    )

    /* SL data */

    , SLdataQ( NameFrom, RowNumber, DataID, DataTime, DataText ) as

    (

    select

    s.Name

    , t.RowNumber

    , t.DataID

    , NULL

    , f.DataText

    from

    segmentQ as s

    inner join dbo.TestData as t on t.RowNumber between s.StartNum and s.EndNum

    inner join dbo.FactSL as f on f.DataID = t.DataID

    where

    s.Name = 'SL'

    )

    /* The actual query */

    select RowNumber, DataID, DataTime, NameFrom, DataText from SLdataQ

    union all select RowNumber, DataID, DataTime, NameFrom, DataText from RLdataQ

    order by

    RowNumber

    ;

    You can see I kept a lot of whitespace in the query to aid readability and put the partial results in separate common table expressions in order to keep the final query as clear as possible. The WHERE clauses in the table-specific queries are essential for speed, as they cut down the number of rows to be processed before any joins take place. They thus have the same function as the IF constructs in the original code!

    Now the results you have all been waiting for 🙂

    For 100 records as in the supplied test script:

    /* cursor original by min.li */

    CpuMs LogRds Elapsed

    ----------- -------------------- -----------

    0 863 3

    /* cursor free by peter */

    CpuMs LogRds Elapsed

    ----------- -------------------- -----------

    0 834 3

    Not shocking, but now let’s take a look at the 10,000 row case!

    /* cursor original by min.li */

    CpuMs LogRds Elapsed

    ----------- -------------------- -----------

    1672 37199 1694

    /* cursor free by peter */

    CpuMs LogRds Elapsed

    ----------- -------------------- -----------

    125 3328 149

    Oops, someone call 911 😉

    This demonstrates that cursors really don't scale and that you have to test your solution with larger-than-expected data set sizes; doing so magnifies problems you would otherwise not detect. And don’t forget the original code was already partially optimized by removing the first cursor. Imagine what it would have been like with both cursors in there and 10,000 rows in the source table!
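    The exact measurement harness is not shown here; if you want to collect comparable CPU, logical-read and elapsed figures for your own tests, the standard statistics switches will do:

    SET STATISTICS TIME ON; -- reports CPU and elapsed milliseconds per statement
    SET STATISTICS IO ON;   -- reports logical reads per table

    -- ... run the query you want to measure here ...

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;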

  • Thanks a million, peter.

    You enlightened me on CTEs. Although I have read a little about CTEs, I haven't really used them so far. You have shown me how powerful they are, especially for replacing cursors. I thought CTEs were only used for hierarchy processing. I was convinced these two cursors were not replaceable, because I was looking for an approach outside of CTEs. Next time I see a cursor, I should think of a CTE first.

  • Thanks, min.li, for the compliment; like you, I am happy with the outcome.

    Be aware, though, that technically CTEs are not strictly required for this solution; they do, however, really improve the readability of the code by clearly expressing the individual parts of the query. This becomes especially apparent when many parts require the same input. In such cases CTEs can cut out redundant SQL code, resulting in a better expression of intent and improved readability.
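    A trivial example of what I mean, with a made-up dbo.Orders table: without a CTE the same filtered set has to be written twice, with a CTE it is written once and referenced twice.

    -- without a CTE: the same derived table appears twice
    select a.ID, a.Amount
    from ( select ID, Amount from dbo.Orders where Amount > 100 ) as a
    inner join ( select ID, Amount from dbo.Orders where Amount > 100 ) as b on b.ID = a.ID + 1;

    -- with a CTE: the shared input is expressed once and reused
    with bigOrdersQ( ID, Amount ) as
    (
        select ID, Amount from dbo.Orders where Amount > 100
    )
    select a.ID, a.Amount
    from bigOrdersQ as a
    inner join bigOrdersQ as b on b.ID = a.ID + 1;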

    Again, I am glad I could convince you to stay away from cursors.

  • min.li (4/29/2009)

    If you test it with my newly edited test-data population script, you will see that this solution is not correct. The join has to be based on the value of TestData.Name. The number of records in the returned resultset should be 100 instead of 200.

    Different input, different query.... 😉

    I don't mean to be obnoxious, but this gives an identical resultset.

    SELECT TestData.RowNumber, TestData.DataID,

    CASE WHEN testdata.name = 'RL' THEN FactRL.DataTime END AS DataTime,

    testdata.name,

    CASE WHEN testdata.name = 'RL' THEN factrl.datatext ELSE factsl.datatext END AS DataText

    FROM TestData

    LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID

    LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID

    ORDER BY TestData.RowNumber

    I will study peter's elegant solution tonight. Table function + CTE, learning every day!


    Dutch Anti-RBAR League

  • Hi everybody,

    I had asked this on the 12th page of this discussion. Seeing the active discussion, may I request that you post an answer to my query too? Please let me know in case further information (code/queries etc.) is required.

    [Quote]

    "Hi Barry,

    Been following your article and it really is nice...

    I have a case, though, where I am not able to remove the cursor (or iteration approach, for that matter). Can you suggest something?

    I'll give a dummy scenario...

    SCHEMA INFORMATION:

    We have the following tables in database (say : MasterDatabase)

    1. a user table containing data about the users...

    2. a project table listing the various projects

    3. a link table say user_projects linking the users to projects on a time basis ( i.e. from 'date' to 'date')

    Since the data is huge, the users' performance data is regularly backed up and stored in separate databases, say "PerformanceHistoryDB1", "PerformanceHistoryDB2", "PerformanceHistoryDB3" and so on...

    When this backup occurs, the user_projects table is updated to store the corresponding database reference in a column, say BackupDBID, for each row, indicating where the corresponding performance data can be found (for a user on a particular project within a time frame).

    All these history databases have the same schema. In particular, there is a table

    Performancedata within each history database that stores the rating and is mapped to the primary key 'user_projectsId' of the user_projects table in MasterDatabase ...

    THE PROBLEM IS:

    We need to get the corresponding ratings of all the users on a particular project within a time frame.

    Please note that each user's data might be in a separate history database's Performancedata table.

    Currently we have used 2-3 approaches for this.

    Approach 1:

    Use a cursor to fetch the DBID for each row and then, in each iteration, get the corresponding data from the Performancedata table in that particular DB.

    Approach 2:

    Use separate queries, one for each database, and then return the union of these resultsets. The problem with this approach is that the number of history databases keeps growing, which results in a maintenance hell: handling has to be added each time a new history database appears.

    Approach 3:

    Use a left join to the Performancedata table of each database and a CASE statement to check for NULL results and show the result from the database that actually returns data.

    Obviously the problem here would be performance/efficiency, plus the same problem as with approach 2.

    I really am in a dilemma.

    Please help.

    Thanks

    Sharad "

    [Unquote]

    Thanks again

    Sharad

  • I am happy to take a look at it, but right now I have to finish some actual work :(. In the meantime, if you can make a test case with a functionally correct structure, some fictional data and the expected results, I am sure a few of us will jump on the challenge.

    If the test case is large, then add it as a SQL script to your post. And tell us the version of SQL Server you use; anything 2005 and later is preferred 😉

    Cheers

  • sharad sinha (4/29/2009)


    Approach 2:

    Use separate queries, one for each database, and then return the union of these resultsets. The problem with this approach is that the number of history databases keeps growing, which results in a maintenance hell: handling has to be added each time a new history database appears.

    The problem you seem to have with this particular solution is easily circumvented by using a view that does the UNION ALL and is then used in all your queries. To me your case looks a lot like a classic partitioning situation with history stored in separate databases, where the data is perfectly segregated by a check constraint on a single-attribute primary key.

    Read up on partitioning; the old method, and the only one familiar to me, is to make tables as described above and then create a view based on them. You can then even insert data into the view, and the proper table in the proper database will be updated thanks to the check constraint on the PK. The only downside I experienced with that is that you have to specify every attribute of the view/table when doing an insert on it.

    You can also do a nightly rebuild of your view with a stored procedure: it can scan the available databases, check their schema and, if it matches the one you seek, add them to the view. It is best, however, to make this part of the deployment procedure that creates a new history database.
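    Roughly what such a view looks like; the database names follow your example, while Rating and the exact column list are guesses you will have to adjust to your real schema.

    CREATE VIEW dbo.PerformancedataAll
    AS
    SELECT user_projectsId, Rating FROM PerformanceHistoryDB1.dbo.Performancedata
    UNION ALL
    SELECT user_projectsId, Rating FROM PerformanceHistoryDB2.dbo.Performancedata
    UNION ALL
    SELECT user_projectsId, Rating FROM PerformanceHistoryDB3.dbo.Performancedata;
    GO

    -- Every query then joins to the view instead of picking a database per row:
    SELECT up.user_projectsId, p.Rating
    FROM   dbo.user_projects AS up
    INNER JOIN dbo.PerformancedataAll AS p ON p.user_projectsId = up.user_projectsId;

    -- For the view to be updatable (and for the optimizer to skip databases), each
    -- Performancedata table needs the check-constrained partitioning column in its
    -- primary key, as described above.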

    Again, partitioning is what you are looking for, methinks!

  • Lynn Pettis (4/27/2009)


    greatheep (4/27/2009)


    So, I have to agree with some of the thoughts I've seen on the board - namely, so far, the article(s) have not shown a case where I would ever have thought of using a cursor. At this point, I'm not convinced that cursors can ALWAYS be replaced by set-based processing, though I agree that they should be avoided.

    Just to stir the pot, let me present you with a situation in which we are using a cursor here at my company - a situation where I cannot see a useful way to eliminate the cursor.

    So, we receive a data file from an external source on a regular basis. This file contains roughly 1.5 million rows, give or take. We load this file into a staging table, and then run a process on it. This process is a cursor, which opens a transaction, applies pieces of the data to various tables, and then marks the row as "complete" (by setting a field on the row) and commits the transaction. If any error is encountered while attempting to process the row, all the processing for that row is rolled back, the error message is written to an "error_msg" field on the row, and the row is marked as "in error". The cursor then moves on to the next row.

    The purpose of this is that if a single row is in error, the specific problem is identified, and the rest of the data is not held up by the problem. The problems could be of various kinds - problems in the data file that was sent to us, rows or pages in the target table(s) locked by a user at the moment, etc. This could allow us to only have to reprocess a few rows, rather than the entire 1.5 million rows, which takes a while. Also, nailing down WHICH row caused the issue when processing as a set is VERY difficult.

    So, the gauntlet is down - any suggestions on how to achieve the same benefits using set processing?

    In summary, we want:

    - If there is an error in a row, none of the various updates/inserts that are done should be kept - all should be rolled back.

    - Any rows which have no error should process completely, committing changes to the database.

    - We should be able to tell after the fact which rows processed completely, and which rows have an error, as well as the specific error each row experienced (they could be different for each row in error)

    Unfortunately, it's a broken gauntlet. We can't do anything without requirements, DDL for the tables (staging and destination), sample data (in a consumable format -- see the first article linked in my signature block below), processing rules for the data, and expected results based on the sample data provided. If part of the "errors" is data already existing in the destination table(s), be sure to provide sample data for them as well.

    greatheep, I'd be very interested to see the logic behind what you are doing and if you or anyone else can provide steps for improvement. We have similar needs in some of our financial/investment calculations in that we need to know exactly where something failed and require that the failure not affect other rows being processed.

    Most of these stored procedures were written several years ago and the fact is that they work, although they may not work in the most efficient manner and do not scale too well either. Testing the results of a rewrite may take several man-months, and our efforts may be better spent optimizing the queries or table indexes within the nested(!) cursors rather than rewriting this as set-based logic.

    If I had unlimited time and resources, a set-based rewrite would be the way to go.
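    For what it is worth, the rough shape such a set-based rewrite usually takes (all table and column names below are invented for illustration, not our actual schema) is: validate everything up front and record per-row errors, then apply only the surviving rows in one pass, keeping row-by-row handling as a fallback for whatever the validations do not catch.

    -- Step 1: mark rows that fail a business rule, with the reason, without touching the targets.
    UPDATE s
    SET    status    = 'error',
           error_msg = 'Unknown AccountID'
    FROM   dbo.Staging AS s
    WHERE  s.status = 'pending'
      AND  NOT EXISTS (SELECT 1 FROM dbo.Account AS a WHERE a.AccountID = s.AccountID);
    -- (one UPDATE like this per validation rule, each with its own message)

    -- Step 2: apply every surviving row in a single set-based transaction.
    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO dbo.TargetTable (AccountID, Amount)
        SELECT s.AccountID, s.Amount
        FROM   dbo.Staging AS s
        WHERE  s.status = 'pending';

        UPDATE dbo.Staging SET status = 'complete' WHERE status = 'pending';

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        -- A failure here is something the validations did not cover: log it and,
        -- only for this batch, fall back to row-by-row processing to isolate the culprit.
        DECLARE @msg NVARCHAR(2048);
        SELECT @msg = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);
    END CATCH;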
