April 28, 2009 at 11:53 am
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
April 28, 2009 at 12:43 pm
In the kind of data cleaning that you're talking about, I generally break it up into separate tests. Just make each test work on the whole table, or the applicable subset of the table, all at once, instead of one row at a time.
For example, let's say you have a test to make sure that the data in the DateEntered field is actually a valid date. You can run that test one row at a time, or you can do something like:
Insert into dbo.ErrorRecords (RecordID, ErrorCode)
select RecordID, 1 as ErrorCode
from dbo.MyTable
where IsDate(DateEntered) = 0;
That single-pass operation is going to be a lot more efficient than checking row-by-row.
I think that's all we're talking about here. Right?
The advantage to splitting each test off into its own proc is that the master proc doesn't then have to store an execution plan complex enough to handle all the cases in all the tests. Splitting it up and having one master that calls one sub per test can result in significant improvements in efficiency and speed because of that factor.
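For example, the master might be nothing more than this (the proc names are made up, just to show the shape):
-- Hypothetical master proc: one set-based sub-proc per validation test.
CREATE PROCEDURE dbo.ValidateMyTable
AS
BEGIN
    SET NOCOUNT ON;

    -- Each sub-proc runs one test against the whole table
    -- and logs the failing rows to dbo.ErrorRecords.
    EXEC dbo.Validate_DateEntered;    -- e.g. the IsDate() check above
    EXEC dbo.Validate_RequiredFields;
    EXEC dbo.Validate_LookupCodes;
END;
Each sub-proc keeps its own small plan, and adding or dropping a test is just adding or removing one EXEC line.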
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 28, 2009 at 12:45 pm
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
April 28, 2009 at 1:24 pm
Lynn Pettis (4/28/2009)
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
You have all convinced me that the cursor method is definitely not the way to do it.
The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.
April 28, 2009 at 1:30 pm
tpepin (4/28/2009)
Lynn Pettis (4/28/2009)
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
You have all convinced me that the cursor method is definitely not the way to do it.
The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.
A series of set operations will almost certainly end up being better.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 28, 2009 at 1:44 pm
I'll have to agree with Gus on this, but it would help to see the current code to really make a determination.
April 28, 2009 at 2:32 pm
This was what I intended for the problem on page 8, which happens to look much the same as Samuel Vella's solution.
SET ROWCOUNT 5000;
SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));
INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally ORDER BY id;
DROP TABLE #Tally;
WITH EndData (name, beginrow, endrow, sort ) AS
(SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1
WHERE data.name <> data2.name OR data2.name IS NULL
),
BeginData (name, beginrow, endrow, sort ) AS
(SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1
WHERE data.name <> data2.name OR data2.name IS NULL
)
SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort
ORDER BY BeginData.beginrow;
When testing, it did not scale too well on our server:
[font="Courier New"]
CpuMs LogRds Elapsed
100 rows:
0 883 10
500 rows:
125 3613 120
1000 rows:
484 7054 442
2000 rows:
1748 22339 1718
3000 rows:
3791 46001 4001
5000 rows:
10577 117645 10651
Etcetera...
[/font]
Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?
April 28, 2009 at 2:34 pm
Seems to me a series of procs called by a master proc would be better: that way, if/when the data changes, you only have to add a new proc to test for a new error, or remove a proc from the master proc if that test is no longer needed. That saves having to rewrite all the code when your data changes just slightly.
-- Kit
April 28, 2009 at 2:37 pm
It's just a way to populate a table/temp table/etc... with some values.
http://www.sqlservercentral.com/articles/TSQL/62867/
April 28, 2009 at 2:58 pm
gserdijn (4/28/2009)
This was what I intended for the problem on page 8, which happens to look much the same as Samuel Vella's solution.
SET ROWCOUNT 5000;
SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));
INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally ORDER BY id;
DROP TABLE #Tally;
WITH EndData (name, beginrow, endrow, sort ) AS
(SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1
WHERE data.name <> data2.name OR data2.name IS NULL
),
BeginData (name, beginrow, endrow, sort ) AS
(SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1
WHERE data.name <> data2.name OR data2.name IS NULL
)
SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort
ORDER BY BeginData.beginrow;
When testing, it did not scale too well on our server:
[font="Courier New"]
CpuMs LogRds Elapsed
100 rows:
0 883 10
500 rows:
125 3613 120
1000 rows:
484 7054 442
2000 rows:
1748 22339 1718
3000 rows:
3791 46001 4001
5000 rows:
10577 117645 10651
Etcetera...
[/font]
Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?
Table variables are not a good choice when you have a large number of records. The query optimizer assumes that the number of records in a table variable is always 1, so you won't get an efficient query plan if you have a large number of records. Based on the query above, you probably should have used a properly indexed temporary table instead.
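For example, the @data table variable above could be swapped for something along these lines (just a sketch, not tested against your setup):
-- Temp table instead of a table variable: it gets statistics and can carry extra indexes.
CREATE TABLE #data (row INT IDENTITY PRIMARY KEY, name CHAR(2));

INSERT INTO #data (name)
SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally
ORDER BY id;

-- ...then run the same CTE query against #data instead of @data...

DROP TABLE #data;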
April 28, 2009 at 4:16 pm
[font="Verdana"]There's a point I'd like to raise.
When you are looking at something like a cursor and seeing how well it performs over a small number of rows, it's easy to think "well, that seems to be the best approach then."
There's an inherent assumption here: that your code will be the only code performing at the time. So how do cursors perform when you have 10 users running the same code at the same time? Or 100?
Just as many new database programmers use cursors (or equivalents) because they are familiar with procedural logic, they also assume that a test on an effectively single-user system is enough.
When you are developing database code, you need to ask some questions of the code:
1. What is the lifetime of this code? If it's a throwaway solution, then who cares. But if it's not, the chances are your code will be around for years. And if so, it's likely to be amended and reused.
2. What is the scalability of the code, in terms of data growth, parallelism, and concurrency?
Against both of these questions, cursors fare poorly. So it's not enough to say "oh, I tried my cursor code with 100 rows, and it works fine, thank you." What is the impact of introducing that code into a production system? Now, 1 year from now, 10 years from now?
That's why it's a good idea to learn the techniques that scale well. Get in the habit of using them. Practice and refine. Cursors are like comfort blankets... at some point they are best left behind.
April 28, 2009 at 4:40 pm
Hey, you leave my blankey out of it... 😉
😎 Kate The Great :w00t:
If you don't have time to do it right the first time, where will you find time to do it again?
April 28, 2009 at 6:20 pm
Heaps of thanks!
I have a cursor inside a trigger, and I always thought 'cursor' comes from the word 'curse'.
I had a stored proc written with the C word, which I converted; after reading your third part I'm going to deal with my cursor-in-a-trigger problem.
Once again, heaps of thanks. Your place in heaven is guaranteed.
April 29, 2009 at 1:23 am
Hi Barry,
Been following your article and it really is nice...
I have a case, though, where I am not able to remove the cursor (or iteration approach, for that matter). Can you suggest something?
I'll give a dummy scenario...
SCHEMA INFORMATION:
We have the following tables in a database (say MasterDatabase):
1. a user table containing data about the users
2. a project table listing the various projects
3. a link table, say user_projects, linking the users to projects on a time basis (i.e. from 'date' to 'date')
Since the data is huge, the users' performance data is regularly backed up into separate databases, say "PerformanceHistoryDB1", "PerformanceHistoryDB2", "PerformanceHistoryDB3" and so on...
When this backup occurs, the user_projects table is updated to store, for each row, the corresponding database reference in a column (say BackupDBID), indicating where the performance data for that user on that project within that time frame can be found.
All these history databases have the same schema. In particular, there is a table
Performancedata within each history database that stores the rating and is mapped to the primary key 'user_projectsId' of the user_projects table in MasterDatabase ...
THE PROBLEM IS:
We need to get the corresponding ratings of all the users on particular project within a time frame.
Please note that each user's data might be in a separate history database's Performancedata table.
Currently we have tried two or three approaches.
Approach 1:
Use a cursor to fetch the DBID for each row, and then in each iteration get the corresponding data from the Performancedata table in that particular DB.
Approach 2:
Use separate queries, one for each database, and then return the union of these result sets (see the rough sketch below). The problem with this approach is that the number of history databases keeps growing, which makes it a maintenance headache: handling for each new history database has to be added by hand.
Approach 3:
Use a left join with the Performancedata table of each database and a CASE statement to check for null results, showing the result from the database that actually returns data.
Obviously the problem here is performance/efficiency, plus the same maintenance problem as approach 2.
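Roughly, approach 2 looks like this (the project filter column and parameter names here are simplified placeholders):
-- One query per history database, UNIONed together (placeholder names for the filter).
SELECT up.user_projectsId, pd.rating
FROM MasterDatabase.dbo.user_projects AS up
JOIN PerformanceHistoryDB1.dbo.Performancedata AS pd
    ON pd.user_projectsId = up.user_projectsId
WHERE up.projectId = @projectId
UNION ALL
SELECT up.user_projectsId, pd.rating
FROM MasterDatabase.dbo.user_projects AS up
JOIN PerformanceHistoryDB2.dbo.Performancedata AS pd
    ON pd.user_projectsId = up.user_projectsId
WHERE up.projectId = @projectId
-- ...and another block has to be added for every new PerformanceHistoryDB that gets created...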
I really am in a dilemma.
Please help.
Thanks
Sharad
April 29, 2009 at 2:43 am
Since I was convinced yesterday that cursors should be eliminated whenever possible, I have another cursor which I would be happy to get rid of if I can.
-- create the test tables
IF OBJECT_ID('dbo.TestData') IS NOT NULL
DROP TABLE dbo.TestData
IF OBJECT_ID('dbo.FactRL') IS NOT NULL
DROP TABLE dbo.FactRL
IF OBJECT_ID('dbo.FactSL') IS NOT NULL
DROP TABLE dbo.FactSL
CREATE TABLE dbo.TestData
(Name varchar(4) not null,
RowNumber int not null,
DataID int not null,
CONSTRAINT [PK_TestData] PRIMARY KEY CLUSTERED (
[RowNumber] ASC
))
CREATE TABLE dbo.FactSL
( DataID int not null,
DataText varchar(100) not null
CONSTRAINT [PK_FactSL] PRIMARY KEY CLUSTERED (
[DataID] ASC))
CREATE TABLE dbo.FactRL
( DataID int not null,
DataTime datetime,
DataText varchar(100) not null
CONSTRAINT [PK_FactRL] PRIMARY KEY CLUSTERED (
[DataID] ASC))
-- populate test data using a tally table as suggested; note that RL and SL DataIDs can be identical
SET ROWCOUNT 100;
SELECT IDENTITY(int,101,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
INSERT INTO dbo.TestData(Name, RowNumber, DataID) SELECT CASE WHEN tally.id/10%2 = 0 THEN 'RL' ELSE 'SL' END, tally.id - 100, tally.id FROM #Tally tally;
INSERT INTO dbo.FactRL(DataID, DataTime, DataText) SELECT tally.id, GETDATE(), 'RL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally
INSERT INTO dbo.FactSL(DataID, DataText) SELECT tally.id, 'SL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally
DROP TABLE #Tally;
-- prepare the function, using the solution provided by Peter
create function dbo.TestNumberRange_nocursor1() returns table
as
return
(
with
lowerEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects lower edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber - 1 )
)
, upperEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects upper edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber + 1 )
)
select
lowerEdgesQ.Name
, lowerEdgesQ.RowNumber as StartNum
, upperEdgesQ.RowNumber as EndNum
from
lowerEdgesQ
inner join upperEdgesQ on upperEdgesQ.SegmentID = lowerEdgesQ.SegmentID
)
;
-- THE CURSOR. I thought of using a CASE statement instead, but it seems to me it cannot be used with a join.
declare @DataFact table
(
RowNumber int not null,
DataID int not null,
DataTime datetime null,
NameFrom varchar(4) not null,
DataText varchar(100) null
)
declare @DataRange table
(
TableFrom varchar(4) not null,
StartNum int not null,
EndNum int not null
)
declare @tableFrom varchar(4),
@startRowNo int,
@endRowNo int
insert into @DataRange (TableFrom,
StartNum ,
EndNum)
select * from dbo.TestNumberRange_nocursor1()
declare datarange_cursor cursor
for
select TableFrom, StartNum, EndNum from @DataRange
open datarange_cursor
fetch datarange_cursor
into @tableFrom, @startRowNo, @endRowNo
while (@@fetch_status = 0)
begin
if @tableFrom = 'RL'
insert into @DataFact (RowNumber,
DataID, DataTime, NameFrom, DataText)
select
td.RowNumber,
td.DataID,
rl.DataTime,
'RL' as NameFrom,
rl.DataText
from TestData td
JOIN FactRL rl ON td.DataID = rl.DataID
where td.RowNumber between @startRowNo and @endRowNo
else if @tableFrom = 'SL'
insert into @DataFact (RowNumber,
DataID, DataTime, NameFrom, DataText)
select
td.RowNumber,
td.DataID,
NULL as DataTime,
'SL' as NameFrom,
sl.DataText
from TestData td
JOIN FactSL sl ON td.DataID = sl.DataID
where td.RowNumber between @startRowNo and @endRowNo
fetch datarange_cursor into @tableFrom, @startRowNo, @endRowNo
end
close datarange_cursor
deallocate datarange_cursor
select * from @DataFact order by RowNumber
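I thought the whole thing might collapse into one set-based insert along these lines, but I'm not sure it's right (untested sketch; the CASE-with-left-joins combination is exactly what I was unsure about):
-- Set-based sketch: join each row to its range, then pick the matching fact table per range.
insert into @DataFact (RowNumber, DataID, DataTime, NameFrom, DataText)
select
    td.RowNumber,
    td.DataID,
    case dr.TableFrom when 'RL' then rl.DataTime else null end,
    dr.TableFrom,
    case dr.TableFrom when 'RL' then rl.DataText else sl.DataText end
from @DataRange dr
join dbo.TestData td
    on td.RowNumber between dr.StartNum and dr.EndNum
left join dbo.FactRL rl
    on dr.TableFrom = 'RL' and rl.DataID = td.DataID
left join dbo.FactSL sl
    on dr.TableFrom = 'SL' and sl.DataID = td.DataID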