April 28, 2009 at 11:53 am
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
April 28, 2009 at 12:43 pm
In the kind of data cleaning that you're talking about, I generally break it up into separate tests. Just make each test work on the whole table, or the applicable subset of the table, all at once, instead of one row at a time.
For example, let's say you have a test to make sure that the data in the DateEntered field is actually a valid date. You can run that test one row at a time, or you can do something like:
Insert into dbo.ErrorRecords (RecordID, ErrorCode)
select RecordID, 1 as ErrorCode
from dbo.MyTable
where IsDate(DateEntered) = 0;
That single-pass operation is going to be a lot more efficient than checking row-by-row.
I think that's all we're talking about here. Right?
The advantage to splitting each test off into its own proc is that the master proc doesn't then have to store an execution plan complex enough to handle all the cases in all the tests. Splitting it up and having one master that calls one sub per test can result in significant improvements in efficiency and speed because of that factor.
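For example, the master might be nothing more than this (the proc names are made up, just to show the shape):
-- Hypothetical master proc: one set-based sub-proc per validation test.
CREATE PROCEDURE dbo.ValidateMyTable
AS
BEGIN
    SET NOCOUNT ON;

    -- Each sub-proc runs one test against the whole table
    -- and logs the failing rows to dbo.ErrorRecords.
    EXEC dbo.Validate_DateEntered;    -- e.g. the IsDate() check above
    EXEC dbo.Validate_RequiredFields;
    EXEC dbo.Validate_LookupCodes;
END;
Each sub-proc keeps its own small plan, and adding or dropping a test is just adding or removing one EXEC line.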
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 28, 2009 at 12:45 pm
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
April 28, 2009 at 1:24 pm
Lynn Pettis (4/28/2009)
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
You have all convinced me that the cursor method is definitely not the way to do it.
The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.
April 28, 2009 at 1:30 pm
tpepin (4/28/2009)
Lynn Pettis (4/28/2009)
tpepin (4/28/2009)
Andy DBA (4/28/2009)
I think suggestions for proper formatting and good comments are left out of many discussions and examples because the focus is more on performance (rightly so) and also because nobody wants to type in 100 lines when 10 will prove their point. I would never recommend sacrificing performance for readability, but queries and sps will perform just as well when peppered with comments and spread out over many lines. Hopefully nobody out there is choosing an inefficient approach to their work just because they think it will be easier for the next developer to understand and maintain.
Your points about formatting and commenting are well noted. Both will do wonders for making the code maintainable, but I also believe that the very nature of programming leads to trade-offs between efficiency and maintainability.
You are normally taking a complex task and breaking it into manageable steps. This normally leads to some inefficiency. It's the programmer's job to find the correct balance.
My original problem was a case in point: do I do each validation with a separate query, do I attempt to do it with one very complicated query, or do I do it with a limited number of fairly complex queries? I think a valid case can be made for each approach.
But using a cursor means checking each row one at a time. If it takes 100 ms to check 1 record, how long will it take to check 1,000,000 records? I'll let you do the math.
At this point you might say, "But I'm only processing 1,000 records." That may be true now, but what about 3 months from now, or 12 months, or 36 months? What happens if the volume increases 100- or 1,000-fold? Your cursor solution won't scale with the increase in volume.
You do need to plan for that which is not foreseen. The requirement that will never change will change.
You have all convinced me that the cursor method is definitely not the way to do it.
The only thing we are debating now is the merits of doing the processing with one set operation as opposed to breaking it down into a series of set operations, but we are probably getting off topic for this thread.
A series of set operations will almost certainly end up being better.
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
April 28, 2009 at 1:44 pm
I'll have to agree with Gus on this, but it would help to see the current code to really make a determination.
April 28, 2009 at 2:32 pm
This was what I intended for the problem on page 8, which happens to look much the same as Samuel Vella's solution.
SET ROWCOUNT 5000;
SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));
INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally ORDER BY id;
DROP TABLE #Tally;
WITH EndData (name, beginrow, endrow, sort ) AS
(SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1
WHERE data.name <> data2.name OR data2.name IS NULL
),
BeginData (name, beginrow, endrow, sort ) AS
(SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1
WHERE data.name <> data2.name OR data2.name IS NULL
)
SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort
ORDER BY BeginData.beginrow;
When testing, it did not scale too well on our server:
[font="Courier New"]
CpuMs LogRds Elapsed
100 rows:
0 883 10
500 rows:
125 3613 120
1000 rows:
484 7054 442
2000 rows:
1748 22339 1718
3000 rows:
3791 46001 4001
5000 rows:
10577 117645 10651
Etcetera...
[/font]
Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?
April 28, 2009 at 2:34 pm
Seems to me a series of procs called by a master proc would be better: that way, if/when the data changes, you only have to add a new proc to test for a new error, or remove a proc from the master proc if that test is no longer needed. That saves having to rewrite all the code when your data changes just slightly.
-- Kit
April 28, 2009 at 2:37 pm
It's just a way to populate a table/temp table/etc... with some values.
http://www.sqlservercentral.com/articles/TSQL/62867/
April 28, 2009 at 2:58 pm
gserdijn (4/28/2009)
This was what I intended for the problem on page 8, which happens to look much the same as Samuel Vella's solution.
SET ROWCOUNT 5000;
SELECT IDENTITY(INT,0,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
DECLARE @data TABLE (row INT IDENTITY PRIMARY KEY, name CHAR(2));
INSERT INTO @data (name) SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally ORDER BY id;
DROP TABLE #Tally;
WITH EndData (name, beginrow, endrow, sort ) AS
(SELECT data.name, 0, data.row, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row - 1
WHERE data.name <> data2.name OR data2.name IS NULL
),
BeginData (name, beginrow, endrow, sort ) AS
(SELECT data.name, data.row, 0, ROW_NUMBER () OVER (ORDER BY data.row)
FROM @data data LEFT JOIN @data data2 ON data.row = data2.row + 1
WHERE data.name <> data2.name OR data2.name IS NULL
)
SELECT begindata.name, begindata.beginrow, enddata.endrow FROM BeginData JOIN EndData ON BeginData.sort = EndData.sort
ORDER BY BeginData.beginrow;
When testing, it did not scale too well on our server:
[font="Courier New"]
CpuMs LogRds Elapsed
100 rows:
0 883 10
500 rows:
125 3613 120
1000 rows:
484 7054 442
2000 rows:
1748 22339 1718
3000 rows:
3791 46001 4001
5000 rows:
10577 117645 10651
Etcetera...
[/font]
Did I do something wrong, or is it unwise to use a table variable for such operations? Or, eh..., both?
Table variables are not a good choice when you have a large number of records. The query optimizer assumes that the number of records in a table variable is always 1, so you won't get an efficient query plan if you have a large number of records. Based on the query above, you probably should have used a properly indexed temporary table instead.
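For example, the @data table variable above could be swapped for something along these lines (just a sketch, not tested against your setup):
-- Temp table instead of a table variable: it gets statistics and can carry extra indexes.
CREATE TABLE #data (row INT IDENTITY PRIMARY KEY, name CHAR(2));

INSERT INTO #data (name)
SELECT CASE WHEN id % 10 < 5 THEN 'RL' ELSE 'SL' END
FROM #Tally
ORDER BY id;

-- ...then run the same CTE query against #data instead of @data...

DROP TABLE #data;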
April 28, 2009 at 4:16 pm
[font="Verdana"]There's a point I'd like to raise.
When you are looking at something like a cursor and seeing how well it performs over a small number of rows, it's easy to think "well, that seems to be the best approach then."
There's an inherent assumption here: that your code will be the only code performing at the time. So how do cursors perform when you have 10 users running the same code at the same time? Or 100?
Just as many new database programmers use cursors (or equivalents) because they are familiar with procedural logic, they also assume that a test on an effectively single-user system is enough.
When you are developing database code, you need to ask some questions of the code:
1. What is the lifetime of this code? If it's a throwaway solution, then who cares. But if it's not, the chances are your code will be around for years. And if so, it's likely to be amended and reused.
2. What is the scalability of the code, in terms of data growth, parallelism, and concurrency?
Against both of these questions, cursors fare poorly. So it's not enough to say "oh, I tried my cursor code with 100 rows, and it works fine, thank you." What is the impact of introducing that code into a production system? Now, 1 year from now, 10 years from now?
That's why it's a good idea to learn the techniques that scale well. Get in the habit of using them. Practice and refine. Cursors are like comfort blankets... at some point they are best left behind.
April 28, 2009 at 4:40 pm
Hey, you leave my blankey out of it... 😉
😎 Kate The Great :w00t:
If you don't have time to do it right the first time, where will you find time to do it again?
April 28, 2009 at 6:20 pm
Heaps of thanks!
I have a cursor inside a trigger, and I always thought 'cursor' comes from the word 'curse'.
I had a stored proc written with the C word, which I converted; after reading your third part I'm going to deal with my cursor-in-a-trigger problem.
Once again, heaps of thanks. Your place in heaven is guaranteed.
April 29, 2009 at 1:23 am
Hi Barry,
Been following your article and it really is nice...
I have a case, though, where I am not able to remove the cursor (or iteration approach, for that matter). Can you suggest something?
I'll give a dummy scenario...
SCHEMA INFORMATION:
We have the following tables in a database (say MasterDatabase):
1. a user table containing data about the users
2. a project table listing the various projects
3. a link table, say user_projects, linking the users to projects on a time basis (i.e. from 'date' to 'date')
Since the data is huge, the users' performance data is regularly backed up into separate databases, say "PerformanceHistoryDB1", "PerformanceHistoryDB2", "PerformanceHistoryDB3" and so on...
When this backup occurs, the user_projects table is updated to store, for each row, the corresponding database reference in a column (say BackupDBID), indicating where the performance data for that user on that project within that time frame can be found.
All these history databases have the same schema. In particular, there is a table
Performancedata within each history database that stores the rating and is mapped to the primary key 'user_projectsId' of the user_projects table in MasterDatabase ...
THE PROBLEM IS:
We need to get the corresponding ratings of all the users on particular project within a time frame.
Please note that each user's data might be in a separate history database's Performancedata table.
Currently we have tried two or three approaches.
Approach 1:
Use a cursor to fetch the DBID for each row, and then in each iteration get the corresponding data from the Performancedata table in that particular DB.
Approach 2:
Use separate queries, one for each database, and then return the union of these result sets (see the rough sketch below). The problem with this approach is that the number of history databases keeps growing, which makes it a maintenance headache: handling for each new history database has to be added by hand.
Approach 3:
Use a left join with the Performancedata table of each database and a CASE statement to check for null results, showing the result from the database that actually returns data.
Obviously the problem here is performance/efficiency, plus the same maintenance problem as approach 2.
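Roughly, approach 2 looks like this (the project filter column and parameter names here are simplified placeholders):
-- One query per history database, UNIONed together (placeholder names for the filter).
SELECT up.user_projectsId, pd.rating
FROM MasterDatabase.dbo.user_projects AS up
JOIN PerformanceHistoryDB1.dbo.Performancedata AS pd
    ON pd.user_projectsId = up.user_projectsId
WHERE up.projectId = @projectId
UNION ALL
SELECT up.user_projectsId, pd.rating
FROM MasterDatabase.dbo.user_projects AS up
JOIN PerformanceHistoryDB2.dbo.Performancedata AS pd
    ON pd.user_projectsId = up.user_projectsId
WHERE up.projectId = @projectId
-- ...and another block has to be added for every new PerformanceHistoryDB that gets created...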
I really am in a dilemma.
Please help.
Thanks
Sharad
April 29, 2009 at 2:43 am
Since I was convinced yesterday that cursors should be eliminated whenever possible, I have another cursor which I would be happy to get rid of if I can.
-- create the test tables
IF OBJECT_ID('dbo.TestData') IS NOT NULL
DROP TABLE dbo.TestData
IF OBJECT_ID('dbo.FactRL') IS NOT NULL
DROP TABLE dbo.FactRL
IF OBJECT_ID('dbo.FactSL') IS NOT NULL
DROP TABLE dbo.FactSL
CREATE TABLE dbo.TestData
(Name varchar(4) not null,
RowNumber int not null,
DataID int not null,
CONSTRAINT [PK_TestData] PRIMARY KEY CLUSTERED (
[RowNumber] ASC
))
CREATE TABLE dbo.FactSL
( DataID int not null,
DataText varchar(100) not null
CONSTRAINT [PK_FactSL] PRIMARY KEY CLUSTERED (
[DataID] ASC))
CREATE TABLE dbo.FactRL
( DataID int not null,
DataTime datetime,
DataText varchar(100) not null
CONSTRAINT [PK_FactRL] PRIMARY KEY CLUSTERED (
[DataID] ASC))
-- populate test data using a tally table as suggested; note that RL and SL DataIDs can be identical
SET ROWCOUNT 100;
SELECT IDENTITY(int,101,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
INSERT INTO dbo.TestData(Name, RowNumber, DataID) SELECT CASE WHEN tally.id/10%2 = 0 THEN 'RL' ELSE 'SL' END, tally.id - 100, tally.id FROM #Tally tally;
INSERT INTO dbo.FactRL(DataID, DataTime, DataText) SELECT tally.id, GETDATE(), 'RL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally
INSERT INTO dbo.FactSL(DataID, DataText) SELECT tally.id, 'SL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally
DROP TABLE #Tally;
-- prepare the function, using the solution provided by Peter
create function dbo.TestNumberRange_nocursor1() returns table
as
return
(
with
lowerEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects lower edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber - 1 )
)
, upperEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects upper edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber + 1 )
)
select
lowerEdgesQ.Name
, lowerEdgesQ.RowNumber as StartNum
, upperEdgesQ.RowNumber as EndNum
from
lowerEdgesQ
inner join upperEdgesQ on upperEdgesQ.SegmentID = lowerEdgesQ.SegmentID
)
;
-- THE CURSOR. I thought of using a CASE statement instead, but it seems to me it cannot be used with a join.
declare @DataFact table
(
RowNumber int not null,
DataID int not null,
DataTime datetime null,
NameFrom varchar(4) not null,
DataText varchar(100) null
)
declare @DataRange table
(
TableFrom varchar(4) not null,
StartNum int not null,
EndNum int not null
)
declare @tableFrom varchar(4),
@startRowNo int,
@endRowNo int
insert into @DataRange (TableFrom,
StartNum ,
EndNum)
select * from dbo.TestNumberRange_nocursor1()
declare datarange_cursor cursor
for
select TableFrom, StartNum, EndNum from @DataRange
open datarange_cursor
fetch datarange_cursor
into @tableFrom, @startRowNo, @endRowNo
while (@@fetch_status = 0)
begin
if @tableFrom = 'RL'
insert into @DataFact (RowNumber,
DataID, DataTime, NameFrom, DataText)
select
td.RowNumber,
td.DataID,
rl.DataTime,
'RL' as NameFrom,
rl.DataText
from TestData td
JOIN FactRL rl ON td.DataID = rl.DataID
where td.RowNumber between @startRowNo and @endRowNo
else if @tableFrom = 'SL'
insert into @DataFact (RowNumber,
DataID, DataTime, NameFrom, DataText)
select
td.RowNumber,
td.DataID,
NULL as DataTime,
'SL' as NameFrom,
sl.DataText
from TestData td
JOIN FactSL sl ON td.DataID = sl.DataID
where td.RowNumber between @startRowNo and @endRowNo
fetch datarange_cursor into @tableFrom, @startRowNo, @endRowNo
end
close datarange_cursor
deallocate datarange_cursor
select * from @DataFact order by RowNumber
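I thought the whole thing might collapse into one set-based insert along these lines, but I'm not sure it's right (untested sketch; the CASE-with-left-joins combination is exactly what I was unsure about):
-- Set-based sketch: join each row to its range, then pick the matching fact table per range.
insert into @DataFact (RowNumber, DataID, DataTime, NameFrom, DataText)
select
    td.RowNumber,
    td.DataID,
    case dr.TableFrom when 'RL' then rl.DataTime else null end,
    dr.TableFrom,
    case dr.TableFrom when 'RL' then rl.DataText else sl.DataText end
from @DataRange dr
join dbo.TestData td
    on td.RowNumber between dr.StartNum and dr.EndNum
left join dbo.FactRL rl
    on dr.TableFrom = 'RL' and rl.DataID = td.DataID
left join dbo.FactSL sl
    on dr.TableFrom = 'SL' and sl.DataID = td.DataID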