April 29, 2009 at 3:39 am
min.li,
Without writing a whole pile of SQL, here are a few tips:
1. union
UNION can be used to combine datasets.
e.g.
select *
from TableA TA
inner join TableB TB
on TA.ID = TB.ID
where TA.Column = 'RL'
UNION ALL
select *
from TableA TA
inner join TableC TC
on TA.ID = TC.ID
where TA.Column = 'SL'
UNION ALL
select *
from TableA TA
inner join TableD TD
on TA.ID = TD.ID
where TA.Column = 'TL'
2. better use of the join criteria
Select TA.id, isnull(TB.Column, TC.column)
from TableA TA
left outer join TableB TB
on TA.ID = TB.ID
and TA.column = 'RL'
left outer join TableC TC
on TA.ID = TC.ID
and TA.column = 'SL'
3. Derived Tables (can also be done with a CTE)
Select *
from TableA TA
inner join (select 'RL' as dtype, ID, Column from TableB
union all
Select 'SL' as dtype, ID, Column from TableC) DT
on TA.ID = DT.ID
and TA.Column = DT.dtype
April 29, 2009 at 3:40 am
This query leads to an identical resultset, but I have no idea what you are trying to accomplish.
SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, COALESCE(FactRL.datatext,factsl.datatext) DataText
FROM TestData
LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID
LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID
ORDER BY TestData.RowNumber
April 29, 2009 at 3:43 am
And, of course, the tables can be populated without a loop.
SET ROWCOUNT 10000;
SELECT IDENTITY(INT,101,1) AS id INTO #Tally FROM sysobjects a, sysobjects b;
SET ROWCOUNT 0;
INSERT INTO dbo.TestData(Name, RowNumber, DataID) SELECT CASE WHEN tally.id/10%2 = 0 THEN 'RL' ELSE 'SL' END, tally.id - 100, tally.id FROM #Tally tally;
INSERT INTO dbo.FactRL(DataID, DataTime, DataText) SELECT tally.id, GETDATE(), 'RL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally WHERE tally.id/10%2 = 0;
INSERT INTO dbo.FactSL(DataID, DataText) SELECT tally.id, 'SL text value ' + CAST(tally.id - 100 AS VARCHAR) FROM #Tally tally WHERE tally.id/10%2 <> 0;
DROP TABLE #Tally;
Tally ho!:-D
April 29, 2009 at 3:53 am
SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, COALESCE(FactRL.datatext,factsl.datatext) DataText
FROM TestData
LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID
LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID
ORDER BY TestData.RowNumber
I think there are a few issues with this solution:
In a real scenario, firstly, the DataID values could be the same in the FactRL and FactSL tables; I just quickly populated the test data, which happens to have distinct values in the two tables. Secondly, there are actually more than two FactXX tables involved; I only listed two for simplicity. Thirdly, more fields are merged from the different tables than illustrated here, and the field names are not guaranteed to be the same across the source tables (the data types are the same, though).
April 29, 2009 at 4:07 am
Regarding the duplicate DataIDs: Well, then we can use a UNION.
SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, FactRL.DataText
FROM TestData
JOIN FactRL ON TestData.DataID = FactRL.DataID
UNION
SELECT TestData.RowNumber, TestData.DataID, NULL ,TestData.Name NameFrom, factsl.datatext DataText
FROM TestData
JOIN FactSL ON TestData.DataID = FactSL.DataID
/* UNION the other tables here, unless their number varies */
ORDER BY TestData.RowNumber
And add some extra UNIONS in case of TL, UL, ...
April 29, 2009 at 4:18 am
gserdijn (4/29/2009)
Regarding the duplicate DataIDs: Well, then we can use a UNION.
SELECT TestData.RowNumber, TestData.DataID, FactRL.DataTime,TestData.Name NameFrom, FactRL.DataText
FROM TestData
JOIN FactRL ON TestData.DataID = FactRL.DataID
UNION
SELECT TestData.RowNumber, TestData.DataID, NULL ,TestData.Name NameFrom, factsl.datatext DataText
FROM TestData
JOIN FactSL ON TestData.DataID = FactSL.DataID
/* UNION the other tables here, unless their number varies */
ORDER BY TestData.RowNumber
And add some extra UNIONS in case of TL, UL, ...
If you test it with my newly edited test data population script, you will see that this solution is not correct. The joins have to be based on the value of TestData.Name. The number of records in the returned result set should be 100 instead of 200.
April 29, 2009 at 4:21 am
Again, I don't see anything here that needs a cursor; let me have a go at it!
April 29, 2009 at 5:19 am
Back...
First I modified one of the source tables after looking at the queries.
create nonclustered index ix_TestData_DataID on dbo.TestData ( DataID );
I also added one more result column to my previous function as it might come in handy at some point.
alter function dbo.TestNumberRange_nocursor1() returns table
as
return
(
with
lowerEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects lower edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber - 1 )
)
, upperEdgesQ( SegmentID, Name, RowNumber ) as
(
select
row_number() over ( order by d.RowNumber asc )
, d.Name
, d.RowNumber
from
dbo.TestData as d
where
-- detects upper edge of each name
not exists( select 1 from dbo.TestData as i where i.Name = d.Name and i.RowNumber = d.RowNumber + 1 )
)
select
lowerEdgesQ.SegmentID as SegmentID
, lowerEdgesQ.Name as Name
, lowerEdgesQ.RowNumber as StartNum
, upperEdgesQ.RowNumber as EndNum
from
lowerEdgesQ
inner join upperEdgesQ on upperEdgesQ.SegmentID = lowerEdgesQ.SegmentID
)
;
Here is my conversion of the to-be-improved code min.li provided:
;
with
segmentQ( Name, StartNum, EndNum ) as
(
select
tn.Name
, tn.StartNum
, tn.EndNum
from
dbo.TestNumberRange_nocursor1() as tn
/* Add this WHERE clause if you do not want to process all segments every time.
where
tn.SegmentID between 1 and 5
*/
)
/* RL data */
, RLdataQ( NameFrom, RowNumber, DataID, DataTime, DataText ) as
(
select
s.Name
, t.RowNumber
, t.DataID
, f.DataTime
, f.DataText
from
segmentQ as s
inner join dbo.TestData as t on t.RowNumber between s.StartNum and s.EndNum
inner join dbo.FactRL as f on f.DataID = t.DataID
where
s.Name = 'RL'
)
/* SL data */
, SLdataQ( NameFrom, RowNumber, DataID, DataTime, DataText ) as
(
select
s.Name
, t.RowNumber
, t.DataID
, NULL
, f.DataText
from
segmentQ as s
inner join dbo.TestData as t on t.RowNumber between s.StartNum and s.EndNum
inner join dbo.FactSL as f on f.DataID = t.DataID
where
s.Name = 'SL'
)
/* The actual query */
select RowNumber, DataID, DataTime, NameFrom, DataText from SLdataQ
union all select RowNumber, DataID, DataTime, NameFrom, DataText from RLdataQ
order by
RowNumber
;
You can see I kept a lot of whitespace in the query to aid readability and put the partial results in separate common table expressions in order to keep the final query as clear as possible. The WHERE clauses in the table-specific queries are essential for speed, as they cut down the number of rows to be processed before any joins take place. They thus have the same function as the IF constructs in the original code!
Now the results you have all been waiting for 🙂
For 100 records as in the supplied test script:
/* cursor original by min.li */
CpuMs LogRds Elapsed
----------- -------------------- -----------
0 863 3
/* cursor free by peter */
CpuMs LogRds Elapsed
----------- -------------------- -----------
0 834 3
Not shocking, but now let’s take a look at the 10,000 row case!
/* cursor original by min.li */
CpuMs LogRds Elapsed
----------- -------------------- -----------
1672 37199 1694
/* cursor free by peter */
CpuMs LogRds Elapsed
----------- -------------------- -----------
125 3328 149
Oops, someone call 911 😉
This demonstrates that cursors really don't scale and that you have to test your solution with larger-than-expected data set sizes; doing so magnifies problems you would otherwise not detect. And don’t forget the original code was already partially optimized by removing the first cursor. Imagine what it would have been like with both cursors still in there and 10,000 rows in the source table!
April 29, 2009 at 5:36 am
Thanks a million, peter.
You enlightened me on CTEs. Although I had read a little about CTEs, I hadn't really used them so far. You showed me how powerful they are, especially for replacing cursors. I thought CTEs were just for hierarchy processing. I was convinced these two cursors were not replaceable, because I was looking for an approach that didn't involve CTEs. Next time I see a cursor, I should think of CTEs first.
April 29, 2009 at 5:55 am
Thanks, min.li, for the compliment; like you, I am happy with the outcome.
Be aware, though, that technically CTEs are not strictly required for this solution; they do, however, really improve the readability of the code by clearly expressing the individual parts of the query. This becomes especially apparent when many parts require the same input. In such cases CTEs can cut out redundant SQL code, resulting in a better expression of intent and improved readability.
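To make that concrete, here is a minimal sketch of the idea; all table and column names in it are invented and have nothing to do with min.li's schema. The CTE is written once and referenced by two parts of the query that would otherwise each have to repeat the same derived table.
with recentSalesQ( CustomerID, SaleAmount ) as
(
    select
        CustomerID
        , SaleAmount
    from
        dbo.Sales
    where
        -- only last month's rows; this filter is stated exactly once
        SaleDate >= dateadd( month, -1, getdate() )
)
select
    totalsQ.CustomerID
    , totalsQ.TotalAmount
    , peaksQ.MaxAmount
from
    ( select CustomerID, sum( SaleAmount ) as TotalAmount from recentSalesQ group by CustomerID ) as totalsQ
    inner join ( select CustomerID, max( SaleAmount ) as MaxAmount from recentSalesQ group by CustomerID ) as peaksQ
        on peaksQ.CustomerID = totalsQ.CustomerID
;
Without the CTE, that one-month date filter would have to be repeated inside both derived tables; with it, the intent is stated once.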
Again, I am glad I could convince you to stay away from cursors.
April 29, 2009 at 6:25 am
min.li (4/29/2009)
If you test it with my new edited populate test data script, you will see that this solution is not correct. The join have to be based on the value of the TestData.Name. The number of records in return resultset should be 100 instead of 200.
Different input, different query.... 😉
I don't mean to be obnoxious, but this gives an identical resultset.
SELECT TestData.RowNumber, TestData.DataID,
CASE WHEN testdata.name = 'RL' THEN FactRL.DataTime END AS DataTime,
testdata.name,
CASE WHEN testdata.name = 'RL' THEN factrl.datatext ELSE factsl.datatext END AS DataText
FROM TestData
LEFT JOIN FactRL ON TestData.DataID = FactRL.DataID
LEFT JOIN FactSL ON TestData.DataID = FactSL.DataID
ORDER BY TestData.RowNumber
I will study peter's elegant solution tonight. Table function + CTE, learning every day!
April 29, 2009 at 7:17 am
Hi everybody,
I had asked this on the 12th page of this discussion. Seeing the active discussion, may I request that you post an answer to my query too? Please let me know in case further information (code/queries etc.) is required.
[Quote]
"Hi Barry,
Been following your article and it really is nice...
I have a case though where I am not able to remove the cursor/ or iteration approach for that matter.. Can you suggest something?
I'll give a dummy scenario...
SCHEMA INFORMATION:
We have the following tables in database (say : MasterDatabase)
1. a user table containing data about the users...
2. a project table listing the various projects
3. a link table, say user_projects, linking the users to projects on a time basis (i.e. from 'date' to 'date')
Since the data is huge, the users' performance data is regularly backed up and stored in separate databases, say "PerformanceHistoryDB1", "PerformanceHistoryDB2", "PerformanceHistoryDB3" and so on...
When this backup occurs, the user_projects table is updated to store the corresponding database reference in a column, say BackupDBID, for each row, so that the corresponding performance data can be found (for a user on a particular project within a time frame).
All these history databases have the same schema. In particular, there is a table PerformanceData within each history database that stores the rating and is mapped to the primary key 'user_projectsId' of the user_projects table in MasterDatabase...
THE PROBLEM IS:
We need to get the corresponding ratings of all the users on a particular project within a time frame.
Please note that each user's data might be in a different history database's PerformanceData table.
Currently we have used 2-3 approaches for this.
Approach 1:
Use a cursor to fetch the DBID for each row and then, in each iteration, get the corresponding data from the PerformanceData table in that particular DB.
Approach 2:
Use separate queries, one for each database, and then return the union of these result sets. The problem with this approach is that the number of history databases keeps growing, which turns into a maintenance hell: handling has to be added for each new history database.
Approach 3:
Use a LEFT JOIN with the PerformanceData table of each database and a CASE statement to check for NULL results and show the result from the database that actually returns data.
Obviously the problem here would be performance/efficiency, plus the same problem as in approach 2.
I really am in a dilemma.
Please help.
Thanks
Sharad "
[Unquote]
Thanks again
Sharad
April 29, 2009 at 7:39 am
I am happy to take a look at it, but right now I have to finish some actual work :(. In the meantime, if you can put together a test case with a functionally correct structure, some fictional data, and the expected results, I am sure a few of us will jump on the challenge.
If the test case is large, then add it as a SQL script to your post. And tell us the version of SQL Server you use; anything 2005 and later is preferred 😉
Cheers
April 29, 2009 at 8:26 am
sharad sinha (4/29/2009)
Approach 2: Use separate queries, one for each database, and then return the union of these result sets. The problem with this approach is that the number of history databases keeps growing, which turns into a maintenance hell: handling has to be added for each new history database.
The problem you seem to have with this particular solution is easily circumvented by using a view that does the UNION ALL and is then used in all your queries. To me, your case looks a lot like a classic partitioning situation with history stored in separate databases, where the data is perfectly segregated by a check constraint on a single-attribute primary key.
Read up on partitioning; the old method, and the only one I am familiar with, is to make the tables as described above and then build a view on top of them. You can then even insert data into the view, and the proper table in the proper database will be updated thanks to the check constraint on the PK. The only downside I experienced with that is that you have to specify every attribute of the view/table when doing an insert on it.
You can also rebuild your view nightly with a stored procedure: it can scan the available databases, check their schemas, and add any that match the one you seek to the view. It is best, however, to make this part of the deployment procedure that creates a new history database.
Again, partitioning is what you are looking for, methinks!
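Just to make that a bit more tangible, a sketch of such a view could look like the code below. Every database, table, and column name is guessed from your description (the Rating, UserID, FromDate, and ToDate columns in particular are assumptions), and the view has to be extended, by hand or by that deployment/rebuild procedure, whenever a new history database appears.
/* Sketch only: names are assumed from the description above.
   Each history database exposes the same PerformanceData table;
   the view glues them together so queries see one logical table. */
create view dbo.AllPerformanceData
as
    select user_projectsId, Rating
    from PerformanceHistoryDB1.dbo.PerformanceData
    union all
    select user_projectsId, Rating
    from PerformanceHistoryDB2.dbo.PerformanceData
    union all
    select user_projectsId, Rating
    from PerformanceHistoryDB3.dbo.PerformanceData
;
go
/* Queries in MasterDatabase then treat the view as one logical table. */
select up.UserID, up.ProjectID, pd.Rating
from dbo.user_projects as up
    inner join dbo.AllPerformanceData as pd
        on pd.user_projectsId = up.user_projectsId
where up.ProjectID = 42             -- the project you are after
  and up.FromDate >= '20090101'     -- assumed date range columns
  and up.ToDate   <  '20090201'
;
With a check constraint on the partitioning column in each member table, this can even be made an updatable partitioned view as described above; that detail is left out of the sketch.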
April 29, 2009 at 8:34 am
Lynn Pettis (4/27/2009)
greatheep (4/27/2009)
So, I have to agree with some of the thoughts I've seen on the board - namely, so far, the article(s) have not shown a case where I would ever have thought of using a cursor. At this point, I'm not convinced that cursors can ALWAYS be replaced by set-based processing, though I agree that they should be avoided. Just to stir the pot, let me present you with a situation in which we are using a cursor here at my company - a situation where I cannot see a useful way to eliminate the cursor.
So, we receive a data file from an external source on a regular basis. This file contains roughly 1.5 million rows, give or take. We load this file into a staging table, and then run a process on it. This process is a cursor, which opens a transaction, applies pieces of the data to various tables, and then marks the row as "complete" (by setting a field on the row) and commits the transaction. If any error is encountered while attempting to process the row, all the processing for that row is rolled back, the error message is written to an "error_msg" field on the row, and the row is marked as "in error". The cursor then moves on to the next row.
The purpose of this is that if a single row is in error, the specific problem is identified, and the rest of the data is not held up by the problem. The problems could be of various kinds - problems in the data file that was sent to us, rows or pages in the target table(s) locked by a user at the moment, etc. This could allow us to only have to reprocess a few rows, rather than the entire 1.5 million rows, which takes a while. Also, nailing down WHICH row caused the issue when processing as a set is VERY difficult.
So, the gauntlet is down - any suggestions on how to achieve the same benefits using set processing?
In summary, we want:
- If there is an error in a row, none of the various updates/inserts that are done should be kept - all should be rolled back.
- Any rows which have no error should process completely, committing changes to the database.
- We should be able to tell after the fact which rows processed completely, and which rows have an error, as well as the specific error each row experienced (they could be different for each row in error)
Unfortunately, it's a broken gauntlet. We can't do anything without requirements, DDL for the tables (staging and destination), sample data (in a consumable format -- see the first article linked in my signature block below), processing rules for the data, and expected results based on the sample data provided. If part of the "errors" is data already existing in the destination table(s), be sure to provide sample data for them as well.
greatheep, I'd be very interested to see the logic behind what you are doing and if you or anyone else can provide steps for improvement. We have similar needs in some of our financial/investment calculations in that we need to know exactly where something failed and require that the failure not affect other rows being processed.
Most of these stored procedures were written several years ago and the fact is that they work, although they may not work in the most efficient manner and do not scale too well either. Testing the results of a rewrite may take several man-months, and our efforts may be better spent optimizing the queries or table indexes within the nested(!) cursors rather than rewriting this as set-based logic.
If I had unlimited time and resources, a set-based rewrite would be the way to go.
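For what it is worth, the rough shape such a rewrite usually takes is sketched below; every table, column, and validation rule in it is invented, and it only covers data-quality errors, not runtime errors such as lock timeouts, which would still need their own retry handling.
/* Hypothetical sketch only: Staging, TargetTable, Customer and all columns are made-up names. */
-- flag rows that would fail, one set-based update per validation rule
update s
set    error_msg = 'Missing customer reference'
     , status    = 'E'
from   dbo.Staging as s
where  s.status is null
  and  not exists ( select 1 from dbo.Customer as c where c.CustomerID = s.CustomerID );
update s
set    error_msg = 'Amount out of range'
     , status    = 'E'
from   dbo.Staging as s
where  s.status is null
  and  ( s.Amount < 0 or s.Amount > 1000000 );
-- apply all remaining clean rows in a single transaction and mark them complete
begin transaction;
insert into dbo.TargetTable ( CustomerID, Amount )
select s.CustomerID, s.Amount
from   dbo.Staging as s
where  s.status is null;
update s
set    status = 'C'
from   dbo.Staging as s
where  s.status is null;
commit transaction;
Rows flagged 'E' stay in staging with their error text and can be reprocessed on their own, which is essentially what the per-row cursor was buying us.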