July 21, 2006 at 10:29 am
Mike C wrote:
"Very, very, very rarely is a cursor absolutely necessary. There is almost always a way to conquer T-SQL tasks using T-SQL, sans cursors. Over-reliance on cursors is a fallback position for many people who come from backgrounds in flat-file processing, and want to use SQL server as a flat-file parsing tool."
OK, Mike and all you other cursor exterminators: do you have any ideas on how to use a set-based approach when a data set has to be distributed over two or more "categories" that overlap?
That question sounds pretty vague, so the following real-world example should make the problem clearer. I couldn't figure out how to solve it without an RBAR (Row By Agonizing Row) method. (Thanks for the "Modenism", Jeff.)
Suppose you have a table of payables that need to be referenced by records in a table of payments. The payment table associates amounts from fund sources to the payable records. The business rule states that payments are made against fund source A until it is exhausted and then fund source B is used.
When one or more rows are inserted into the payable table, the insert trigger creates payment records based on that rule.
Here are the simplified table definitions:
CREATE TABLE Payable (
    PayableID int IDENTITY,
    Amount money
)
CREATE TABLE Payment (
    PaymentID int IDENTITY,
    PayableID int,
    AmountPaid money,
    FundSourceID char(1)
)
Here's the insert trigger's pseudocode, using an RBAR solution:
Create cursor on inserted payable records
While inserted payable records remain
    If payable amount <= fund source A balance
        Insert payment record with payable's ID, fund source A ID, and payable amount
        Reduce fund source A balance by the payable amount
    Else
        If fund source A balance > 0
            Insert payment record with payable's ID, fund source A ID, and fund source A's balance
            Insert payment record with payable's ID, fund source B ID, and payable amount - fund source A's balance
            Set fund source A balance to 0 and reduce fund source B balance by the remainder
        Else
            Insert payment record with payable's ID, fund source B ID, and payable amount
            Reduce fund source B balance by the payable amount
        End
    End
    Get next payable record
End Loop
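For concreteness, here's roughly what that pseudocode looks like as an actual T-SQL trigger. This is just a sketch: it assumes a hypothetical FundSource table (FundSourceID char(1), Balance money) holding the running balances, and it leaves out the error checking mentioned below.
CREATE TRIGGER trg_Payable_Insert ON Payable AFTER INSERT
AS
BEGIN
    DECLARE @PayableID int, @Amount money, @BalanceA money

    -- RBAR: walk the inserted rows one at a time
    DECLARE pay_cur CURSOR LOCAL STATIC FOR
        SELECT PayableID, Amount FROM inserted ORDER BY PayableID

    OPEN pay_cur
    FETCH NEXT FROM pay_cur INTO @PayableID, @Amount

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Current balance of fund A (FundSource table is assumed)
        SELECT @BalanceA = Balance FROM FundSource WHERE FundSourceID = 'A'

        IF @Amount <= @BalanceA
        BEGIN
            -- Fund A covers the whole payable
            INSERT INTO Payment (PayableID, AmountPaid, FundSourceID)
            VALUES (@PayableID, @Amount, 'A')

            UPDATE FundSource SET Balance = Balance - @Amount
            WHERE FundSourceID = 'A'
        END
        ELSE IF @BalanceA > 0
        BEGIN
            -- Split: drain the rest of A, take the remainder from B
            INSERT INTO Payment (PayableID, AmountPaid, FundSourceID)
            VALUES (@PayableID, @BalanceA, 'A')

            INSERT INTO Payment (PayableID, AmountPaid, FundSourceID)
            VALUES (@PayableID, @Amount - @BalanceA, 'B')

            UPDATE FundSource SET Balance = 0 WHERE FundSourceID = 'A'

            UPDATE FundSource SET Balance = Balance - (@Amount - @BalanceA)
            WHERE FundSourceID = 'B'
        END
        ELSE
        BEGIN
            -- A is exhausted; pay entirely from B
            INSERT INTO Payment (PayableID, AmountPaid, FundSourceID)
            VALUES (@PayableID, @Amount, 'B')

            UPDATE FundSource SET Balance = Balance - @Amount
            WHERE FundSourceID = 'B'
        END

        FETCH NEXT FROM pay_cur INTO @PayableID, @Amount
    END

    CLOSE pay_cur
    DEALLOCATE pay_cur
END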
In case this is not clear, here's some sample "data" to illustrate the business need:
Inserted Payable Records - These need payments from fund sources.
ID Amount
1 $10
2 $20
3 $30
Fund Sources - (Pay from Fund A first)
ID FundAllocation
A $15
B $1000
Payment Records - These are the payments inserted by the trigger
ID PayableID AmountPaid FundSourceID
1 1 $10 A
2 2 $5 A
3 2 $15 B
4 3 $30 B
You can see that PayableID 2 had to be split across sources A and B. This is the part that confounds me when trying to create a set-based solution.
Obviously there's error-checking code elsewhere to prevent overdrawing fund source B, etc. What I'm interested in is a set-based solution for a "rollover" situation like this.
July 21, 2006 at 12:01 pm
It's a variation of the "knapsack" problem, which is a very old one: if you've ever spent time trying to squeeze one more thing into a suitcase for a vacation or business trip, you've faced it. The knapsack problem is concerned with the most profitable and efficient way to fill a knapsack with items of differing values and sizes; it's a very common problem when trying to figure out the most efficient use of space in cargo containers, and a variation applies here. In this instance I'm filling the knapsack with pennies from two jars; we can assume all the pennies have the same size and value. After I assign the pennies from the jars to their respective knapsacks, I recombine them into larger amounts (dollars and cents).
Like I said, cursors are used a lot more than they need to be, mostly by people who can't wrap their heads around set-based logic. Not that I blame them: we don't normally (consciously) think in set-based mathematical terms, and that's reflected in most of the popular programming languages. We tend to think in a step-by-step procedural fashion, and set-based thinking takes a while to get used to.
So anyway, here you go: a set-based solution to your problem, from Mike C the cursor exterminator:
CREATE TABLE #Payable (
PayableID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
Amount money
)
INSERT INTO #Payable (Amount) VALUES (10.00)
INSERT INTO #Payable (Amount) VALUES (20.00)
INSERT INTO #Payable (Amount) VALUES (30.00)
CREATE TABLE #Funds (
FundID CHAR(1),
Amount NUMERIC(10, 2)
)
INSERT INTO #Funds (FundID, Amount) VALUES ('A', 15.00)
INSERT INTO #Funds (FundID, Amount) VALUES ('B', 1000.00)
CREATE TABLE #Payment (
PaymentID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
PayableID INT,
AmountPaid NUMERIC(10, 2),
FundSourceID CHAR(1)
)
-- Build a 10,000-row numbers table (1 through 10,000) by cross-joining
-- system tables
SELECT TOP 10000 IDENTITY(INT, 1, 1) AS Num
INTO #Numbers
FROM dbo.syscolumns a
CROSS JOIN dbo.syscolumns b
CROSS JOIN dbo.syscolumns c

-- Turn it into a table of pennies: 0.01 through 100.00. (10,000 pennies
-- is $100.00, enough for this sample; size it to cover the largest
-- cumulative amount in real use.)
CREATE TABLE #Dollars (Num NUMERIC(10, 2) NOT NULL PRIMARY KEY)

INSERT INTO #Dollars (Num)
SELECT CAST(n.Num AS NUMERIC(10, 2)) / 100.0
FROM #Numbers n
-- Explode payables and funds into one row per penny, numbering every penny
-- by its position in the overall running total; equal positions then tie
-- each penny of a payable to the fund that covers it
INSERT INTO #Payment (PayableID, FundSourceID, AmountPaid)
SELECT a.PayableID, a.FundID, COUNT(*) / 100.0 -- pennies back to dollars
FROM
(
    SELECT PayableID, FundID
    FROM
    (
        -- each payable's pennies, offset by the sum of all prior payables
        SELECT p.PayableID,
        (
            SELECT COALESCE(SUM(Amount), 0)
            FROM #Payable p1
            WHERE p1.PayableID < p.PayableID
        ) + #Dollars.Num AS Num
        FROM #Dollars
        JOIN #Payable p ON (#Dollars.Num BETWEEN 0.01 AND p.Amount)
    ) Payable_Items
    JOIN
    (
        -- each fund's pennies, offset by the sum of all prior funds
        SELECT f.FundID,
        (
            SELECT COALESCE(SUM(Amount), 0)
            FROM #Funds f1
            WHERE f1.FundID < f.FundID
        ) + #Dollars.Num AS Num
        FROM #Dollars
        JOIN #Funds f ON (#Dollars.Num BETWEEN 0.01 AND f.Amount)
    ) Fund_Items
    ON Fund_Items.Num = Payable_Items.Num
) a
GROUP BY a.PayableID, a.FundID
SELECT *
FROM #Payment
-- Draw down each fund by the total just paid from it (COALESCE guards
-- any fund from which nothing was paid)
UPDATE #Funds
SET Amount = Amount -
(
    SELECT COALESCE(SUM(p.AmountPaid), 0)
    FROM #Payment p
    WHERE #Funds.FundID = p.FundSourceID
)
SELECT *
FROM #Funds
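-- With the sample data, #Payment should hold the rows
-- (PayableID, AmountPaid, FundSourceID): (1, 10.00, A), (2, 5.00, A),
-- (2, 15.00, B), (3, 30.00, B); #Funds should end at A = 0.00, B = 955.00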
DROP TABLE #Dollars
DROP TABLE #Numbers
DROP TABLE #Funds
DROP TABLE #Payment
DROP TABLE #Payable
July 21, 2006 at 12:57 pm
Wow! Look Mom, no cursors!
I'll have to expand my set-based thinking to include the imaginary set of "all" possible values. (The #Dollars table in your example.)
Thanks Mike!
July 21, 2006 at 1:16 pm
No problem. The most useful invention ever created for converting procedural code to set-based code (IMHO) is the inner join to a basic numbers table. #Dollars is just a decimal numbers table with two digits of precision, representing pennies; come to think of it, a better name for it probably would have been "#Pennies". It's pretty amazing what you can do with a numbers table once you get the hang of it.
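For instance, a numbers table turns "generate one row per day" into a single join. A sketch reusing the #Numbers table built in my earlier post (the date range is arbitrary):
-- One row per day of 2006, produced set-based with no loop
SELECT DATEADD(dd, n.Num - 1, '20060101') AS CalendarDate
FROM #Numbers n
WHERE n.Num <= DATEDIFF(dd, '20060101', '20061231') + 1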
July 21, 2006 at 1:29 pm
"So SQL Server has always had cursors because of its origins. But set based queries are orders of magnitude more efficient. I once re-wrote an entire order processing system originally written using cursors* to use set based queries. Prior to the re-write it would take approximately 1.5 - 2 hours to process 10,000 orders. Post re-write the same process took less than 2 minutes."
I've had similar experiences with replacing cursors. By and large, the code that I have changed from cursor-based to set-based has run hundreds to thousands of times faster. The few exceptions where the cursor was faster turned out to be issues with the way the tables were indexed. Once we resolved those issues, the cursor-based and set-based solutions both improved, but the set-based solutions ended up outperforming the cursors by a wide margin.
July 21, 2006 at 2:12 pm
"It's a variation of the "knapsack" problem. "
Darn... I was too slow. But I'll post my code anyway, as another variation on the same concept.
if object_ID('Payable') is not null drop table Payable
Create Table Payable (
    ID int identity,
    Amount int
)
Insert Payable values (10)
Insert Payable values (20)
Insert Payable values (30)

if object_ID('Source') is not null drop table Source
Create Table Source (
    ID char(1),
    Amount int
)
Insert Source values ('A', 15)
Insert Source values ('B', 35)
Insert Source values ('C', 100)

-- Assumes a pre-built auxiliary table dbo.Number with a NumberID column
-- (1, 2, 3, ...). Each payable and each source is mapped to a contiguous
-- range of unit positions in its running total (SUM over all prior rows,
-- so the ranges tile correctly end to end); joining on position matches
-- each unit of a payable to the source that covers it.
select p.ID as PayableID, s.ID as SourceID,
    (max(p.NumberID) - min(p.NumberID)) + 1 as Amount
From
(
    select p1.ID, pn.*
    From
    (
        select p2.ID,
            isnull((
                select sum(Amount)
                from Payable p3
                where p3.ID < p2.ID
            ), 0) + 1 as minAmount,
            isnull((
                select sum(Amount)
                from Payable p3
                where p3.ID <= p2.ID
            ), 0) as maxAmount
        from Payable p2
    ) p1
    JOIN dbo.Number pn on pn.NumberID between p1.minAmount and p1.maxAmount
) as p
JOIN
(
    select s1.ID, sn.*
    From
    (
        select s2.ID,
            isnull((
                select sum(Amount)
                from Source s3
                where s3.ID < s2.ID
            ), 0) + 1 as minAmount,
            isnull((
                select sum(Amount)
                from Source s3
                where s3.ID <= s2.ID
            ), 0) as maxAmount
        from Source s2
    ) s1
    JOIN dbo.Number sn on sn.NumberID between s1.minAmount and s1.maxAmount
) as s on p.NumberID = s.NumberID
group by p.ID, s.ID
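-- With the sample data, this returns (PayableID, SourceID, Amount):
-- (1, A, 10), (2, A, 5), (2, B, 15), (3, B, 20), (3, C, 10)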
Signature is NULL
July 21, 2006 at 2:42 pm
My recollection is that Sybase added cursors in the early '90s to keep from being run out of business by Oracle. In the mid-'80s I remember listening to a Sybase dude talk about how horrible cursors were and how no one using Sybase should ever need them. After getting laughed at for too long, Sybase changed its tune.
Now if you guys want to think that those who use cursors are nuts/stupid/incompetent (non-relational heavens forbid), go right ahead.
All it is is a tool. Those who categorically refuse to even consider using it are hopeless!!! You can join the guy who thinks using temp tables is a mortal sin!
July 21, 2006 at 6:02 pm
"All it is is a tool. Those who categorically refuse to even consider using it are hopeless!!! You can join the guy who thinks using temp tables is a mortal sin!"
If 99.99999% of your problems are 'nails', the tool you should probably be using is a hammer... not a hacksaw.
July 21, 2006 at 6:26 pm
I would not "categorically refuse to even consider using (cursors)". I would simply say that it should be a last resort. I can't think of any good reason why you would want to use a RAR solution when a set based solution is available. In more than 20 years of coding for a variety of databases I have yet to see a problem that demands a cursor. Occasionally (very very very rarely) you need to cut that nail with a hacksaw, It's good that cursors exist.
Ron Cicotte
ron.cicotte@gmail.com
Data Transformations, LLC
July 21, 2006 at 6:36 pm
It's important to remember to balance the cost of the time to run a routine versus the cost of the time to develop the routine.
If there's something you only expect to ever need to do once or twice, and if a cursor solution provides adequate speed for your needs, don't spend half a day figuring out a set-based solution (unless your job is such that learning the new techniques involved is viewed as a benefit in and of itself).
Also, there's the issue of maintainability: will the next person to touch the code be able to follow the SQL you've put together? I've been explicitly asked to keep my solutions very simple, even if not necessarily optimal (as long as they're not too bad), so as to avoid confusing others I've worked with.
Of course, if you're going to use cursors, make them LOCAL STATIC; otherwise, performance can be truly horrible. Just adding LOCAL STATIC to a cursor declaration has gotten me performance increases of the sort mentioned here for full cursor-to-set-based rewrites.
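The options go right into the declaration; a minimal sketch with illustrative names:
-- LOCAL scopes the cursor to this batch/procedure; STATIC materializes
-- the result set once in tempdb, so fetches don't re-touch the base tables
DECLARE order_cur CURSOR LOCAL STATIC FOR
    SELECT OrderID, CustomerID
    FROM dbo.Orders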
And yes, I think I did repeat myself, if you're reading through the whole thread. Sorry.
R David Francis
July 21, 2006 at 7:10 pm
There is always more than one way to solve a problem. While my personal bias is toward set based solutions for the many reasons posted here, RD Francis and others have pointed out that there are other things besides performance to consider.
The bottom line is know your toolsets and be careful how you use them. Remember that databases tend to grow and what works today and is "not too bad" may create a horrible bottleneck a few months down the road.
Since I tend to avoid cursors, I cannot vouch for the contention that adding LOCAL STATIC to a cursor declaration will produce the kind of performance increases that a set-based solution would. I find it conceptually unconvincing, though, since it still requires RBAR processing, whereas a set-based solution changes values all at once. I would need to see a demonstration on a non-trivial example to be convinced.
Ron Cicotte
ron.cicotte@gmail.com
Data Transformations, LLC
July 21, 2006 at 8:41 pm
Ron says:
"Since I tend to avoid cursors I cannot vouch for the contention that "adding LOCAL STATIC to a cursor declaration will produce the kind of performance increases that a sets-based solution would."
Allow me to clarify: in the cases I'm referring to, it's entirely possible that a set-based solution would have been an even more significant improvement over the LOCAL STATIC cursor; I'm just saying that there was a significany jump in performance.
Also note that I can't think of any cases where I'm using a cursor to update values in an existing field; I'm pretty much always selecting data off into another table for future use.
For me, the big revelation with regular cursors happened when I had two processes running in parallel, one working on several thousand records in a very large table, the other (as it turned out) working on *four* records. Both operations were still going after several hours, and some WITH (NOLOCK) queries looking at the results showed that both were moving through the data at a proportional rate, and would have finished at roughly the same time!
R David Francis
July 22, 2006 at 12:08 am
Nice code... just a bit of food for thought, though... correlated sub-queries with triangular joins (a < b) can be as bad as or worse than cursors on large tables. Not advocating cursors at all... just want everyone to be aware that correlated sub-queries are not the panacea they seem to be. They are a serious form of "RBAR" (Row By Agonizing Row), and their cost grows with the number of rows n on the order of (n-1)²/2 or even (n-1)². For example... this seemingly harmless running-total code appears to fly (if you want to call 1.25 seconds "flying") on 1,000 test records. Put it in production with as few as 10,000 records and it takes not 10, but 100 times longer at 121.2 seconds; the duration grows quadratically with the number of records... even a cursor will beat that!
SELECT t1.TransactionID,
t1.Amount,
(SELECT SUM(Amount)
FROM TestTable
WHERE TransactionID <= t1.TransactionID) AS 'Running Total'
FROM TestTable t1
Correlated subqueries are inherently not set-based because, as in the code above, the sub-query is executed once for each row of the outer query. The same thing can make user-defined functions a nasty source of hidden RBAR.
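For comparison, here's a sketch of the same running total without the triangular join, using a windowed SUM. (Note: this syntax arrived in SQL Server 2012, so it wasn't an option when this thread was written.)
-- One ordered pass over the data; no per-row re-summing of prior rows
SELECT TransactionID,
       Amount,
       SUM(Amount) OVER (ORDER BY TransactionID
                         ROWS UNBOUNDED PRECEDING) AS [Running Total]
FROM TestTable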
--Jeff Moden
Change is inevitable... Change for the better is not.
July 22, 2006 at 2:09 pm
That's why I said "99.99999%" instead of 100%, in recognition that 0.00001% of the time you might find it necessary to pound your nails into the board with that hacksaw.
July 23, 2006 at 2:05 pm
Sometimes you can improve performance even more by using a properly indexed temp table to hold the intermediate results of the two queries instead of one large query with several correlated subqueries. The efficiency depends, in large part, on how well the query optimizer does its job.
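As a sketch of that idea (purely illustrative, reusing table names from the example earlier in the thread): compute each intermediate result once into a temp table, index it, and join the temp tables instead of repeating the correlated subqueries.
-- Compute each payable's running-total offset exactly once...
SELECT p.PayableID,
       p.Amount,
       (SELECT COALESCE(SUM(Amount), 0)
        FROM #Payable p1
        WHERE p1.PayableID < p.PayableID) AS PriorTotal
INTO #PayableRanges
FROM #Payable p

-- ...then index it so the later joins against it are cheap
CREATE CLUSTERED INDEX IX_PayableRanges ON #PayableRanges (PriorTotal)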
As an example, I used a very similar (but more complex) query to create "product pull-lists" that match up AdventureWorks inventory on SQL Server 2005 (1,069 rows in the Production.ProductInventory table) to sales order details (121,317 rows in the Sales.SalesOrderDetail table). The query is a lot more complex since there are many different products that can be ordered, each must match up with the proper order detail rows, there are several orders to fill, and the products are pulled from several different bins. Here are the business rules I used to design this query:
I've implemented it as one query, with no explicit intermediate temp tables. The query I wrote takes about 6 seconds on my workstation. I haven't done a cursor-based version, but one could be created based on the business rules above for comparison.