In the first article of this series, we started to deconstruct a stored procedure with the aim of improving its efficiency. Hopefully, you will have been able to absorb that information and are now eager to see more.
The number one reason I hear for creating a cursor routine is to allow for conditional processing of a limited number of rows. In the original stored procedure, we have this simple condition:
IF (@CustomerType = 'I')
BEGIN /* Individual */
    SELECT @CustomerName = FirstName + ' ' + LastName
      FROM Person.Contact
      JOIN Sales.Individual ON Contact.ContactID = Individual.ContactID
     WHERE Individual.CustomerID = @CustomerId
END
ELSE
BEGIN /* Store */
    SELECT @CustomerName = Name
      FROM Sales.Store
     WHERE CustomerID = @CustomerId
END
The simplest way of replicating the above conditional logic within a set-based statement is to use a LEFT JOIN. We can then use a CASE expression to reproduce the IF condition if a complex series of conditions is required or, as in this case, ISNULL/COALESCE. I will be the first to admit that this is a very simplistic example and that it will not be possible in all cases. If, for instance, one branch of the code contains multiple complex and lengthy processes, then you could be of the opinion that the cursor route is still the better option. Although it is very difficult, if not impossible, to speak for all scenarios on all systems, I would maintain that using the above techniques, perhaps combined with two or more UNION ALL'd statements, would still be the better option.
So our two possible alternatives for the above code would be:
SELECT Customer.CustomerId,
       CustomerType,
       ISNULL(Sales.Store.Name, Person.Contact.FirstName + ' ' + Person.Contact.LastName)
  FROM Sales.Customer
  LEFT JOIN Sales.Individual ON Sales.Customer.CustomerID = Sales.Individual.CustomerID
  LEFT JOIN Person.Contact   ON Sales.Individual.ContactID = Person.Contact.ContactID
  LEFT JOIN Sales.Store      ON Sales.Store.CustomerID = Sales.Customer.CustomerID
or
SELECT Customer.CustomerId,
       CustomerType,
       Person.Contact.FirstName + ' ' + Person.Contact.LastName AS CustomerName
  FROM Sales.Customer
  JOIN Sales.Individual ON Sales.Customer.CustomerID = Sales.Individual.CustomerID
  JOIN Person.Contact   ON Sales.Individual.ContactID = Person.Contact.ContactID
 WHERE Sales.Customer.CustomerType = 'I'
UNION ALL
SELECT Customer.CustomerId,
       CustomerType,
       Sales.Store.Name AS CustomerName
  FROM Sales.Customer
  JOIN Sales.Store ON Sales.Store.CustomerID = Sales.Customer.CustomerID
 WHERE Sales.Customer.CustomerType <> 'I'
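If the branching logic were more involved than a simple name lookup, the first alternative could use an explicit CASE expression on CustomerType instead of ISNULL, as mentioned above. A minimal sketch of that variant, using the same tables:

-- Sketch only: the CASE branch on CustomerType replaces the ISNULL in the first alternative
SELECT Customer.CustomerId,
       CustomerType,
       CASE CustomerType
            WHEN 'I' THEN Person.Contact.FirstName + ' ' + Person.Contact.LastName
            ELSE Sales.Store.Name
       END AS CustomerName
  FROM Sales.Customer
  LEFT JOIN Sales.Individual ON Sales.Customer.CustomerID = Sales.Individual.CustomerID
  LEFT JOIN Person.Contact   ON Sales.Individual.ContactID = Person.Contact.ContactID
  LEFT JOIN Sales.Store      ON Sales.Store.CustomerID = Sales.Customer.CustomerID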
As a side issue, your default choice when UNION'ing statements should be UNION ALL, not UNION. The difference between the two is that UNION will give you a distinct result set, whereas UNION ALL will return all the data.
SELECT 1 AS Col UNION SELECT 1

SELECT 1 AS Col UNION ALL SELECT 1
If you execute the two queries above, the UNION returns one row, whereas the UNION ALL returns two.
This is obviously a trivial example, but I have seen, on many occasions, the errors that can be caused by this issue. Consider these two queries that return the total sales (including archived) from the AdventureWorks database:
WITH cteUnion AS
(
    SELECT ReferenceOrderID, Quantity, ActualCost
      FROM Production.TransactionHistory
    UNION
    SELECT ReferenceOrderID, Quantity, ActualCost
      FROM Production.TransactionHistoryArchive
)
SELECT SUM(Quantity * ActualCost)
  FROM cteUnion
The result is $168,556,857.0589
WITH cteUnion AS
(
    SELECT ReferenceOrderID, Quantity, ActualCost
      FROM Production.TransactionHistory
    UNION ALL
    SELECT ReferenceOrderID, Quantity, ActualCost
      FROM Production.TransactionHistoryArchive
)
SELECT SUM(Quantity * ActualCost)
  FROM cteUnion
The result is $190,019,067.0956
If you had used UNION and not UNION ALL, you would have under-reported AdventureWorks' total sales by over $21 million!
Not only is the result wrong but, if you were to compare the execution plans, the UNION is also more expensive by a substantial margin, as the engine has to do a lot of work to create a distinct list of the data.
As I asserted in the previous article, there are very good reasons for using ITVFs:
Firstly, code re-use, something that SQL Server (and the SQL language in general) has historically been poor at, is greatly improved. Secondly, by using them in a statement with CROSS APPLY, code readability is also improved. They also help performance; a full discussion of that is out of scope here, but I would advise you to take a look at Simon Sabin's session 'High performance functions' from SQL Bits 6 at http://sqlbits.com/Sessions/Event6/High_performance_functions.
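The functions used in the procedure below (dbo.GetMinMaxOrderDates, dbo.GetTop4DueOrders, dbo.GetNextOrderDue and dbo.SumCustomerTotals) were built in the first article. As a reminder of the single-statement shape that makes an ITVF so cheap to inline, here is a sketch of roughly what dbo.SumCustomerTotals might look like; treat it as an illustration rather than the exact definition from part one:

-- Sketch of an inline table-valued function of the kind used below.
-- The real definition from the first article may differ slightly; the point is
-- the single RETURN (SELECT ...) shape, which the optimizer can expand inline.
CREATE FUNCTION dbo.SumCustomerTotals (@CustomerID INTEGER)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(TotalDue) AS CustomerTotalDue,
           SUM(CASE WHEN Status = 5 THEN TotalDue ELSE 0 END) AS ShippedTotal
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID
)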
The procedure below, GetSalesReport_Optimize2, demonstrates this and, since all our logic is now contained within functions, we have been able to remove the cursor as well.
CREATE PROCEDURE GetSalesReport_Optimize2
AS
DECLARE @Today DATE
SELECT @Today = '20040724' -- GETDATE()

CREATE TABLE #ReportResults
(
    CustomerId           INTEGER       NOT NULL,
    CustomerName         NVARCHAR(255) NOT NULL,
    TotalDue             MONEY NULL,
    ShippedTotal         MONEY NULL,
    OrderDate            DATE NULL,
    DueDate              DATE NULL,
    ShipDate             DATE NULL,
    CustomerMinOrderDate DATE NULL,
    CustomerMaxOrderDate DATE NULL,
    POrder1 VARCHAR(50) NULL, OrderTotal1 MONEY NULL,
    POrder2 VARCHAR(50) NULL, OrderTotal2 MONEY NULL,
    POrder3 VARCHAR(50) NULL, OrderTotal3 MONEY NULL,
    POrder4 VARCHAR(50) NULL, OrderTotal4 MONEY NULL
)

INSERT INTO #ReportResults
SELECT Customer.CustomerId,
       ISNULL(Store.Name, Contact.FirstName + ' ' + Contact.LastName),
       SumTotal.CustomerTotalDue,
       SumTotal.ShippedTotal,
       OrderDate, DueDate, ShipDate,
       MinMaxOrder.CustomerMinOrderDate,
       MinMaxOrder.CustomerMaxOrderDate,
       PurchOrders.POrder1, PurchOrders.TotalDue1,
       PurchOrders.POrder2, PurchOrders.TotalDue2,
       PurchOrders.POrder3, PurchOrders.TotalDue3,
       PurchOrders.POrder4, PurchOrders.TotalDue4
  FROM Sales.Customer
 OUTER APPLY dbo.GetMinMaxOrderDates(Customer.CustomerID) MinMaxOrder
 OUTER APPLY dbo.GetTop4DueOrders(Customer.CustomerID) PurchOrders
 OUTER APPLY dbo.GetNextOrderDue(Customer.CustomerID, @Today) OrderDue
 OUTER APPLY dbo.SumCustomerTotals(Customer.CustomerId) SumTotal
  LEFT JOIN Sales.Individual ON Individual.CustomerID = Customer.CustomerID
  LEFT JOIN Person.Contact   ON Contact.ContactID = Individual.ContactID
  LEFT JOIN Sales.Store      ON Store.CustomerID = Customer.CustomerID

SELECT * FROM #ReportResults
Version | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (ms) | % of Orig |
0 | 12,413 | 12,496 | 12,716 | 14,143 | 13,873 | 13,128 | |
1 | 10,496 | 10,643 | 10,656 | 11,340 | 11,540 | 10,935 | 83 |
2 | 2,170 | 2,230 | 2,296 | 2,340 | 5,120 | 2,831 | 22 |
You can see that we have dramatically improved performance: a drop from 13.1 seconds to 2.8 seconds, which is a whopping 78% improvement. The overhead was entirely due to using a cursor and the effect that procedural programming has on performance. If you were to examine the query plans now, you would find that there are multiple scans against the child tables. There is a general attitude of "scans are bad" but, in our case, since 100% of the data in those scans is required, the scans are quite effective. It is also pretty obvious that the temporary table is now superfluous and can be removed. This is what I have done in GetSalesReport_Optimize3:
CREATE PROCEDURE GetSalesReport_Optimize3
AS
DECLARE @Today DATE
SELECT @Today = '20040724' -- GETDATE()

SELECT Customer.CustomerId,
       ISNULL(Store.Name, Contact.FirstName + ' ' + Contact.LastName) AS CustomerName,
       SumTotal.CustomerTotalDue AS TotalDue,
       SumTotal.ShippedTotal AS ShippedTotal,
       OrderDate, DueDate, ShipDate,
       MinMaxOrder.CustomerMinOrderDate,
       MinMaxOrder.CustomerMaxOrderDate,
       PurchOrders.POrder1, PurchOrders.TotalDue1,
       PurchOrders.POrder2, PurchOrders.TotalDue2,
       PurchOrders.POrder3, PurchOrders.TotalDue3,
       PurchOrders.POrder4, PurchOrders.TotalDue4
  FROM Sales.Customer
 OUTER APPLY dbo.GetMinMaxOrderDates(Customer.CustomerID) MinMaxOrder
 OUTER APPLY dbo.GetTop4DueOrders(Customer.CustomerID) PurchOrders
 OUTER APPLY dbo.GetNextOrderDue(Customer.CustomerID, @Today) OrderDue
 OUTER APPLY dbo.SumCustomerTotals(Customer.CustomerId) SumTotal
  LEFT JOIN Sales.Individual ON Individual.CustomerID = Customer.CustomerID
  LEFT JOIN Person.Contact   ON Contact.ContactID = Individual.ContactID
  LEFT JOIN Sales.Store      ON Store.CustomerID = Customer.CustomerID
Version | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (ms) | % of Orig |
0 | 12,413 | 12,496 | 12,716 | 14,143 | 13,873 | 13,128 | |
1 | 10,496 | 10,643 | 10,656 | 11,340 | 11,540 | 10,935 | 83 |
2 | 2,170 | 2,230 | 2,296 | 2,340 | 5,120 | 2,831 | 22 |
3 | 2,076 | 2,083 | 2,123 | 2,126 | 2,620 | 2,206 | 17 |
Another small improvement over the previous version, but every little bit helps.
We are now getting to a point where we should be happy with the performance, but there are still some issues that can be resolved. If you look at the execution plan of GetSalesReport_Optimize3, you will notice that there are four scans of SalesOrderHeader occurring. Reading a page of data, even if it is in cache, is an expensive operation and should be minimized. Each of these four scans is looking at the same data, and that is generating an overhead that we should look at optimizing out.
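If you want to see those scans for yourself, the per-table reads can be confirmed by switching on I/O statistics before calling the procedure, for example:

-- Report scan counts and logical reads per table in the Messages pane,
-- making the repeated reads of SalesOrderHeader visible
SET STATISTICS IO ON;
EXEC GetSalesReport_Optimize3;
SET STATISTICS IO OFF;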
In the function below, I have merged the functionality of GetNextOrderDue, GetMinMaxOrderDates, GetTop4DueOrders and SumCustomerTotals together. In the CTE called cteOrders, I have used the ROW_NUMBER() ranking function to create some data that will allow us to 'pick' only the rows that are actually required, to emulate the logic that was contained within the GetNextOrderDue function.
The 'isDue' column, which tells us whether the order's due date is on or after today, is set to a 'Boolean' value: 1 when true, 0 when false. The same calculation is used as the PARTITION BY value for the 'RowN' column, so that we get an incrementing value in DueDate order for both 'due' and 'non-due' orders. If you have followed the logic, you will see that there will be only one row where isDue = 1 and RowN = 1, that being the 'earliest' row that is 'due'. Using this knowledge, we can use the technique of wrapping a CASE expression inside an aggregate function, like the pivot example before, to return the OrderDate, ShipDate and DueDate of that one individual row.
CREATE FUNCTION GetTotalSumMinMaxAndNextOrderDates (@CustomerID INTEGER, @Today DATE)
RETURNS TABLE
AS
RETURN
(
    WITH cteOrders AS
    (
        SELECT OrderDate, DueDate, ShipDate, TotalDue, Status, PurchaseOrderNumber,
               CASE WHEN DueDate >= @Today THEN 1 ELSE 0 END AS isDue,
               ROW_NUMBER() OVER (PARTITION BY CASE WHEN DueDate >= @Today THEN 1 ELSE 0 END
                                  ORDER BY DueDate) AS RowN,
               ROW_NUMBER() OVER (ORDER BY TotalDue DESC) AS RowNDue
          FROM Sales.SalesOrderHeader
         WHERE CustomerID = @CustomerID
    )
    SELECT MIN(OrderDate) AS CustomerMinOrderDate,
           MAX(OrderDate) AS CustomerMaxOrderDate,
           MIN(CASE WHEN isDue = 1 AND RowN = 1 THEN OrderDate END) AS OrderDate,
           MIN(CASE WHEN isDue = 1 AND RowN = 1 THEN DueDate END) AS DueDate,
           MIN(CASE WHEN isDue = 1 AND RowN = 1 THEN ShipDate END) AS ShipDate,
           SUM(TotalDue) AS CustomerTotalDue,
           SUM(CASE WHEN Status = 5 THEN TotalDue ELSE 0 END) AS ShippedTotal,
           MIN(CASE WHEN RowNDue = 1 THEN PurchaseOrderNumber END) AS POrder1,
           MIN(CASE WHEN RowNDue = 1 THEN TotalDue END) AS TotalDue1,
           MIN(CASE WHEN RowNDue = 2 THEN PurchaseOrderNumber END) AS POrder2,
           MIN(CASE WHEN RowNDue = 2 THEN TotalDue END) AS TotalDue2,
           MIN(CASE WHEN RowNDue = 3 THEN PurchaseOrderNumber END) AS POrder3,
           MIN(CASE WHEN RowNDue = 3 THEN TotalDue END) AS TotalDue3,
           MIN(CASE WHEN RowNDue = 4 THEN PurchaseOrderNumber END) AS POrder4,
           MIN(CASE WHEN RowNDue = 4 THEN TotalDue END) AS TotalDue4
      FROM cteOrders
)
Once the above function is used within the main procedure, we are left with a much cleaner procedure.
CREATE PROCEDURE GetSalesReport_Optimize4
AS
DECLARE @Today DATE
SELECT @Today = '20040724' -- GETDATE()

SELECT Customer.CustomerID,
       ISNULL(Store.Name, Person.FirstName + ' ' + Person.LastName) AS CustomerName,
       OrderDue.CustomerTotalDue AS TotalDue,
       OrderDue.ShippedTotal,
       OrderDate, DueDate, ShipDate,
       OrderDue.CustomerMinOrderDate,
       OrderDue.CustomerMaxOrderDate,
       OrderDue.POrder1, OrderDue.TotalDue1,
       OrderDue.POrder2, OrderDue.TotalDue2,
       OrderDue.POrder3, OrderDue.TotalDue3,
       OrderDue.POrder4, OrderDue.TotalDue4
  FROM Sales.Customer
 OUTER APPLY dbo.GetTotalSumMinMaxAndNextOrderDates(Customer.CustomerID, @Today) OrderDue
  LEFT JOIN Person.Person ON Person.BusinessEntityID = Customer.PersonID
  LEFT JOIN Sales.Store   ON Store.BusinessEntityID = Customer.StoreID
GO
And the timings are looking pretty clean too.
Version | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average (ms) | % of Orig |
0 | 12,413 | 12,496 | 12,716 | 14,143 | 13,873 | 13,128 | |
1 | 10,496 | 10,643 | 10,656 | 11,340 | 11,540 | 10,935 | 83 |
2 | 2,170 | 2,230 | 2,296 | 2,340 | 5,120 | 2,831 | 22 |
3 | 2,076 | 2,083 | 2,123 | 2,126 | 2,620 | 2,206 | 17 |
4 | 1,353 | 1,373 | 1,390 | 1,390 | 1,606 | 1,422 | 11 |
It is worth noting that, although this final step helped in this case, it is a classic case of "it depends". There is no guarantee that a single read of a table will always be more efficient than multiple index seeks at retrieving the data. Consider the following simple schema:
CREATE TABLE Customer
(
    CustomerId INTEGER NOT NULL PRIMARY KEY
)
GO

CREATE TABLE CustomerOrders
(
    CustomerId INTEGER NOT NULL,
    TotalDue   MONEY   NOT NULL
)
GO

CREATE INDEX idxCustomerOrders ON CustomerOrders (CustomerId, TotalDue)
and populate them with some sample random data:
INSERT CustomerOrders (CustomerId, TotalDue)
SELECT TOP (1000000)
       ABS(CHECKSUM(NEWID())) % 1000,
       ABS(CHECKSUM(NEWID())) % 10000
  FROM sys.columns a
 CROSS JOIN sys.columns b
 CROSS JOIN sys.databases c
GO

INSERT INTO Customer (CustomerId)
SELECT DISTINCT CustomerId
  FROM CustomerOrders
GO
You may safely assume that the fastest way to select the min and max TotalDue for each CustomerId would be to:
SELECT CustomerId, MIN(TotalDue), MAX(TotalDue)
  FROM CustomerOrders
 GROUP BY CustomerId
A single scan of the entire table is used, and each row in the table has been considered as a possible value for both the MIN and MAX aggregates.
Naturally, to resolve the query we are only interested in two of the one thousand(ish) rows per customer; in fact, we only need two thousand of our one million rows to answer the query.
We can drive SQL Server to look at the minimum amount of data required by rewriting the query as:
SELECT CustomerID,
       MinTotalDue.TotalDue,
       MaxTotalDue.TotalDue
  FROM Customer
 CROSS APPLY (SELECT TOP (1) TotalDue
                FROM CustomerOrders
               WHERE CustomerOrders.CustomerId = Customer.CustomerId
               ORDER BY TotalDue) AS MinTotalDue
 CROSS APPLY (SELECT TOP (1) TotalDue
                FROM CustomerOrders
               WHERE CustomerOrders.CustomerId = Customer.CustomerId
               ORDER BY TotalDue DESC) AS MaxTotalDue
After executing both queries and comparing the costs, the cross-applied query takes a third of the duration (200ms as opposed to the 600ms required for the MIN/MAX query).
This, at first, may seem counter-intuitive: not only are we reading from two tables, but we are hitting one of those tables twice. The reason for the improvement is that we can now fully exploit the index to provide us with the answer and therefore only look at two rows per customer, rather than having to pick the required values after scanning through all possible values.
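If you want to verify the comparison on your own system (the exact figures will, of course, vary), the CPU and elapsed times of the two approaches can be captured by running them with time statistics switched on:

-- Compare CPU and elapsed time of the GROUP BY and CROSS APPLY approaches
SET STATISTICS TIME ON;

SELECT CustomerId, MIN(TotalDue), MAX(TotalDue)
  FROM CustomerOrders
 GROUP BY CustomerId;

SELECT CustomerID,
       MinTotalDue.TotalDue,
       MaxTotalDue.TotalDue
  FROM Customer
 CROSS APPLY (SELECT TOP (1) TotalDue
                FROM CustomerOrders
               WHERE CustomerOrders.CustomerId = Customer.CustomerId
               ORDER BY TotalDue) AS MinTotalDue
 CROSS APPLY (SELECT TOP (1) TotalDue
                FROM CustomerOrders
               WHERE CustomerOrders.CustomerId = Customer.CustomerId
               ORDER BY TotalDue DESC) AS MaxTotalDue;

SET STATISTICS TIME OFF;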
I hope that, if nothing else, these articles have given you some food for thought and that you now have a few more ideas on how to optimize your own poorly performing code.
Many thanks must go to the SQL Server community in general, but more specifically to those who have given their free time to proof read these articles.