When presented with a poorly performing stored procedure, there may be a single statement that needs to be optimized. More often than not, though, this is not the case. There will be many statements, each of which may perform well in isolation; the issue is the number of times that they are executed. It’s a “death by a thousand cuts” scenario.
SQL is a language designed to work with large datasets, and SQL Server works best when given instructions that exploit that. Procedural programming techniques such as cursors, looping, multi-line table valued functions (not to be confused with inline table valued functions), and scalar UDFs can limit the efficiency of a stored procedure. If you are able to work around these, the results can be astounding.
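As a hypothetical illustration of the scalar UDF point (the function names and discount logic here are invented for the example, not part of AdventureWorks), compare a scalar UDF with its inline table valued equivalent:

```sql
-- Scalar UDF: invoked once per qualifying row, and its cost is largely
-- hidden from the optimizer.
CREATE FUNCTION dbo.DiscountedTotal_Scalar(@TotalDue MONEY)
RETURNS MONEY
AS
BEGIN
    RETURN @TotalDue * 0.90
END
GO

-- Inline TVF: a single expression that the optimizer can expand into the
-- calling query, much like a parameterised view.
CREATE FUNCTION dbo.DiscountedTotal_Inline(@TotalDue MONEY)
RETURNS TABLE
AS
RETURN
(
    SELECT @TotalDue * 0.90 AS DiscountedTotal
)
GO
```

The scalar version executes row by row; the inline version can be folded into the surrounding plan, for example via `CROSS APPLY dbo.DiscountedTotal_Inline(TotalDue)`.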
Here, we have a relatively simple stored procedure that produces a sales report from the AdventureWorks database. I have created it to echo the common issues that I see posted on various forums. The code is designed to run against the 2008R2 release of AdventureWorks, which can be obtained from http://msftdbprodsamples.codeplex.com/releases/view/55926; the executable file name is AdventureWorks2008R2_SR1.exe.
CREATE PROCEDURE GetSalesReport
AS
DECLARE @SalesOrderID INTEGER
DECLARE @Status TINYINT
DECLARE @OrderDate DATE, @DueDate DATE, @ShipDate DATE
DECLARE @CustomerID INTEGER
DECLARE @CustomerMinOrderDate DATE
DECLARE @CustomerMaxOrderDate DATE
DECLARE @OrderCount INTEGER
DECLARE @OrderTotal1 MONEY, @POrder1 VARCHAR(50)
DECLARE @OrderTotal2 MONEY, @POrder2 VARCHAR(50)
DECLARE @OrderTotal3 MONEY, @POrder3 VARCHAR(50)
DECLARE @OrderTotal4 MONEY, @POrder4 VARCHAR(50)
DECLARE @TotalDue MONEY, @POrder VARCHAR(50)
DECLARE @ShippedTotal MONEY
DECLARE @CustomerTotalDue MONEY
DECLARE @CustomerName NVARCHAR(255)
DECLARE @CustomerType NCHAR(1)
DECLARE @Today DATE

SELECT @Today = '20040724' -- GETDATE() would normally be used here.

CREATE TABLE #ReportResults
(
    CustomerId           INTEGER NOT NULL,
    CustomerName         NVARCHAR(255) NOT NULL,
    TotalDue             MONEY,
    ShippedTotal         MONEY,
    OrderDate            DATE NULL,
    DueDate              DATE NULL,
    ShipDate             DATE NULL,
    CustomerMinOrderDate DATE NULL,
    CustomerMaxOrderDate DATE NULL,
    POrder1 VARCHAR(50) NULL, OrderTotal1 MONEY NULL,
    POrder2 VARCHAR(50) NULL, OrderTotal2 MONEY NULL,
    POrder3 VARCHAR(50) NULL, OrderTotal3 MONEY NULL,
    POrder4 VARCHAR(50) NULL, OrderTotal4 MONEY NULL
)

DECLARE CustomerCur CURSOR STATIC READ_ONLY FOR
    SELECT CustomerId, CustomerType
      FROM Sales.Customer

OPEN CustomerCur
WHILE (0 = 0)
BEGIN
    FETCH NEXT FROM CustomerCur INTO @CustomerId, @CustomerType
    IF (@@FETCH_STATUS <> 0) BREAK

    SELECT @OrderCount = 1
    SELECT @OrderDate = NULL, @DueDate = NULL, @ShipDate = NULL

    IF (@CustomerType = 'I')
    BEGIN /* Individual */
        SELECT @CustomerName = FirstName + ' ' + LastName
          FROM Person.Contact
          JOIN Sales.Individual ON Contact.ContactID = Individual.ContactID
         WHERE Individual.CustomerID = @CustomerId
    END
    ELSE
    BEGIN /* Store */
        SELECT @CustomerName = Name
          FROM Sales.Store
         WHERE CustomerID = @CustomerId
    END

    SELECT TOP(1) @OrderDate = OrderDate,
                  @DueDate   = DueDate,
                  @ShipDate  = ShipDate
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID
       AND DueDate >= @Today
     ORDER BY DueDate

    SELECT @CustomerMinOrderDate = MIN(OrderDate),
           @CustomerMaxOrderDate = MAX(OrderDate)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID

    SELECT @CustomerTotalDue = SUM(TotalDue)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID

    SELECT @ShippedTotal = SUM(TotalDue)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID
       AND Status = 5

    SELECT @OrderTotal1 = NULL, @POrder1 = NULL,
           @OrderTotal2 = NULL, @POrder2 = NULL,
           @OrderTotal3 = NULL, @POrder3 = NULL,
           @OrderTotal4 = NULL, @POrder4 = NULL

    SELECT @OrderCount = 1

    DECLARE PrevSales CURSOR STATIC READ_ONLY FOR
        SELECT PurchaseOrderNumber, TotalDue
          FROM Sales.SalesOrderHeader
         WHERE CustomerId = @CustomerID
         ORDER BY TotalDue DESC

    OPEN PrevSales
    FETCH NEXT FROM PrevSales INTO @POrder, @TotalDue
    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        IF (@OrderCount = 1) BEGIN SELECT @OrderTotal1 = @TotalDue, @POrder1 = @POrder END
        IF (@OrderCount = 2) BEGIN SELECT @OrderTotal2 = @TotalDue, @POrder2 = @POrder END
        IF (@OrderCount = 3) BEGIN SELECT @OrderTotal3 = @TotalDue, @POrder3 = @POrder END
        IF (@OrderCount = 4) BEGIN SELECT @OrderTotal4 = @TotalDue, @POrder4 = @POrder END
        FETCH NEXT FROM PrevSales INTO @POrder, @TotalDue
        SELECT @OrderCount = @OrderCount + 1
    END
    CLOSE PrevSales
    DEALLOCATE PrevSales

    INSERT INTO #ReportResults (CustomerId, CustomerName,
                                TotalDue, ShippedTotal,
                                OrderDate, DueDate, ShipDate,
                                CustomerMinOrderDate, CustomerMaxOrderDate,
                                POrder1, POrder2, POrder3, POrder4,
                                OrderTotal1, OrderTotal2, OrderTotal3, OrderTotal4)
    VALUES (@CustomerId, @CustomerName,
            @CustomerTotalDue, @ShippedTotal,
            @OrderDate, @DueDate, @ShipDate,
            @CustomerMinOrderDate, @CustomerMaxOrderDate,
            @POrder1, @POrder2, @POrder3, @POrder4,
            @OrderTotal1, @OrderTotal2, @OrderTotal3, @OrderTotal4)
END
CLOSE CustomerCur
DEALLOCATE CustomerCur

SELECT * FROM #ReportResults
Here are the timings, in milliseconds, of 5 runs of the above procedure.
Procedure Version | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average Ms |
0 | 12,413 | 12,496 | 12,716 | 14,143 | 13,873 | 13,128 |
If we were to analyse each individual statement here, we would conclude that we could potentially add an index or two to SalesOrderHeader to support the ORDER BY clauses and to remove bookmark lookups, saving some resources there.
But, is there anything else that can be done? As stated before, this routine is simply doing far too much work.
In a procedural language such as C#, VB, or C, this kind of looping process is the simplest way for humans to relate multiple sets of data, and it is ingrained into our ‘developmental’ thinking very early in our careers. There is nothing wrong with this per se, but within SQL we have much bigger and better tools readily available to relate datasets, and we have to recode our SQL into a state that will enable the optimizer to fully utilize (or even consider) its full range of functionality. Dare I quote Yoda? “You must unlearn what you have learned.” It’s a bit of a cliché, but a statement that applies very well here.
By using a procedural technique, not only is the order of the data operations hard coded, but so is the data-access method that SQL Server will use. Every operation in the code example is effectively a nested loop operation, and nested loops are generally only effective for small datasets, not large ones. ‘Small’ and ‘large’ are relative terms; rather than a strict number of rows being defined as large or small, we are talking about the relationships between tables. If we are joining 100% of two tables together, then a hash or merge join would be more appropriate. Not only that, but parallelism cannot be considered as a possible cost/time saving measure when such looping comes into play.
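As a minimal sketch of the difference (using the same Sales.Customer and Sales.SalesOrderHeader tables, with variable handling trimmed for brevity), here is a per-customer cursor loop and its set-based equivalent. The set-based form leaves the join strategy — nested loop, hash, or merge — and any parallelism up to the optimizer:

```sql
-- Procedural: one query per customer, hard coding a nested-loop style
-- access path. (Sketch only.)
DECLARE @CustomerID INTEGER, @Total MONEY
DECLARE Cur CURSOR STATIC READ_ONLY FOR
    SELECT CustomerId FROM Sales.Customer
OPEN Cur
FETCH NEXT FROM Cur INTO @CustomerID
WHILE (@@FETCH_STATUS = 0)
BEGIN
    SELECT @Total = SUM(TotalDue)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID
    FETCH NEXT FROM Cur INTO @CustomerID
END
CLOSE Cur
DEALLOCATE Cur

-- Set-based: one statement; the optimizer chooses the join type and may
-- parallelise the plan.
SELECT Customer.CustomerId,
       SUM(SalesOrderHeader.TotalDue) AS Total
  FROM Sales.Customer
  LEFT JOIN Sales.SalesOrderHeader
    ON SalesOrderHeader.CustomerID = Customer.CustomerId
 GROUP BY Customer.CustomerId
```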
I would visualise this as running a haulage company and only carrying one package in your van at any one time. To run an efficient operation you need to fully load your van AND plan an efficient route around your drop-off/collection points. If you don’t, valuable time and money will be wasted.
So, starting from the inside and working out, let’s see how much of the code we can change or replace to create a better-performing set-based solution.
The first piece of code to be addressed is:
SELECT @CustomerTotalDue = SUM(TotalDue)
  FROM Sales.SalesOrderHeader
 WHERE CustomerID = @CustomerID

SELECT @ShippedTotal = SUM(TotalDue)
  FROM Sales.SalesOrderHeader
 WHERE CustomerID = @CustomerID
   AND Status = 5
What we have here are two very similar pieces of code reading from the same table, Sales.SalesOrderHeader, the only difference being the “Status = 5” filter used to sum the TotalDue of the orders that have shipped.
Note that at present all the orders in the table have a status of 5, so that “SELECT COUNT(*) FROM Sales.SalesOrderHeader” and “SELECT COUNT(*) FROM Sales.SalesOrderHeader WHERE Status = 5” give the same result. This means that @CustomerTotalDue and @ShippedTotal will have the same value.
I’m of the opinion that a piece of data should be read only once, and that every read should be for a reason. In the code snippets above, we are instigating two reads of the data when one would be perfectly adequate. Generally speaking, we should be concerned with minimizing the amount of data read, as THAT is the most expensive operation in this code, even if the data is held within cache. To minimise the IO used by the queries, we shall combine the two SUM operations into a single query. Instead of filtering the rows where Status = 5 in the WHERE clause, a CASE expression will be used inside the SUM aggregate to meet the same need. Additionally, for reasons that we shall see at a later point, we will create an inline table valued function (ITVF).
CREATE FUNCTION SumCustomerTotals(@CustomerId INTEGER)
RETURNS TABLE
AS
RETURN
(
    SELECT CustomerTotalDue = SUM(TotalDue),
           ShippedTotal     = SUM(CASE WHEN Status = 5 THEN TotalDue ELSE 0 END)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerId
)
Imagine the scenario where this is the only code executed: we have doubled the performance of our system in one easy stroke.
Using a CASE expression within an aggregate can be a shortcut for multiple COUNT aggregates as well. Consider the scenario where we wish to count the total number of rows for a specific customer, and the number of rows for that same customer that meet a certain condition, say TotalDue >= 50. Traditionally we would write code such as:
SELECT CustomerID, COUNT(*) AS TotalCount
  FROM Sales.SalesOrderHeader
 WHERE CustomerID = 11185
 GROUP BY CustomerID

SELECT CustomerID, COUNT(*) AS HighCost
  FROM Sales.SalesOrderHeader
 WHERE CustomerID = 11185
   AND TotalDue >= 50
 GROUP BY CustomerID
Once again, two operations are used where the CASE expression can help combine them into one. By using a CASE expression that returns 1 when the condition is true and 0 when it is false, and wrapping that within a SUM aggregate, we can bring the two separate table lookups together into one.
SELECT CustomerID,
       COUNT(*) AS TotalCount,
       SUM(CASE WHEN TotalDue >= 50 THEN 1 ELSE 0 END) AS HighCost
  FROM Sales.SalesOrderHeader
 WHERE CustomerID = 11185
 GROUP BY CustomerID
But, I digress.
To execute the SumCustomerTotals function that we built previously, call it like this:
SELECT @CustomerTotalDue = CustomerTotalDue,
       @ShippedTotal     = ShippedTotal
  FROM dbo.SumCustomerTotals(@CustomerID)
A similar technique can be applied to the PrevSales cursor loop. Here, we wish to assign the purchase order numbers and total amounts of the four orders with the highest TotalDue to four pairs of variables.
DECLARE PrevSales CURSOR STATIC READ_ONLY FOR
    SELECT PurchaseOrderNumber, TotalDue
      FROM Sales.SalesOrderHeader
     WHERE CustomerId = @CustomerID
     ORDER BY TotalDue DESC

OPEN PrevSales
FETCH NEXT FROM PrevSales INTO @POrder, @TotalDue
WHILE (@@FETCH_STATUS = 0)
BEGIN
    IF (@OrderCount = 1) BEGIN SELECT @OrderTotal1 = @TotalDue, @POrder1 = @POrder END
    IF (@OrderCount = 2) BEGIN SELECT @OrderTotal2 = @TotalDue, @POrder2 = @POrder END
    IF (@OrderCount = 3) BEGIN SELECT @OrderTotal3 = @TotalDue, @POrder3 = @POrder END
    IF (@OrderCount = 4) BEGIN SELECT @OrderTotal4 = @TotalDue, @POrder4 = @POrder END
    FETCH NEXT FROM PrevSales INTO @POrder, @TotalDue
    SELECT @OrderCount = @OrderCount + 1
END
CLOSE PrevSales
DEALLOCATE PrevSales
What the user has attempted to achieve is a pivot operation. Personally, I have found the PIVOT operator itself quite cumbersome, not to mention that a single PIVOT statement does not support multiple pivoted columns, and so I prefer a “manual pivot”, also known as a “cross tab”.
First, we need a value to pivot by. If the data for such a value does not already exist within the table(s) we are pivoting, as it does not in this case, we can create it by using the ROW_NUMBER() ranking function. We can then use this new value in a CASE expression to assign a specific value to a specific column, and then use the MIN aggregate function to collapse the result into a single row to be assigned to the variables. Again, we shall create an inline table valued function:
CREATE FUNCTION GetTop4DueOrders(@CustomerId INTEGER)
RETURNS TABLE
AS
RETURN
(
    WITH cteTop4 AS
    (
        SELECT TOP(4) PurchaseOrderNumber,
                      TotalDue,
                      ROW_NUMBER() OVER (ORDER BY TotalDue DESC) AS RowN
          FROM Sales.SalesOrderHeader
         WHERE CustomerId = @CustomerId
         ORDER BY TotalDue DESC
    )
    SELECT MIN(CASE WHEN RowN = 1 THEN PurchaseOrderNumber END) AS POrder1,
           MIN(CASE WHEN RowN = 1 THEN TotalDue END)            AS TotalDue1,
           MIN(CASE WHEN RowN = 2 THEN PurchaseOrderNumber END) AS POrder2,
           MIN(CASE WHEN RowN = 2 THEN TotalDue END)            AS TotalDue2,
           MIN(CASE WHEN RowN = 3 THEN PurchaseOrderNumber END) AS POrder3,
           MIN(CASE WHEN RowN = 3 THEN TotalDue END)            AS TotalDue3,
           MIN(CASE WHEN RowN = 4 THEN PurchaseOrderNumber END) AS POrder4,
           MIN(CASE WHEN RowN = 4 THEN TotalDue END)            AS TotalDue4
      FROM cteTop4
)
GO
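As a quick sanity check, the function can be called directly with a literal customer id (11185 is the customer used in the earlier COUNT example); a single row with the four pivoted order/total pairs should come back:

```sql
SELECT POrder1, TotalDue1,
       POrder2, TotalDue2,
       POrder3, TotalDue3,
       POrder4, TotalDue4
  FROM dbo.GetTop4DueOrders(11185)
```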
The statement that gets the min and max order dates for the customer can, for the time being, simply be wrapped inside an ITVF as well.
CREATE FUNCTION GetMinMaxOrderDates(@CustomerID INTEGER)
RETURNS TABLE
AS
RETURN
(
    SELECT CustomerMinOrderDate = MIN(OrderDate),
           CustomerMaxOrderDate = MAX(OrderDate)
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerID
)

Once these functions are folded into the procedure, we are left with:

CREATE PROCEDURE GetSalesReport_Optimize1
AS
DECLARE @OrderDate DATE, @DueDate DATE, @ShipDate DATE
DECLARE @CustomerID INTEGER
DECLARE @CustomerMinOrderDate DATE
DECLARE @CustomerMaxOrderDate DATE
DECLARE @CustomerName NVARCHAR(255)
DECLARE @CustomerType NCHAR(1)
DECLARE @OrderCount INTEGER
DECLARE @OrderTotal1 MONEY, @POrder1 VARCHAR(50)
DECLARE @OrderTotal2 MONEY, @POrder2 VARCHAR(50)
DECLARE @OrderTotal3 MONEY, @POrder3 VARCHAR(50)
DECLARE @OrderTotal4 MONEY, @POrder4 VARCHAR(50)
DECLARE @TotalDue MONEY, @POrder VARCHAR(50)
DECLARE @CustomerTotalDue MONEY, @ShippedTotal MONEY
DECLARE @Today DATE

SELECT @Today = '20040724' -- GETDATE()

CREATE TABLE #ReportResults
(
    CustomerId           INTEGER NOT NULL,
    CustomerName         NVARCHAR(255) NOT NULL,
    TotalDue             MONEY NULL,
    ShippedTotal         MONEY NULL,
    OrderDate            DATE NULL,
    DueDate              DATE NULL,
    ShipDate             DATE NULL,
    CustomerMinOrderDate DATE NULL,
    CustomerMaxOrderDate DATE NULL,
    POrder1 VARCHAR(50) NULL, OrderTotal1 MONEY NULL,
    POrder2 VARCHAR(50) NULL, OrderTotal2 MONEY NULL,
    POrder3 VARCHAR(50) NULL, OrderTotal3 MONEY NULL,
    POrder4 VARCHAR(50) NULL, OrderTotal4 MONEY NULL
)

DECLARE CustomerCur CURSOR FOR
    SELECT Customer.CustomerId, CustomerType
      FROM Sales.Customer

OPEN CustomerCur
WHILE (0 = 0)
BEGIN
    FETCH NEXT FROM CustomerCur INTO @CustomerId, @CustomerType
    IF (@@FETCH_STATUS <> 0) BREAK

    SELECT @OrderDate = NULL, @DueDate = NULL, @ShipDate = NULL

    IF (@CustomerType = 'I')
    BEGIN /* Individual */
        SELECT @CustomerName = FirstName + ' ' + LastName
          FROM Person.Contact
          JOIN Sales.Individual ON Contact.ContactID = Individual.ContactID
         WHERE Individual.CustomerID = @CustomerId
    END
    ELSE
    BEGIN /* Store */
        SELECT @CustomerName = Name
          FROM Sales.Store
         WHERE CustomerID = @CustomerId
    END

    SELECT @OrderDate = OrderDate,
           @DueDate   = DueDate,
           @ShipDate  = ShipDate
      FROM dbo.GetNextOrderDue(@CustomerId, @Today)

    SELECT @CustomerMinOrderDate = CustomerMinOrderDate,
           @CustomerMaxOrderDate = CustomerMaxOrderDate
      FROM dbo.GetMinMaxOrderDates(@CustomerID)

    SELECT @OrderTotal1 = TotalDue1, @POrder1 = POrder1,
           @OrderTotal2 = TotalDue2, @POrder2 = POrder2,
           @OrderTotal3 = TotalDue3, @POrder3 = POrder3,
           @OrderTotal4 = TotalDue4, @POrder4 = POrder4
      FROM dbo.GetTop4DueOrders(@CustomerID) PurchOrders

    SELECT @CustomerTotalDue = CustomerTotalDue,
           @ShippedTotal     = ShippedTotal
      FROM dbo.SumCustomerTotals(@CustomerId)

    INSERT INTO #ReportResults (CustomerId, CustomerName, TotalDue, ShippedTotal,
                                OrderDate, DueDate, ShipDate,
                                CustomerMinOrderDate, CustomerMaxOrderDate,
                                POrder1, POrder2, POrder3, POrder4,
                                OrderTotal1, OrderTotal2, OrderTotal3, OrderTotal4)
    VALUES (@CustomerId, @CustomerName, @CustomerTotalDue, @ShippedTotal,
            @OrderDate, @DueDate, @ShipDate,
            @CustomerMinOrderDate, @CustomerMaxOrderDate,
            @POrder1, @POrder2, @POrder3, @POrder4,
            @OrderTotal1, @OrderTotal2, @OrderTotal3, @OrderTotal4)
END
CLOSE CustomerCur
DEALLOCATE CustomerCur

SELECT * FROM #ReportResults
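The optimized procedure calls dbo.GetNextOrderDue, whose definition is not shown in this instalment. Reconstructed from the original TOP(1) “next due order” query, it would look something like this (an assumption on my part, following the same ITVF pattern as the other functions):

```sql
CREATE FUNCTION GetNextOrderDue(@CustomerId INTEGER, @Today DATE)
RETURNS TABLE
AS
RETURN
(
    -- The earliest order due on or after @Today for the given customer.
    SELECT TOP(1) OrderDate, DueDate, ShipDate
      FROM Sales.SalesOrderHeader
     WHERE CustomerID = @CustomerId
       AND DueDate >= @Today
     ORDER BY DueDate
)
```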
Now things are starting to look a bit cleaner. Let’s see how the timings are shaping up:
Version | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Average Ms | % of Orig |
0 | 12,413 | 12,496 | 12,716 | 14,143 | 13,873 | 13,128 | |
1 | 10,496 | 10,643 | 10,656 | 11,340 | 11,540 | 10,935 | 83% |
Already a noticeable 17% drop in execution time.
That’s as far as I want to take it for the first half of this series. Next time we shall continue the process of unpicking the report and creating alternative code to improve performance.
Many thanks must go to the SQL Server community in general, but more specifically to those who have given their free time to proofread these articles.