Set-based operations mean you should put everything into a single statement, right?
Well, not really. People seem to think that having two queries is really bad, so when faced with logical gaps, they just cram them into the query they have. This is partly because SQL Server and T-SQL support letting you do this, and it’s partly because it looks like a logical extension of code reuse to arrive at a query structure that supports multiple logic chains. However, let’s explore what happens when you do this in one particular situation: a CASE statement in a GROUP BY clause.
You see this a lot because a given set of data may be needed in slightly different contexts by different groups within the company. Like many of my example queries, this could be better written. Like many of my example queries, it mirrors what I see in the wild (and for those following along at home, I’m using the WideWorldImporters database for tests now):
CREATE PROCEDURE dbo.InvoiceGrouping (@x INT)
AS
SELECT SUM(il.UnitPrice),
       COUNT(i.ContactPersonID),
       COUNT(i.AccountsPersonID),
       COUNT(i.SalespersonPersonID)
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE
             WHEN @x = 7 THEN i.ContactPersonID
             WHEN @x = 15 THEN i.AccountsPersonID
             ELSE i.SalespersonPersonID
         END;
GO
Running this for any of the values above (7, 15, or anything else), you’ll get the same execution plan, regardless of the column used in the GROUP BY. However, parameter sniffing is still something of a factor. When you group this data by SalespersonPersonID, you only get 10 rows back. That is what shows up as the estimated number of rows returned if the plan is compiled with some value other than 7 or 15 as the parameter. However, this is always the plan:
We can eliminate parameter sniffing from the equation, if we want to, by modifying the query like this:
CREATE PROCEDURE dbo.InvoiceGrouping_NoSniff (@x INT)
AS
DECLARE @x2 INT;
SET @x2 = @x;

SELECT SUM(il.UnitPrice),
       COUNT(i.ContactPersonID),
       COUNT(i.AccountsPersonID),
       COUNT(i.SalespersonPersonID)
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY CASE
             WHEN @x2 = 7 THEN i.ContactPersonID
             WHEN @x2 = 15 THEN i.AccountsPersonID
             ELSE i.SalespersonPersonID
         END;
GO
However, except for some deviation on the estimated rows (since it’s averaging the rows returned), the execution plan is the same.
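If you want to check this yourself, a minimal sketch follows. It assumes both procedures above already exist in WideWorldImporters and that you have the actual execution plan turned on in your client. The value 42 is just an arbitrary stand-in for "anything other than 7 or 15", and WITH RECOMPILE is only there so each call compiles against its own parameter value instead of reusing a cached plan.

```sql
-- Sketch only: compare plans and row estimates per parameter value.
-- WITH RECOMPILE forces a fresh compile so the sniffed value matches the call.
EXEC dbo.InvoiceGrouping @x = 7 WITH RECOMPILE;   -- groups by ContactPersonID
GO
EXEC dbo.InvoiceGrouping @x = 15 WITH RECOMPILE;  -- groups by AccountsPersonID
GO
EXEC dbo.InvoiceGrouping @x = 42 WITH RECOMPILE;  -- 42 is arbitrary: falls to ELSE, SalespersonPersonID
GO

-- The local-variable version hides the value from the optimizer,
-- so the estimate should not change with the parameter.
EXEC dbo.InvoiceGrouping_NoSniff @x = 7 WITH RECOMPILE;
GO
EXEC dbo.InvoiceGrouping_NoSniff @x = 42 WITH RECOMPILE;
GO
```

Compare the shape of the plans and the estimated rows on the aggregate between the calls.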
What’s the big deal, right? Well, let’s break down the code into three different procedures:
CREATE PROCEDURE dbo.InvoiceGrouping_Contact
AS
SELECT SUM(il.UnitPrice),
       COUNT(i.ContactPersonID),
       COUNT(i.AccountsPersonID),
       COUNT(i.SalespersonPersonID)
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.ContactPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Sales
AS
SELECT SUM(il.UnitPrice),
       COUNT(i.ContactPersonID),
       COUNT(i.AccountsPersonID),
       COUNT(i.SalespersonPersonID)
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.SalespersonPersonID;
GO

CREATE PROCEDURE dbo.InvoiceGrouping_Account
AS
SELECT SUM(il.UnitPrice),
       COUNT(i.ContactPersonID),
       COUNT(i.AccountsPersonID),
       COUNT(i.SalespersonPersonID)
FROM Sales.Invoices AS i
    JOIN Sales.InvoiceLines AS il
        ON il.InvoiceID = i.InvoiceID
GROUP BY i.AccountsPersonID;
GO
Interestingly enough, these three queries produce nearly identical execution plans. The one big difference is that the Compute Scalar operator, which was used to generate a value for the Hash Match Aggregate, is no longer in the plan:
The same basic set of structures, scans against both tables, to arrive at the data. Cost estimates between the two plans are very different though, with the targeted queries having a much lower estimated cost.
Performance-wise, interestingly enough, the first query, returning only the 10 rows, averages 157ms, while the query grouping directly on SalespersonPersonID averages about 190ms. Now, the reads tell a slightly different story, with 17428 on the generic query and 5721 on the specific query. So, maybe a server under load will see a significant performance increase. However, let’s deal with what we have in front of us and say that, at least for these tests, the catch-all GROUP BY query performs well.
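One way to gather comparable numbers is sketched below. This is an assumption about method, not a description of how the figures above were captured: it uses SET STATISTICS IO and SET STATISTICS TIME, which add some overhead of their own, and the value 42 is again just an arbitrary parameter that lands in the ELSE branch. Your durations and reads will differ with hardware, data, and cache state.

```sql
-- Sketch only: report reads and elapsed time in the Messages output.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

-- Catch-all procedure, grouping by SalespersonPersonID via the ELSE branch.
EXEC dbo.InvoiceGrouping @x = 42;
GO

-- Targeted procedure, grouping by SalespersonPersonID directly.
EXEC dbo.InvoiceGrouping_Sales;
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

Run each call several times and average the results, since a single execution can be skewed by compilation or a cold cache.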
Now let’s change the paradigm slightly. Let’s add an index:
CREATE INDEX TestingGroupBy ON Sales.Invoices (SalespersonPersonID);
Frankly, this isn’t a very useful index. However, after adding it, the execution plan for the InvoiceGrouping_Sales query changes. Instead of scanning the table, it’s now scanning the index. Despite recompiles and attempts to force it using hints, the original InvoiceGrouping query will not use this index. Duration of the InvoiceGrouping_Sales query drops to 140ms on average and the reads drop a little further to 5021. Compared to the catch-all query’s 157ms, that’s roughly an 11% improvement in duration, and that’s a win.
This is a pretty simplified example. However, making the CASE statement more complex won’t improve performance or help the optimizer make good choices. Instead of trying to cram multiple different logical groupings into a single query, a better approach would be to create the three new procedures that I did above, and turn the original InvoiceGrouping procedure into a wrapper procedure that chooses which of the individual procedures to call. This way, if you do add indexes in support of each of the different possible groupings, you would realize a positive outcome in your performance.
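A minimal sketch of what that wrapper might look like, assuming the original dbo.InvoiceGrouping and the three targeted procedures already exist as defined earlier. The branch order simply mirrors the original CASE: 7 goes to the contact grouping, 15 to the accounts grouping, and everything else to the salesperson grouping.

```sql
-- Sketch only: turn the catch-all procedure into a dispatcher.
-- Each branch calls a procedure with its own plan, so an index that
-- supports one grouping can be used without affecting the others.
ALTER PROCEDURE dbo.InvoiceGrouping (@x INT)
AS
BEGIN
    IF @x = 7
        EXEC dbo.InvoiceGrouping_Contact;
    ELSE IF @x = 15
        EXEC dbo.InvoiceGrouping_Account;
    ELSE
        EXEC dbo.InvoiceGrouping_Sales;
END;
GO
```

Callers keep using the same name and parameter, for example EXEC dbo.InvoiceGrouping @x = 15, so nothing upstream has to change.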
Want to talk more about execution plans and query tuning? In August, I’ll be doing an all-day pre-con at SQLServer Geeks Annual Summit in Bangalore, India.
I’m also going to be in Oslo, Norway for a pre-con before SQL Saturday Oslo in September.
The post CASE Statement in GROUP BY appeared first on Home Of The Scary DBA.