.. Is there something I should know, sang Duran Duran many moons ago when I was young and before I knew that query optimization was even a thing. But, it is quite a handy little line to keep in mind when you are tuning queries, giving SQL server a little bit more explicit information about your data can go a long way to removing the work that is done to guarantee something about the data.
The classic case for this is adding constraints, for example :
1 | select * from Production.Product where DaysToManufacture < 0 |
SQL Server doesn’t need to read any of the table data. This is because there is a constraint on the table guaranteeing that there are no rows <0 and therefore SQL Server can safely replace that table read with a constant scan.
So how else can this be exploited?
Well, im working on a large datawarehouse project and one of the issues we have had is poor query against the datawarehouse to process the cubes. Imagine we are using the query of :
1 2 3 4 5 6 | select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, factInternetSales.ProductKey,UnitPrice from dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey |
in a DSV. Now SSAS has a high propensity, I don’t know if it always does this, to wrap queries in a subselect and use DISTINCT. So our query would come something like this :
1 2 3 4 5 6 7 8 | Select distinct * from ( select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, factInternetSales.ProductKey,UnitPrice from dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey ) as Data |
The plan hasn’t changed, both look like this :
That is because the primary key of both tables are being returned and that guarantees the DISTINCT , if either key column is removed from the select then SQL Server will have to sort to make the DISTINCT.
1 2 3 4 5 6 7 8 9 10 | Select distinct * from ( select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, /* FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, */ factInternetSales.ProductKey,UnitPrice from dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey ) as Data option(maxdop 1) – Inhibit parallelism for simplicity |
Yes, we are asking SQL Server for two different sets of data, they are not logically equivalent queries, but sometimes the data is implicitly unique and still SQL server has to do work to guarantee that uniqueness.
Let us further imagine that Unknown members are dealt with buy setting the customer key to -1 in factInternetSales BUT the member -1 itself is not in the dimCustomer table and is dealt with by a view. This is quite a bad practice but that never stops this sort of stuff occurring in the real world, so bear with me..
1 2 3 4 5 6 7 | Drop view vwCustomer go Create View vwCustomer as Select CustomerKey,Title,FirstName,LastName from dimCustomer union Select -1,null,null,'Unknown' |
First thing to note is that UNION is used, UNION tries to guarantee a unique set of rows and therefore extra work will have to be employed by SQL Server to do that work even on a simple select * from vwCustomer. The recommendation would be to use UNION ALL, we implicitly know that the data will be unique so we can save SQL Server the bother..
1 2 3 4 5 6 7 | Drop view vwCustomer go Create View vwCustomer as Select CustomerKey,Title,FirstName,LastName from dimCustomer union all Select -1,null,null,'Unknown' |
Now we plumb that into our DSV…
1 2 3 4 5 | select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, factInternetSales.ProductKey,UnitPrice from vwCustomer dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey |
Which if we run in isolation is just peachy. However let us wrap this in DISTINCT as SSAS would:
1 2 3 4 5 6 7 8 9 | Select distinct * from ( select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, factInternetSales.ProductKey,UnitPrice from vwCustomer dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey ) as Data option(maxdop 1) |
Oh, we now have a sort. This will be REALLY heavy on a regular DW load, ordering millions of rows is not quick.
So, what is our solution ?, lets tell SQL Server what our implied knowledge of the data is to make it explicit.
1 2 3 4 5 6 7 8 | Drop view vwCustomer go Create View vwCustomer as Select CustomerKey,Title,FirstName,LastName from dimCustomer where CustomerKey >0 union all Select -1,null,null,'Unknown' |
In all intents and purposes adding CustomerKey >0 does nothing, there are no rows <0 nothing will be filtered and this is the exact point. This is implied knowledge, we have now explicitly given SQL Server that same knowledge.
So what happens when we use that view ?
Great no sorting
So that deals with this as a one-off ? Lets let SQL Server know that this is the same for all queries of the table not just for this view.
1 | alter table dimCustomer add Constraint ChkSk check(CustomerKey>0) |
We are now guaranteeing that all CustomerKeys are positive, so if we back up a bit and use the view definition of :
1 2 3 4 5 6 7 | Drop view vwCustomer go Create View vwCustomer as Select CustomerKey,Title,FirstName,LastName from dimCustomer union all Select -1,null,null,'Unknown' |
And then run :
1 2 3 4 5 6 7 8 9 10 | Select distinct * from ( select dimCustomer.CustomerKey,dimCustomer.Title,dimCustomer.FirstName,dimCustomer.LastName, FactInternetSales.SalesOrderNumber,FactInternetSales.SalesOrderLineNumber, factInternetSales.ProductKey,UnitPrice from vwCustomer dimCustomer join factInternetSales on dimCustomer.CustomerKey = factInternetSales.CustomerKey ) as Data option(maxdop 1) |
Once again, the DISTINCT is explicitly guaranteed by the schema, no extra operations are required by SQL Server to prove that J
So, next time you are query tuning, remember that adding a little extra ( and sometimes seemingly superfluous code) can go a long way.
To paraphrase a bit
Please please tell me now, is there something you can say, to make that operator so away….