We know we need to pay special attention to NULL values when writing queries, because they can lead us into incorrect query logic. NULLs can also affect the performance of the aggregate functions MAX() and MIN(). I have submitted this issue to Microsoft Connect. There are a number of ways to work around it, and I think an improvement could also be made at the query engine level. I hope this can be fixed in a future version of SQL Server, but for now we need to know about it and how to work around it.
First, let’s reproduce the issue.
use tempdb
go
if object_id('test') is not null drop table test
go
create table test(id int, data char(7000) not null default(replicate('a', 7000)))
go
create clustered index id on test(id)
go
insert into test(id) values(null)
go 100
insert into test(id) values(1)
go
select * from test -- return 101 rows
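Before moving on, here is an optional sanity check, a sketch using the sys.dm_db_index_physical_stats DMV (not part of the original demo), to confirm the one-row-per-page layout the next paragraph relies on:

-- Leaf level of the clustered index: page_count should be close to the
-- row count, since each 7,000-byte row fills a page by itself.
select index_type_desc, page_count, record_count
from sys.dm_db_index_physical_stats(db_id('tempdb'), object_id('test'), null, null, 'DETAILED')
where index_level = 0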
I now have a test table with a non-unique clustered index and 101 rows. Since each row is over 7,000 bytes, every row takes its own data page in SQL Server. Now let’s execute:
set statistics io on
go
select MAX(id) from test
--(1 row(s) affected)
--Table 'test'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The statistics show that only 2 logical reads take place for the execution. That’s great: there is no performance issue at all. Now let’s do the same for MIN:
select MIN(id) from test
--(1 row(s) affected)
--Table 'test'. Scan count 1, logical reads 103, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Notice the result of this query: 103 logical reads! That is significant: my table gets fully scanned. If the table had billions of rows, scanning it would be no fun at all; a full scan is exactly what we should eliminate here. Yet the query plans of the two statements are exactly the same: index scan + stream aggregate. So why is the performance so different?
This happens when there is a massive number of NULLs in the index. Let’s assume our index is in ascending order. While calculating the MAX value, SQL Server reads the root page of the index, navigates the B-tree down its right side, and returns the last value of the index, which is the maximum. It only needs to access two pages, root and leaf, to get the expected value. Although the query plan shows an index scan, SQL Server does not actually scan the entire table, since there is no need to.
While calculating the MIN value, SQL Server uses the same algorithm: it reads the root page and navigates the B-tree to find the minimum value on the left side of the tree. But before returning the minimum, SQL Server has to verify that the value it returns is not NULL, because aggregates ignore NULLs. To do so, it walks the leaf-level pages through the doubly linked page chain. In this case, the first non-NULL value sits at the far end of that chain, so SQL Server scans the entire index to reach it: the whole table is read.
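To make the layout concrete, here are two quick probe queries (illustrative additions, not part of the original demo). In an ascending index, NULLs sort before everything else, so they crowd the left end of the leaf chain:

-- The left end of the index is packed with the 100 NULL rows:
select top (3) id from test order by id asc -- NULL, NULL, NULL
-- The right end holds the non-NULL values, which is why MAX(id)
-- finds its answer immediately:
select top (1) id from test order by id desc -- 1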
Obviously, the query processor can’t use NULL as an implicit predicate to utilize the B-tree efficiently. To get around this issue, there are two kinds of approaches:
- Change existing algorithm
- Use existing algorithm but add additional index
Change existing algorithm
This one is simple: just add a WHERE clause to force the query engine to seek into the index. For instance:
select MIN(id) from test where id is not null
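With the explicit IS NOT NULL predicate, the optimizer can seek directly to the first non-NULL key instead of walking the leaf chain; re-running the query with SET STATISTICS IO ON should show the logical reads drop back to a handful, in line with the MAX query.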
Use existing algorithm but add additional index
This approach applies to environments where you are maintaining third-party software and changing the code would violate the service agreement.
- Change or add an index with the filter WHERE id IS NOT NULL (see the first sketch below).
- Create an indexed view filtered on WHERE id IS NOT NULL. The view’s indexes can be utilized even when your query does not reference the view (see the second sketch below).
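Hedged sketches of both options follow; the index and view names (ix_test_id_notnull, v_test_notnull) are illustrative choices of mine, not from the original post.

-- Option 1: a filtered nonclustered index (SQL Server 2008 and later).
-- It stores only the non-NULL keys, so the smallest non-NULL id sits
-- at its left edge and can be reached without reading any NULL rows.
create nonclustered index ix_test_id_notnull on test(id) where id is not null

-- Option 2: an indexed view that filters the NULLs out. Indexed views
-- must be schema-bound, and a view with GROUP BY needs COUNT_BIG(*)
-- before it can get a unique clustered index.
go
create view dbo.v_test_notnull
with schemabinding
as
select id, count_big(*) as row_cnt
from dbo.test
where id is not null
group by id
go
create unique clustered index ix_v_test_notnull on dbo.v_test_notnull(id)

One caveat worth noting: automatic matching of an indexed view to a query that never mentions the view is an Enterprise Edition feature; on other editions the query has to reference the view with the NOEXPAND hint, which takes us back to changing code.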
This was brought to you by John Huang, http://www.sqlnotes.info