March 24, 2011 at 1:15 pm
Hi,
In order to speed up a "select distinct col1" from a 150M-records-table, I created an index on that column (varchar(3)).
I was surprised to see that the select distinct on that now-indexed-column was still taking 20sec (which is better than the almost 10min without the index but is still disappointing).
My understanding of b-tree structure of indexes was that it would held on the highest page level the list of the distinct values along with the pointers to the lower level pages, and on those lower level pages the location in the table of each and every occurence of that value.
Because in this situation I have a very low number of distinct values (21), I believed that SQL Server would only need to read the very first page of the index to be able to provide me with the distinct values and that it would therefore be instant.
I'm interested in understanding why it is not.
Regards,
Florent
March 24, 2011 at 1:27 pm
If you do some research into B-tree's you will find that not every distinct key value exists in the index levels - except at the leaf page level. If you look at the query plan my guess is that it is now doing an index scan rather than a table/clustered index scan.
jg
March 24, 2011 at 2:05 pm
What type of index did you create (clustered or non-clustered)? If it's the clustered index you still perform a table scan...
As a side note: why do you use VARCHAR(3) instead of CHAR(3)? Remember, VARCHAR needs two additional bytes... I'd expect a column of CHAR(3) would lead to a more narrow index that will be searched faster.
March 24, 2011 at 3:58 pm
Florent Rousseau (3/24/2011)
My understanding of b-tree structure of indexes was that it would held on the highest page level the list of the distinct values along with the pointers to the lower level pages, and on those lower level pages the location in the table of each and every occurence of that value.
A b-tree has every value at the leaf level. The next level up contains the starting value for each page in the level below. Repeat that until there is only one page. http://www.sqlservercentral.com/articles/Indexing/68439/ Nowhere does it store the distinct values.
A distinct would require an index scan and an aggregate operation to remove the duplicated. Probably stream aggregate if the index is appropriate, but it's still an aggregation that has to be done.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply