Is using Included columns in an index the best choice?

  • Hi

    I am just trying to get my head around a few concepts when it comes to indexing, in particular covering indexes and included columns...

    For the query below, I have created a non-clustered index on the ReceivedDate field with no included columns. Aside from the clustered index (a single key which is not being selected or filtered at all by this query), this is the only index on the CallDetails_backup table, and running the query is producing a clustered index scan as per the attachment CIScan.jpg.

    What I am trying to acheive is a non-clustered index seek based upon the ReceivedDate field.

    I'm assuming that SQL Server has opted for a clustered index scan as there is insufficient information in the non-clustered index pages to retrieve all of the columns. Presumably I could turn this into a non-clustered index seek by including all of the other columns being returned by the query (as included columns), however what if my query is outputting 100+ fields? Should I be including them all? Would this actually be more efficient than a clustered index scan? Or in this case is a clustered index scan the most efficient way?

    I was also wondering if perhaps this is an 'index tipping point' issue - the query is returning 711 rows out of about 60,000, so given the number of rows involved, has SQL Server just decided that an index scan is best?

    Any input appreciated.

    Thanks

    DECLARE

    @StartDate datetime = '01/01/2011'

    ,@FinishDate datetime = '02/01/2011'

    SELECT

    cd.CallNo,cd.CleoCallID,cd.ReceivedDate,cd.CurrentPostCode,cd.ApptStartDate,cd.ApptEndDate,cd.CreatedBy --...up to 100 fields

    FROM

    dbo.CallDetails_backup cd

    WHERE

    cd.ReceivedDate >= @StartDate and cd.ReceivedDate <= @FinishDate

    and cd.Classification <> 'Duplicate'

    and cd.[Service] = 'OOH'

    and cd.SentToAdastra is null

    and cd.CompletionDate is not null

  • I don't necessarily think this is an indexing problem. You want to try and keep the number of included columns to a minimum. Looking at your image, this appears to be a problem with your statistics. The query optimizer is estimating that 2,591 rows will be returned and that is why it is choosing the scan. Try updating your statistics with a full scan.

  • At 1% of the data you should be able to get a seek (not garanteed, bu very probable).

    Including all the columns of the table would mean that you have 2 copies of the same data which gets very expansive in the hds and dmls (I/U/D).

    In this particular case I'd do it this way :

    Nonclustered :

    ReceivedDate, Service, SentToAdastra INCLUDE (CompletionDate, Classification )

    You could swap the service and SentToAdastra columns depending on whichever is more selective (that should be the 2nd column of the index).

    You could play around with including or not the last 2 columns to see if it's faster (less reads) to have them straight in the index. I think you'll find it's not.

  • tgarland (5/19/2011)


    I don't necessarily think this is an indexing problem. You want to try and keep the number of included columns to a minimum. Looking at your image, this appears to be a problem with your statistics. The query optimizer is estimating that 2,591 rows will be returned and that is why it is choosing the scan. Try updating your statistics with a full scan.

    Yes... but also having the right nc index for that query will also give much better stats. The stats could be 100% up to date and yet completely wrong for that query without that extra index.

    It would be a nice stude case to see if just adding the stats for those 5 columns could also speed up the query (without having to create the index). I haven't read much about that idea except a few theories.

  • I don't disagree that having the correct nonclustered index will produce better results but the current index seems to be pretty selective. Not knowing how many services there are, I don't know that will really narrow it down that much more. Pretty good disparity between estimated and actual.

  • tgarland (5/19/2011)


    I don't disagree that having the correct nonclustered index will produce better results but the current index seems to be pretty selective. Not knowing how many services there are, I don't know that will really narrow it down that much more. Pretty good disparity between estimated and actual.

    Actually the pivot point to "garantee" a seek is well under 0.5% of the data. So that's 2-3 times less data that the query is returning.

    Moreoever adding 2 more columns with = predicates should narrow it down pretty good (unless like you said it's a male / female column).

    Check this case stury out :

    http://sqlinthewild.co.za/index.php/2009/01/09/seek-or-scan/

  • Thanks guys for the speedy responses. I think it's actually a parameter sniffing issue - if I hard code the dates into the WHERE clause then I get the index seek with a key lookup without having to make any changes to the index, although I will still play around with the indexing to see if I can get a performance gain. I found a great article which seems to describe the problem I am having exactly:

    http://www.sqlsoldier.com/wp/sqlserver/misunderstoodandabusedupdatestatisticscommand

    From my initial testing, the index seek/key lookup is definitely faster than a clustered index scan. The only thing is, the only way I can seem to get the query to use an index seek without hard coding the parameters is by forcing the query to recompile every time, using Option(Recompile). Is there any way to force the index seek without having to recompile every time?

  • What's the cost of recompile VS cost of index scan? I'd take the less expansive of the 2.

    I'd also strongly consider adding my index to see if the problem goes away.

  • As a general rule, NC indexes are poor at seeking on potentially large numbers of rows (unless they account for a very low percentage of the overall table). I'd certainly give Ninja's index a try as I think this will at least help - it will certainly restrict the number of matching records more than the current index.

    Like you, I suspect that you're very close to a "tipping point", where the number of IOs needed to fetch 771 rows via a NC index is very close to the number needed to scan the entire clustered index, and the optimiser has just gone for the path of least resistance. Even if you aren't quite at that point, a different query (one covering 2 days of call data, for instance), would almost certainly become too inefficient for a NC seek.

    You definitely don't want to include all the columns in the index, because the index would be nearly as wide as the table - i.e. huge!

    I don't know the nature of the data in this table, but if I'm right in thinking that it's storing telephone call records, they will probably be added to the table in ReceivedDate order. If that's the case, is there any reason why you don't cluster the table on ReceivedDate? It's likely that new records will be added sequentially, so you shouldn't be penalised with page splitting, and a clustered index would be much more efficient at returning date-range searches than a non-clustered index. A clustered index of [ReceivedDate, Service, SentToAdastra] would be highly efficient for this particular query as it wouldn't need to do any key lookups from a NC index into the main table.

    Your Primary Key (whatever that may be) could be implemented quite happily as a non-clustered index, since it's in the nature of PKs that they are used to return single row result sets - something NC indexes are extremely efficient at. That NC index would automatically include the [ReceivedDate, Service, SentToAdastra] columns as well (as they form the clustered index key), which might prove to be a useful covering index for some other queries.

    Whether this will work well depends on the nature of the queries you run most often. If you regularly query by ReceivedDate, then that is more likely to match large numbers of records and I think this could be a good solution. If you only run this sort of query occasionally, and usually search on much more restrictive combinations of columns, then maybe an alternate clustered index would be better. We don't really have the information here to advise on that.

  • Thanks Philip for your reply, I see what you're saying. I haven't had much of a chance to play with this recently but I will certainly try Ninja's index as soon as I can.

    Currently the table is clustered on an identity column (the PK) which isn't being used at all in any of the queries (in this case stored procedures for SSRS reports). Thinking about it I guess this isn't the most efficient use of the clustered index as the majority of the SSRS reports (but not all) will be using date range searches against the CreatedDate field, so as you say it may be worthwhile changing the clustered index.

    It's certainly something I will consider going forward, thanks for taking the time to explain.

  • No problem. Just to recap - whatever you use as your clustered key, make sure that the data will be added to the table in the clustered key order. If it doesn't, and is instead added in a different non-sequential order, you will end up doing inserts into the middle of the table on data pages that may already be full. This will result in page splitting which will hurt INSERT performance and cause table fragmentation. If you add lots of data, this could be a serious reason to NOT cluster that table on that key.

  • Try to achieve Index Seek because Index seek is better than index scan

    and Included columns are best choice because in the Tree of index ,it gets the whole information from the index

    what is the index definition ?

    Regards,
    Syed Jahanzaib Bin Hassan
    BSCS | MCTS | MCITP | OCA | OCP | OCE | SCJP | IBMCDBA

    My Blog
    www.aureus-salah.com

  • Yes, but he's selecting over 100 columns, so I don't really think a covered index is an option here ...

    Also, if he's retrieving a few days worth of records about telephone calls made, there's every chance that that will match a large number of records. Considering the number of columns involved too, this makes a NC index a bad choice. A clustered index would be a more efficient option.

    Clustering on the call date will allow a clustered index seek on the StartDate, followed by a partial scan to the EndDate, if the date field is specified in the WHERE clause. Be aware, though, that depending on the date range given, this "seek" can be very large indeed. In extreme cases, it could be equivalent to the entire table (if the start and end date ranges equal the first and last rows in the table). It will still show as an index SEEK though, even though it's effectively an entire clustered index scan, because technically that's what the optimiser is doing.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply