Binary Index Question

  • I've been requested by one of our "developers" to place an index on a table column. The column is a "yes/no" column where a very small percentage (less than 3% of the records in a table with 7+ million rows will be "N").

    I seem to recall that this is probably not a great solution for several reasons. If someone might be able to enumerate some of those reasons I'd like to be able ot go back to the developers and present my case for creating another table that would contain rows which reflect the "Y" records.

    Thanks!

    Glenn

  • If you're serching for 'Y', the query engine is going to do a table scan anyway but it would use the index for 'N'. So it depends on the queries that you will be running.

  • With most DBMSs, a low cardinality index very seldom gives a performance advantage. 

    When a DBMS is deciding to use an index, it will first calculate the I-O 'cost' of a table scan.  Every extent it has to scan costs 1 I-O.  It will then calculate the I-O cost of using the index, and if using the index saves I-O operations it will choose the index.  Therefore, if your data occurs so frequently in the table that every extent has to be accessed, there is no I-O saving in using the index - there is actually a cost because the index has to be acessed as well.  In this situation SQL should do a table scan.

    If you make a low cardinality index the clustered index then you can save I-O, but it is unlikely that a low-cardinality index is the best choice for the cluster index.  In DB2 you have multi-dimensional clustering, and a low-cardinality MDC index can be good for performance.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • The answer to this really depends on too many things to give a straight "do or don't do" type of answer.  I've had situations where it was beneficial to have an index on a boolean, but only if the conditions in a select statement based on the boolean would only return a very small number of rows.  In this case, the optimizer seeks the index to create a bookmark list and then does the lookup in the clustered index.

    I have tried to duplicate the results by using a table containing the clustering keys of the low-density "marked" rows, but it was definitely faster to use an index on a boolean column.

    Worth of note is that if you intend to use an index on bit column, you need to perform an explicit conversion to the bit datatype in the where clause.  Like so:

    Delete foo where myBitColumn = convert(bit,1)

    instead of

    Delete foo where myBitColumn = 1

     

    Don't know why the optimizer doesn't implicilty convert the "1" to a bit, but it doesn't.

    Here's a little something to illustrate the differences:

    --------------------------------------------------------

    create table foo (

    bar int primary key clustered,

    bim int,

    bat bit)

    CREATE

      INDEX [blarg] ON [dbo].[foo] ([bat])

    Insert foo

    select 1,1,0

    union

    select 2,2,0

    union

    select 3,3,1

    go

    Create table fizz (      

    bar int primary key clustered)

    Insert fizz

    select 3

    go

    set showplan_all on

    go

    select * from foo where bat = convert(bit,1)

    select * from foo where bat = 1

    select a.* from foo a

    Inner join fizz b on a.bar=b.bar

    go

    set showplan_all off

    go

    drop table foo

    drop table fizz

    ---------------------------------

    The example illustrates the differences between using an indexed bit column vs an auxillary table, and shows that in this simple case, a bit index can be useful.  From what I've seen, this does scale to large numbers of rows.

    hth jg

     

  • I've sometimes made gains by making the bit column part of another index. I often encounter tables which have a column to designate current or active data. Every query always ends in " and where xx.current =1 "  ( n.b. the bits work better if the query is written where current='1'  ) or worse still  " where xx.cuurent=1 and yyy.current=1 and zzz.current=1 "

    In certain circumstances these cause scans, indexes can certainly improve matters but it's a case of test and try. Many "DBAs" think you can't index a bit as EM will not allow you ( well sql2k anyway ) 

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Thanks all - this has all been great input!

    However, lest we get to far afield... the index column is a character (char1) containing a "Y" or "N". Of the 7+million records approximately 30,000 will be "Y" the rest will be "N". So, the questions is: do I promote the idea of creating a separate table to contain information about the ~30,000 "Y" records which will be INNER JOINED for queries or do I add an index on the "Y/N" column in the table?

    Thanks

    Glenn

  • If you are selecting the 'Y' records, then I would think an index would help. I don't like the idea of a separate table if, sometimes, you want all records regardless of flag or, especially, if the flag can change - you would have to move records between tables.

    On the other hand, if these are genuinely separate entities, then they should be in separate tables anyway. Are there any normalisation issues here?

  • The short answer is try it, test it. If it works/helps good  if not forget it!!

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply