Advantage to creating an Index when searching for NULLs?

  • Hello,

    I was just asked a question by my boss. Is there an advantage to indexing a column if you are going to search on it like this:

         SELECT * FROM tblA

         WHERE column1 IS NULL

    Assuming that it is selective (few NULL values compared to the overall number of rows in the table).

    Is it likely that creating an index in tblA.column1 will help the query?

  • Why don't you run a simple test and show results to your boss?

    _____________
    Code for TallyGenerator

  • "Non-sargable search arguments in the WHERE clause, such as "IS NULL", "", "!=", "!>", "!<", "NOT", "NOT EXISTS", "NOT IN", "NOT LIKE", and "LIKE '%500'" generally prevents (but not always) the query optimizer from using an index to perform a search." So no...it won't help because it most likely will not be used anyway...

    src: http://www.sql-server-performance.com/transact_sql.asp


    Cheers,

    Ben Sullins
    bensullins.com
    Beer is my primary key...

  • Ben, I would suggest to you as well to run a simple test.

    You cannot believe how much stuff you can find in Internet is absolute cr.p.

    And you probably missed this: "generally prevents (but not always)"

    _____________
    Code for TallyGenerator

  • I agree that test is the best way to find out.

    I will not quote any authorities (true or would-be), but based on my experience, if the table is large and does not contain many NULLs in the column, such index would help a lot.

  • I would try to simulate it, but the scenario involves data to be populated in a table in the future. We're talking about 100s of thousands of rows, with only a few hundred rows will NULL values in the column we're looking for.

    When my boss asked me the question, I thought the index wouldn't help. I just wanted to find out if anyone had any experience or knew of any definitive answer (I tried googling it, but I got a bunch of non-related results).

    Ben, thanks for posting a constructive answer.

  • Your best bet here really is to run a test of your own as Sergiy and Vladan have stated.  While you may find sources that say that it may or may not help, I would advise taking it with a grain of salt and testing it for yourself especially since this would be an easy mock up for you.  Even with 1 million rows, you could create a simple test to prove or disprove the effects of adding the index with relative ease.  As a rule, even when it makes total sense that a certain indexing scenario should provide improvements, I always run a test so I have proof that a change that I recommend is going to help.

    John Rowan

    ======================================================
    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help[/url] - by Jeff Moden

  • I've got a small table with about 9k rows, and it has a column that has about 3k of nulls. When I select top 50 where column is a null, I get nice index seek against the column index. Remove the top and it changes to a clustered scan. No final answer there, but it would appear in my test that the index does help for the Top N cases.

    I think its fair to point out the value of testing, but it's also nice to just ask a question and have a conversation about it.

  • It's not exacly about TOP N.

    It's about shortest way to the target.

    When you do SELECT * in a query involving non-clustered index seek SQL Server needs to perform bookmark lookup in order to return all requested values.

    In case of TOP 50 the cost of index seek + lookup is smaller than the cost of clustered index scan.

    If you don't have TOP 50 statistics will tell optimiser that it's gonna be lookup for 30% of rows and it's gonna be faster to perform single clustered index scan than index seek with following bookmark lookup.

    Change your index used in WHERE clause to be clustered and it will be "clustered index seek" in both occasions.

    _____________
    Code for TallyGenerator

  • And, if you do have a clustered index in the right spot...

    GO

    --===== Create and populate a 1,000,000 row test table.

         -- Column RowNum has a range of 1 to 1,000,000 unique numbers

         -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

         -- Column "SomeString" has a range of "AA" to "ZZ" non-unique 2 character strings

         -- Column "SomeNumber has a range of 0.0000 to 99.9999 non-unique numbers

         -- Column "SomeDate" has a range of  >=01/01/2000 and <01/01/2010 non-unique date/times

         -- Takes about 42 seconds to execute.

     SELECT TOP 1000000

            RowNum     = IDENTITY(INT,1,1),

            SomeInt    = CAST(RAND(CAST(NEWID() AS VARBINARY))*50000+1 AS INT),

            SomeString = CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))

                       + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)),

            SomeCSV    = 'Part1,Part2,Part3,Part4,Part5,Part6,Part7,Part8,Part9,Part10',

            SomeNumber = CAST(RAND(CAST(NEWID() AS VARBINARY))*100 AS MONEY),

            SomeDate   = CAST(RAND(CAST(NEWID() AS VARBINARY))*3653.0+36524.0 AS DATETIME)

       INTO dbo.JBMTest

       FROM Master.dbo.SysColumns sc1,

            Master.dbo.SysColumns sc2 --Lack of join criteria makes this a CROSS-JOIN

    --===== A table is not properly formed unless a Primary Key has been assigned

      ALTER TABLE dbo.JBMTest

            ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Create an index on the test column

     CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt)

    --===== Change some of the data in the test column to NULL (more than 30%)

     UPDATE dbo.JBMTest

        SET SomeInt = NULL

      WHERE SomeInt <= 16000

    --======================================================================

    --  Run some tests... "*" sucks...

    --======================================================================

    SET STATISTICS IO ON   

     SELECT * --===== Test1

       FROM dbo.JBMTest

      WHERE SomeInt IS NULL

     SELECT RowNum,SomeInt --===== Test2

       FROM dbo.JBMTest

      WHERE SomeInt IS NULL

     SELECT TOP 50 * --===== Test3

       FROM dbo.JBMTest

      WHERE SomeInt IS NULL

     SELECT TOP 50 RowNum,SomeInt  --===== Test4

       FROM dbo.JBMTest

      WHERE SomeInt IS NULL

    SET STATISTICS IO OFF 

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Of course if you drop the indexes altogether, 3 of the 4 tests should be slightly faster than with the indexes (the second won't be, as the columns are covered). That obviously only applies in certain data scenarios, but this happens to be one of them.

  • Exactly. For example, having read the cited article before, I was never able to understand why something like "in" and "exists" are sargable but "not in" and "not exists" are not. The "not" only takes the result of the "in" or "exists" and inverts it. If one is sargable, the other has to be.

    As for the specific question, I would fully expect an index to help. Null values would appear before any other value so the optimizer should know that as soon as a non-null appears, the search is over.

    However, yes, do perform a test. My expectations have been dashed before...

    Tomm Carr
    --
    Version Normal Form -- http://groups.google.com/group/vrdbms

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply