Query Optimizer question

  • Hello all,

    I have a query with a single join to a table (t2) that contains five records that performs badly (completes after more than 1 minute). The other table (t1) has more than 9 million records. If I remove the join and not use t2 and its 5 records, the query speeds up considerably (runs in about 2 seconds).

    Table 1 has a clustered index on col1, and col2 and col3 have non-clustered indexes on them.

    Table 2 has a clustered index on the ID column and it looks like:

    ID name

    1 aa

    2 bb

    3 cc

    4 dd

    5 ee

    The query looks like this:

    SELECT DISTINCT t1.col1,t1.col2,t1.col3,t1.entry_date,t1.col4,t2.name from t1 inner join t2 on t1.col4=t2.id

    WHERE t1.col1 IS NOT NULL AND t2.name = 'aa' AND t1.col3='zzz' AND t1.entry_date>='05/31/2006'

    When looking at the execution plan, I see that the first thing that happens is an index seek on t1.entry_date>='05/31/2006' and then index seek on col2 with an estimated row count of more than 9 million rows and then an index seek on col3, again with an estimated row count of more than 9 million records.

    Why would the optimizer use those indexes as opposted to selecting the columns based on the first index seek only? If I remove the join, it does not happen. Instead, the first index seek happens, and the results are filtered.

    Any ideas?

  • hi Trin,

    You don't mention an index on t1.entry_date in your description, yet you say that the query plan does an index seek via this index? What does this index look like? (Clearly it must be non-clustered, with at least the entry_date column, but are there andy additional columns?)

    I'm a little vague on how you just "remove the join", too. Are you saying that your modified query looks like:

    SELECT DISTINCT t1.col1,

    t1.col2,

    t1.col3,

    t1.entry_date,

    t1.col4,

    fromt1

    WHEREt1.col1 IS NOT NULL

    ANDt1.col3='zzz'

    ANDt1.entry_date>='05/31/2006'

    Could you run the query with SET SHOWPLAN_TEXT ON , and then post the output together with the actual structure of your tables and indexes?

    I don't think this is a difficult one to explain, but I just need to be sure that all the assumptions are correct first!

  • Hello Philip,

    Yes, there is an index on t1.entry_date (non-clustered) with no addtional columns.

    When I said "remove the join" I actually meant to replace it with an extra search argument:

    SELECT DISTINCT t1.col1,

    t1.col2,

    t1.col3,

    t1.entry_date,

    t1.col4,

    from t1

    WHERE t1.col1 IS NOT NULL

    AND t1.col3='zzz'

    AND t1.entry_date>='05/31/2006'

    AND t1.col4=1

    This, beacause col4 contains a value equal to the ID column in t2.

    The execution plan for the first query (including join) is:

    StmtText

    SELECT DISTINCT

                t1.col1,

                t1.col2, 

         t1.col3,

                t1.entry_date, 

                t2.name,

                t1.col4

    FROM t1 inner joi

    (1 row(s) affected)

    StmtText

      |--Parallelism(Gather Streams)

           |--Hash Match(Inner Join, HASH[t2].[ID])=([t1].[col4]),

    RESIDUAL[t2].[ID]=[t1].[col4]))

                |--Parallelism(Broadcast)

                |    |--Index Seek(OBJECT[db1].[dbo].[t2].[IX_Name]),

    SEEK[t2].[name]='aa') ORDERED FORWARD)

                |--Hash Match(Inner Join, HASH[t1].[col1])=([t1].[col1]))

                     |--Bitmap(HASH[t1].[col1]), DEFINE[Bitmap1006]))

                     |    |--Hash Match(Inner Join, HASH[t1].[col1])=([t1].[col1]))

                     |         |--Bitmap(HASH[t1].[col1]), DEFINE[Bitmap1005]))

                     |         |    |--Hash Match(Inner Join,

    HASH[t1].[col1])=([t1].[col1]))

                     |         |         |--Bitmap(HASH[t1].[col1]), DEFINE[Bitmap1004]))

                     |         |         |    |--Parallelism(Repartition Streams, PARTITION

    COLUMNS[t1].[col1]))

                     |         |         |         |--Index

    Seek(OBJECT[db1].[dbo].[t1].[T1EntryDate]), SEEK[t1].[entry_date] >= 'Mar 31 2006

    12:00AM') ORDERED FORWARD)

                     |         |         |--Parallelism(Repartition Streams, PARTITION

    COLUMNS[t1].[col1]), WHEREPROBE([Bitmap1004])=TRUE))

                     |         |              |--Index

    Seek(OBJECT[db1].[dbo].[t1].[IX_t1_col3]), SEEK[t1].[col3]='zzz') ORDERED FORWARD)

                     |         |--Parallelism(Repartition Streams, PARTITION

    COLUMNS[t1].[col1]), WHEREPROBE([Bitmap1005])=TRUE))

                     |              |--Index Scan(OBJECT[db1].[dbo].[t1].[IX_t1_col2]))

                     |--Parallelism(Repartition Streams, PARTITION COLUMNS[t1].[col1]),

    WHEREPROBE([Bitmap1006])=TRUE))

                          |--Index Scan(OBJECT[db1].[dbo].[t1].[T1col4]))

    (18 row(s) affected)

     

    And the execution plan for the second query, after removing the join is significately different:

    StmtText

    SELECT DISTINCT

                t1.col1,

                t1.col2, 

         t1.col3,

                t1.entry_date, 

                t1.col4

    FROM t1

    WHERE t1.col1 IS NOT NULL

    (1 row(s) affected)

    StmtText

      |--Parallelism(Gather Streams)

           |--Filter(WHERE[t1].[col3]='zzz'))

                |--Bookmark Lookup(BOOKMARK[Bmk1000]), OBJECT[db1].[dbo].[t1]))

                     |--Hash Match(Inner Join, HASH[t1].[col1])=([t1].[col1]))

                          |--Bitmap(HASH[t1].[col1]), DEFINE[Bitmap1002]))

                          |    |--Parallelism(Repartition Streams, PARTITION COLUMNS[t1].[col1]))

                          |         |--Index Seek(OBJECT[db1].[dbo].[t1].[T1col4]), SEEK[t1].[col4]=1)

    ORDERED FORWARD)

                          |--Parallelism(Repartition Streams, PARTITION COLUMNS[t1].[col1]),

    WHEREPROBE([Bitmap1002])=TRUE))

                               |--Index Seek(OBJECT[db1].[dbo].[t1].[T1EntryDate]),

    SEEK[t1].[entry_date] >= 'Mar 31 2006 12:00AM') ORDERED FORWARD)

    (9 row(s) affected)

     

    In this case, the indexes on col2 and col3 are not touched. Instead, a filter is applied.

    Thanks for your help! 

  • >>When I said "remove the join" I actually meant to replace it with an extra search argument

    By doing that, you are creating an entirely different scenario for the optimizer and different plans are to be expected.

    When you supply 1 hard-coded value, the optimizer looks at the statistics and distribution of data in t1.col4 and determines that an index seek is appropriate due to the selectivity.

    Whne you join t1.col4 to a table containing 5 values, the optimizer determines that the number of rows exceeds the selectivity threshold and an index seek is not used.

    You're really comparing apples and oranges.

     

  • What about the index seek on columns col2 and col3 for the first query? in both cases those columns are selected, they have indexes, and they are not in the search argument. The cost of doing an index seek on more than 9 million rows, twice, is big, and I believe accounts for a great portion of the delay of the first query.

    Any thoughts on that?

    Thanks in advance.

  • I can't see an index seek on col2. There is an index seek on col3, but this is in the Search argument, so doesn't come as a total surprise:

    WHERE t1.col1 IS NOT NULL

    AND t1.col3='zzz'

    If there are 9 million rows in table t1, then an index seek won't touch 9 million rows (otherwise it would be an index scan). The fact that it is SEEKing via a NON-clustered index means that it is only touching a very small proportion of the total records, and so should have a comparatively low cost compared to that of scanning the entire table.

    As PW says, you can't change the query completely (by removing the join) and expect the query plan to look the same. It will look quite different, as you've demonstrated!

    Is t1.col1 supposed to be a primary key? If it is, then it should have a PK constraint defined on it. This would mean that you could remove the "t1.col1 IS NOT NULL" clause, since NULL values would not be allowed.

    I'd also make the index on col1 NON-clustered (even if it's a primary key). Put the clustered index on entry_date instead. That will be far more effective at restricting the matching records to those required for your Search Argument.

  • Hello Phillip,

    Thanks for your comments.

    The PK in table 1 is not on col1, but on a group of columns. I agree that it would be more beneficial to have the clustered index on entry_date, since queries based on date ranges are the most often requested. Since it is not my decision to make, I will see what I can do about that.

    When showing the graphical execution plan there is an index seek on col3 and it shows a row count of more than 9 million rows. The index seek on col1 is smaller, with about 800,000 rows.

    Thank you again!!

     

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply