[Column] LIKE 'VALUE%' indexing?!

  • I have a query I'm trying to run on SQL Azure.

    The table being searched is two columns wide. One column is called Keyword, the other is called ItemId.

    The PK of the table is [Keyword, ItemId].

    The query SELECT ItemId FROM [Table] WHERE Keyword = 'X' works great.

    But the query SELECT ItemId FROM [Table] WHERE Keyword LIKE 'X%' drags terribly.

    I'd assumed the second query would work almost as well as the first because the start of the Keywords should be as index-seekable as the full keyword. Was I wrong in this assumption?

    (NOTE: SELECT ItemId FROM [Table] WHERE LEFT(Keyword, LEN(X)) = 'X' drags just as bad)

  • Please post table definitions, index definitions and execution plans of both of those queries, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • TABLE Definition:

    CREATE TABLE [dbo].[DescriptionKeywordLookup](

    [Keyword] [nvarchar](255) NOT NULL,

    [ItemId] [int] NOT NULL,

    CONSTRAINT [PK_DescriptionKeywordLookup] PRIMARY KEY CLUSTERED

    (

    [Keyword] ASC,

    [ItemId] ASC

    )

    )

    --------------------------------------------------------

    Execution Plans attached.

    Runtime for WHERE = is 27s

    Runtime for WHERE Like is 1m36s

  • Query 1: select * from descriptionkeywordlookup where keyword = 'X' or keyword = 'Y'

    Query 2: select * from descriptionkeywordlookup where keyword like 'X%' or keyword like 'Y%'

  • While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match

    It's not that uncommon to find OR queries that 'break' the optimiser.

    You may be better off in this case using UNION.

    SELECT keyword ,

    item

    FROM descriptionkeywordlookup

    WHERE keyword LIKE 'series%'

    UNION ALL

    SELECT keyword ,

    item

    FROM descriptionkeywordlookup

    WHERE keyword LIKE '-%'

    I'm using Union All rather than union because there is no way that rows can overlap in this case.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Well to add some further context, it's part of a stored procedure that does this:

    Input is @SearchString nvarchar(255)

    Separate @SearchString into table variable

    E.g. 'This Is a Search String' becomes

    Keyword

    ---------

    This

    Is

    A

    Search

    String

    Then I inner join that @Keyword table to my Items table. The INNER JOIN drags like the OR.

    Also the UNION ALL took roughly the same amount of time regardless. It's very confusing. Thanks for your help btw.

  • Bumping

  • aurato (1/9/2012)


    Bumping

    This is not paid work, you're not our #1 priority.

    That fee is 125$ / hour and up if you're interested.

  • There's still a question? Thought from your comment you were happy.

    Want anything useful? Post actual execution plans and queries and table designs. (p.s. you have huge tables, don't expect instant results)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (1/9/2012)


    While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match...

    It's nothing to do with the OR. SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not. If the LIKE terms are constants or run-time constants of the form '<string>%', the optimizer will resolve the predicate to either one or two seeks in the same operator, depending on whether one range subsumes the other or not. For example "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'ser%'" results in a single range seek, whereas "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'set%'" results in two distinct seek ranges in the same operator. Example:

    SELECT

    dkl.Keyword,

    dkl.ItemId

    FROM dbo.DescriptionKeywordLookup AS dkl WITH (FORCESEEK)

    WHERE

    dkl.Keyword LIKE N'series%' OR dkl.Keyword LIKE N'snow%'

    There is a specific issue with searching on '-%', which appears to be a collation subtlety (or perhaps a SQL Server bug) related to the '-' character. This specific issue (failure to find the end of the LIKE range) does not occur if the column has a binary collation, for example. The other thing is I noticed is that the example queries specify single-byte character strings in the LIKE predicates. It's not a big concern here, but they really ought to be Unicode strings, e.g. N'series%'.

  • aurato (1/9/2012)


    Also the UNION ALL took roughly the same amount of time regardless. It's very confusing.

    It's a shame Azure does not support full-text indexing yet. Question: do you see poor performance just for the LIKE N'-%' query, or is it a problem for all search strings? As I mentioned previously, LIKE N'-%' results in a very inefficient search for some reason, but other cases optimize well and ought to be relatively quick.

    A table variable may not be the best choice for your JOIN query, though it is hard to speculate on so few details. If general LIKE searches are acceptably fast (except N'-%') you'll need to share some more details about that procedure. You might like to try a temporary table instead of a table variable, just to see if statistics availability helps whatever plan you are getting. Alternatively, post the specific problematic plans!

  • SQL Kiwi (1/9/2012)


    SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not.

    I know it can, but I've seen it do strange things with ORs before so...

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (1/9/2012)


    There's still a question? Thought from your comment you were happy.

    Want anything useful? Post actual execution plans and queries and table designs. (p.s. you have huge tables, don't expect instant results)

    I thought my post indicated that I was still stuck with the slowness. Sorry I didn't get that across.

    Ninja's_RGR'us (1/9/2012)


    aurato (1/9/2012)


    Bumping

    This is not paid work, you're not our #1 priority.

    That fee is 125$ / hour and up if you're interested.

    Is bumping considered rude on this forum? I was worried that my question was considered closed, so I bumped. Judging by the above quote, I was correct to.

    SQL Kiwi (1/9/2012)


    GilaMonster (1/9/2012)


    While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match...

    It's nothing to do with the OR. SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not. If the LIKE terms are constants or run-time constants of the form '<string>%', the optimizer will resolve the predicate to either one or two seeks in the same operator, depending on whether one range subsumes the other or not. For example "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'ser%'" results in a single range seek, whereas "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'set%'" results in two distinct seek ranges in the same operator. Example:

    SELECT

    dkl.Keyword,

    dkl.ItemId

    FROM dbo.DescriptionKeywordLookup AS dkl WITH (FORCESEEK)

    WHERE

    dkl.Keyword LIKE N'series%' OR dkl.Keyword LIKE N'snow%'

    There is a specific issue with searching on '-%', which appears to be a collation subtlety (or perhaps a SQL Server bug) related to the '-' character. This specific issue (failure to find the end of the LIKE range) does not occur if the column has a binary collation, for example. The other thing is I noticed is that the example queries specify single-byte character strings in the LIKE predicates. It's not a big concern here, but they really ought to be Unicode strings, e.g. N'series%'.

    This post cleared up basically all of my remaining confusion. Thank you very much.

    SQL Kiwi (1/9/2012)


    aurato (1/9/2012)


    Also the UNION ALL took roughly the same amount of time regardless. It's very confusing.

    It's a shame Azure does not support full-text indexing yet. Question: do you see poor performance just for the LIKE N'-%' query, or is it a problem for all search strings? As I mentioned previously, LIKE N'-%' results in a very inefficient search for some reason, but other cases optimize well and ought to be relatively quick.

    A table variable may not be the best choice for your JOIN query, though it is hard to speculate on so few details. If general LIKE searches are acceptably fast (except N'-%') you'll need to share some more details about that procedure. You might like to try a temporary table instead of a table variable, just to see if statistics availability helps whatever plan you are getting. Alternatively, post the specific problematic plans!

    The LIKE query without the '-' works fine. Thanks again. I'm going to retry my procedure and omit the hyphen to see if that's what was causing the whole thing.

  • Bumping is always frowned upon here (unless the next day or something like that).

    I didn't think this was done neither did Paul.

  • aurato (1/9/2012)


    This post cleared up basically all of my remaining confusion. Thank you very much.

    You're welcome.

    aurato (1/9/2012)


    The LIKE query without the '-' works fine. Thanks again. I'm going to retry my procedure and omit the hyphen to see if that's what was causing the whole thing.

    I'm thinking the issue with the hyphen (or dash) character has to do with interpreting it in a culture-aware fashion. I'll post back if I get something more concrete on this. By the way, unless you really need Unicode with full language and culture aware comparisons, I would consider using VARCHAR instead of NVARCHAR - SQL collations typically compare and sort several times faster than Unicode collations. Faster still are the binary collations, but these are not suitable unless your LIKE comparisons can tolerate a strict byte-by-byte comparison (e.g. 'q' = 'Q'). Obviously, changing data type and/or collation would be a big job, given the size of the data, but if it is appropriate for you, it could be a big performance win. Another idea would be to hash words to integers, and seek on integer hash values.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply