[Column] LIKE 'VALUE%' indexing?!

Question

[Column] LIKE 'VALUE%' indexing?!

aurato

SSCrazy

Points: 2575
More actions
January 9, 2012 at 7:38 am

#249301

I have a query I'm trying to run on SQL Azure.
The table being searched is two columns wide. One column is called Keyword, the other is called ItemId.
The PK of the table is [Keyword, ItemId].
The query SELECT ItemId FROM [Table] WHERE Keyword = 'X' works great.
But the query SELECT ItemId FROM [Table] WHERE Keyword LIKE 'X%' drags terribly.
I'd assumed the second query would work almost as well as the first because the start of the Keywords should be as index-seekable as the full keyword. Was I wrong in this assumption?
(NOTE: SELECT ItemId FROM [Table] WHERE LEFT(Keyword, LEN(X)) = 'X' drags just as bad)

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

Please post table definitions, index definitions and execution plans of both of those queries, as per http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

aurato SSCrazy Points: 2575 More actions · Answer 2

TABLE Definition:

CREATE TABLE [dbo].[DescriptionKeywordLookup](

[Keyword] [nvarchar](255) NOT NULL,

[ItemId] [int] NOT NULL,

CONSTRAINT [PK_DescriptionKeywordLookup] PRIMARY KEY CLUSTERED

(

[Keyword] ASC,

[ItemId] ASC

)

--------------------------------------------------------

Execution Plans attached.

Runtime for WHERE = is 27s

Runtime for WHERE Like is 1m36s

aurato SSCrazy Points: 2575 More actions · Answer 3

Query 1: select * from descriptionkeywordlookup where keyword = 'X' or keyword = 'Y'

Query 2: select * from descriptionkeywordlookup where keyword like 'X%' or keyword like 'Y%'

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 4

While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match

It's not that uncommon to find OR queries that 'break' the optimiser.

You may be better off in this case using UNION.

SELECT keyword ,

item

FROM descriptionkeywordlookup

WHERE keyword LIKE 'series%'

UNION ALL

SELECT keyword ,

item

FROM descriptionkeywordlookup

WHERE keyword LIKE '-%'

I'm using Union All rather than union because there is no way that rows can overlap in this case.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

aurato SSCrazy Points: 2575 More actions · Answer 5

Well to add some further context, it's part of a stored procedure that does this:

Input is @SearchString nvarchar(255)

Separate @SearchString into table variable

E.g. 'This Is a Search String' becomes

Keyword

---------

This

Is

A

Search

String

Then I inner join that @Keyword table to my Items table. The INNER JOIN drags like the OR.

Also the UNION ALL took roughly the same amount of time regardless. It's very confusing. Thanks for your help btw.

aurato SSCrazy Points: 2575 More actions · Answer 6

aurato

SSCrazy

Points: 2575

January 9, 2012 at 8:38 am

#1430520

Bumping

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 7

aurato (1/9/2012)
Bumping

This is not paid work, you're not our #1 priority.

That fee is 125$ / hour and up if you're interested.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 8

There's still a question? Thought from your comment you were happy.

Want anything useful? Post actual execution plans and queries and table designs. (p.s. you have huge tables, don't expect instant results)

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Paul White SSC Guru Points: 150468 More actions · Answer 9

GilaMonster (1/9/2012)
While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match...

It's nothing to do with the OR. SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not. If the LIKE terms are constants or run-time constants of the form '<string>%', the optimizer will resolve the predicate to either one or two seeks in the same operator, depending on whether one range subsumes the other or not. For example "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'ser%'" results in a single range seek, whereas "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'set%'" results in two distinct seek ranges in the same operator. Example:

SELECT

dkl.Keyword,

dkl.ItemId

FROM dbo.DescriptionKeywordLookup AS dkl WITH (FORCESEEK)

WHERE

dkl.Keyword LIKE N'series%' OR dkl.Keyword LIKE N'snow%'

There is a specific issue with searching on '-%', which appears to be a collation subtlety (or perhaps a SQL Server bug) related to the '-' character. This specific issue (failure to find the end of the LIKE range) does not occur if the column has a binary collation, for example. The other thing is I noticed is that the example queries specify single-byte character strings in the LIKE predicates. It's not a big concern here, but they really ought to be Unicode strings, e.g. N'series%'.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 10

aurato (1/9/2012)
Also the UNION ALL took roughly the same amount of time regardless. It's very confusing.

It's a shame Azure does not support full-text indexing yet. Question: do you see poor performance just for the LIKE N'-%' query, or is it a problem for all search strings? As I mentioned previously, LIKE N'-%' results in a very inefficient search for some reason, but other cases optimize well and ought to be relatively quick.

A table variable may not be the best choice for your JOIN query, though it is hard to speculate on so few details. If general LIKE searches are acceptably fast (except N'-%') you'll need to share some more details about that procedure. You might like to try a temporary table instead of a table variable, just to see if statistics availability helps whatever plan you are getting. Alternatively, post the specific problematic plans!

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 11

SQL Kiwi (1/9/2012)
SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not.

I know it can, but I've seen it do strange things with ORs before so...

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

aurato SSCrazy Points: 2575 More actions · Answer 12

GilaMonster (1/9/2012)
There's still a question? Thought from your comment you were happy.
Want anything useful? Post actual execution plans and queries and table designs. (p.s. you have huge tables, don't expect instant results)

I thought my post indicated that I was still stuck with the slowness. Sorry I didn't get that across.

Ninja's_RGR'us (1/9/2012)
aurato (1/9/2012)
Bumping
This is not paid work, you're not our #1 priority.
That fee is 125$ / hour and up if you're interested.

Is bumping considered rude on this forum? I was worried that my question was considered closed, so I bumped. Judging by the above quote, I was correct to.

SQL Kiwi (1/9/2012)
GilaMonster (1/9/2012)
While that's listed as a seek (for the like), it's almost a full table scan. The predicate is Keyword > '-', which every single entry starting with A..Z, a..z, so SQL's pretty much reading the entire table. It's because of the OR, the optimiser's decided for whatever reason to do a very non-restrictive seek predicate as the index seek, then filter out the rows that don't match...
It's nothing to do with the OR. SQL Server can quite happily seek on two LIKE predicates, whether they overlap or not. If the LIKE terms are constants or run-time constants of the form '<string>%', the optimizer will resolve the predicate to either one or two seeks in the same operator, depending on whether one range subsumes the other or not. For example "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'ser%'" results in a single range seek, whereas "Keyword LIKE N'series%' OR dkl.Keyword LIKE N'set%'" results in two distinct seek ranges in the same operator. Example:
SELECT
dkl.Keyword,
dkl.ItemId
FROM dbo.DescriptionKeywordLookup AS dkl WITH (FORCESEEK)
WHERE
dkl.Keyword LIKE N'series%' OR dkl.Keyword LIKE N'snow%'
There is a specific issue with searching on '-%', which appears to be a collation subtlety (or perhaps a SQL Server bug) related to the '-' character. This specific issue (failure to find the end of the LIKE range) does not occur if the column has a binary collation, for example. The other thing is I noticed is that the example queries specify single-byte character strings in the LIKE predicates. It's not a big concern here, but they really ought to be Unicode strings, e.g. N'series%'.

This post cleared up basically all of my remaining confusion. Thank you very much.

SQL Kiwi (1/9/2012)
aurato (1/9/2012)
Also the UNION ALL took roughly the same amount of time regardless. It's very confusing.
It's a shame Azure does not support full-text indexing yet. Question: do you see poor performance just for the LIKE N'-%' query, or is it a problem for all search strings? As I mentioned previously, LIKE N'-%' results in a very inefficient search for some reason, but other cases optimize well and ought to be relatively quick.
A table variable may not be the best choice for your JOIN query, though it is hard to speculate on so few details. If general LIKE searches are acceptably fast (except N'-%') you'll need to share some more details about that procedure. You might like to try a temporary table instead of a table variable, just to see if statistics availability helps whatever plan you are getting. Alternatively, post the specific problematic plans!

The LIKE query without the '-' works fine. Thanks again. I'm going to retry my procedure and omit the hyphen to see if that's what was causing the whole thing.

Ninja's_RGR'us SSC Guru Points: 294069 More actions · Answer 13

Bumping is always frowned upon here (unless the next day or something like that).

I didn't think this was done neither did Paul.

Paul White SSC Guru Points: 150468 More actions · Answer 14

aurato (1/9/2012)
This post cleared up basically all of my remaining confusion. Thank you very much.

You're welcome.

aurato (1/9/2012)
The LIKE query without the '-' works fine. Thanks again. I'm going to retry my procedure and omit the hyphen to see if that's what was causing the whole thing.

I'm thinking the issue with the hyphen (or dash) character has to do with interpreting it in a culture-aware fashion. I'll post back if I get something more concrete on this. By the way, unless you really need Unicode with full language and culture aware comparisons, I would consider using VARCHAR instead of NVARCHAR - SQL collations typically compare and sort several times faster than Unicode collations. Faster still are the binary collations, but these are not suitable unless your LIKE comparisons can tolerate a strict byte-by-byte comparison (e.g. 'q' = 'Q'). Obviously, changing data type and/or collation would be a big job, given the size of the data, but if it is appropriate for you, it could be a big performance win. Another idea would be to hash words to integers, and seek on integer hash values.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi