Heap vs Clustered Wildcard Search

  • I am working on converting a heap table to a clustered table by adding a primary key to the identity column:

    TestTable
    ---------
    ID
    ItemCode
    Descr
    (20 more columns; etc.)

    Right now the table is a heap and I have a query that does a wildcard against the ItemCode column (varchar(250)):
    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%';

    On a fresh SQL restart, this query takes about 25 seconds the first time, but subsequent runs take 1-2 seconds. After I add the primary key to convert the heap to a clustered table and run the same query, the first run takes about the same amount of time, but subsequent runs take 5-10 seconds.

    The execution plans basically show a Table Scan on TestTable for the first and a Clustered Index Scan for the second. Why would the heap return faster results on subsequent queries than the clustered table?

    Also, adding an index to the ItemCode column and changing the query to LIKE 'something%' is super fast for both.

    I really want to get this table over to a clustered index, but doing so slows down the query we use. Any suggestions?

  • Since you have a search like that, it can't possibly use any indexes, so adding one is unlikely to speed things up. At least one contributing factor is that a clustered index has to walk the tree as well as scan all the pages, while a heap just scans the pages. In terms of just reading everything, every time, a heap is likely to be faster. If you can't modify that code to eliminate the need to scan everything, the only way to speed things up will be to get more memory and buy faster disks.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for the reply. The heap is definitely faster; however, the problem with leaving it as a heap is that our software uses the ID column primarily for joins, selects and updates, so the benefit of the clustered index is what we are after. Would adding a full-text index to the ItemCode column address this if we switch to a clustered index and change our T-SQL to CONTAINS? Or is it better practice to avoid the wildcard prefix altogether (although it is helpful for our application)?

  • To expand on what Grant said, it's the leading % in your search.  The B-Tree of any index is in the order of the key columns.  Strings are sorted by the first characters.  There's no way the optimizer could do a seek on anything when searching against an expression with a leading wildcard because that's not the order of the data.

    An NCI could be used to cover a query and the optimizer might use it to reduce the number of reads, but it will always be a scan as long as you have a leading wildcard in there.
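
    To make that concrete, a covering index for this one query might look like the following. This is just a sketch (the index name is assumed); the leading wildcard still forces a scan, but the scan reads this narrow structure instead of the whole wide table:

    -- Hypothetical covering index: still scanned with LIKE '%something%',
    -- but far fewer pages than the full table (ID is the clustering key,
    -- so it rides along automatically once the table is clustered).
    CREATE NONCLUSTERED INDEX IX_TestTable_ItemCode
        ON dbo.TestTable (ItemCode)
        INCLUDE (Descr);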

  • A full text index might help. They're mainly meant for searching within books, that sort of thing, not so much for a short varchar field. You could test it to find out.
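
    If you want to test it, a minimal setup looks something like this (the catalog and key-index names are assumptions; note that CONTAINS matches whole words or word prefixes, not arbitrary substrings, so it is not a drop-in replacement for LIKE '%something%'):

    CREATE FULLTEXT CATALOG ftTestCatalog AS DEFAULT;

    -- KEY INDEX must name a unique index, e.g. the new primary key on ID.
    CREATE FULLTEXT INDEX ON dbo.TestTable (ItemCode)
        KEY INDEX PK_TestTable;

    SELECT ID, ItemCode, Descr
    FROM dbo.TestTable
    WHERE CONTAINS(ItemCode, '"something*"');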

    I'd follow Ed's advice.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • OK, thanks for the replies. Something else I wanted to point out: if I take the ID and ItemCode into a new table as the only columns, either as a heap or a clustered index, it's super fast with the wildcard. There are ~800k rows, but we anticipate this growing to 2-5 million. The original table has a lot of columns.

    Could a view help solve this problem or maybe just use a table with less columns as a lookup table if we have to have the wildcard prefix? Just throwing out a few ideas.

  • I tried a view with only the needed columns and that didn't seem to help. The query-specific NCI seems to work pretty well, returning results on the first go in 2-4 seconds. Is that a scalable solution as we grow to millions of rows?

    One thing that I could consider doing is to not offer the wildcard prefix on its own, but offer it when other search criteria are provided to help narrow down the rows to scan. E.g.

    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE 'something%'
    vs.
    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%' AND BrandID=1

    I could add/remove that functionality in our software and present the user with a GUI message that makes that clear. That is super fast currently because of the 800k rows, and there is an index on BrandID (there are around 600 brands, which may grow to 3-5k).

    How do others handle providing wildcard searching with a lot of data!?

  • josh-1127203 - Tuesday, May 16, 2017 1:18 PM

    I tried a view with only the needed columns and that didn't seem to help. The query-specific NCI seems to work pretty well, returning results on the first go in 2-4 seconds. Is that a scalable solution as we grow to millions of rows?

    One thing that I could consider doing is to not offer the wildcard prefix on its own, but offer it when other search criteria are provided to help narrow down the rows to scan. E.g.

    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE 'something%'
    vs.
    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%' AND BrandID=1

    I could add/remove that functionality in our software and present the user with a GUI message that makes that clear. That is super fast currently because of the 800k rows, and there is an index on BrandID (there are around 600 brands, which may grow to 3-5k).

    How do others handle providing wildcard searching with a lot of data!?

    The view wouldn't be of much help because the table still has to be read.

    For why the NCI works, think about what it is:  It's duplicated data.  It's only the data you need and it's arranged in a specific way that meets the need for speed.  Because you limit the number of columns in your index, you don't have to read the entire table / leaf level.  Right out of the gate, you're only reading the columns you have to read.  One key to maintaining good performance on NCIs is to make sure your statistics are updated regularly.  Your DBA should already have something in place for this.  If not, put them to work. 😉
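
    If nothing is scheduled, a manual refresh is one line (standard commands; your DBA will likely want a recurring job instead):

    -- Rebuild statistics for the one table, or sweep the whole database:
    UPDATE STATISTICS dbo.TestTable WITH FULLSCAN;
    EXEC sp_updatestats;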

    When you mentioned narrowing down the rows to scan, it sounds like you're thinking the query is filtering out rows before it executes.  It can't do this.  It has to apply all predicates of the WHERE clause to filter out the rows.  If that means it has to read the whole leaf level to process one of the predicates, then that's what it does.  It's not only what has to be read but also how it's read.

    That brings me to execution plans.  Look at them - they'll tell you what steps are being done.  It's where you'll find your seeks and scans.  If you aren't familiar with them, check out Grant's book on them at http://www.sqlservercentral.com/articles/books/94937/.  There's a lot of good stuff in there and it's well worth the read.
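
    Alongside the graphical plan, these session settings show the raw reads and timings (standard SQL Server switches):

    SET STATISTICS IO ON;    -- logical/physical reads per table
    SET STATISTICS TIME ON;  -- CPU and elapsed milliseconds

    SELECT ID, ItemCode, Descr
    FROM dbo.TestTable
    WHERE ItemCode LIKE '%something%';
    -- In SSMS, Ctrl+M (Include Actual Execution Plan) shows the seeks
    -- and scans alongside the results.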

    If you have the opportunity to eliminate the leading wildcard by default and that'll take care of 90% of your user queries, I'd do it if it wouldn't result in a mutiny.  If the 10% of queries that require the leading wildcard character are forced to specify it in the application, then you've lightened your load significantly and 90% of the users will be happier because their stuff runs faster.

  • Depending on the length and available storage, this could be an option.
    https://sqlperformance.com/2017/02/sql-indexes/seek-leading-wildcard-sql-server
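
    The gist of the article is to store every trailing fragment of the string in a side table, so a leading-wildcard search becomes a seekable prefix search. A rough sketch of the shape (the table, index, and column names here are mine, not the article's; the inline index syntax needs SQL Server 2014+):

    -- One row per suffix of each ItemCode ('ABC1234' yields 'ABC1234',
    -- 'BC1234', 'C1234', '1234', ...), clustered on the fragment.
    CREATE TABLE dbo.ItemCodeFragments
    (
        ID       int          NOT NULL,  -- points back to TestTable.ID
        Fragment varchar(250) NOT NULL,
        INDEX CIX_Fragments CLUSTERED (Fragment, ID)
    );

    -- '%1234%' becomes a seek on the fragment prefix:
    SELECT DISTINCT t.ID, t.ItemCode
    FROM dbo.ItemCodeFragments AS f
    JOIN dbo.TestTable AS t ON t.ID = f.ID
    WHERE f.Fragment LIKE '1234%';

    The trade-off is the extra storage and write overhead for all those fragments, which is why the length of the column and available storage matter.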

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • "When you mentioned narrowing down the rows to scan, it sounds like you're thinking the query is filtering out rows before it executes. It can't do this. It has to apply all predicates of the WHERE clause to filter out the rows. If that means it has to read the whole leaf level to process one of the predicates, then that's what it does. It's not only what has to be read but also how it's read."

    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%' AND BrandID=1

    Yields:
    Index Seek on the nonclustered index (BrandID): 1% of the plan cost
    Key Lookup on the clustered index (ID): 95%
    If I restart SQL and run the above, I get a result back in 1 second.

    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%'

    Yields:
    Clustered Index Scan: 93% of the plan cost
    If I restart SQL and run the above, I get a result back in 12 seconds.

    I believe you that all the predicates still have to be applied, but the two queries show very different results. Am I missing something?

  • josh-1127203 - Tuesday, May 16, 2017 12:42 PM

    OK, thanks for the replies. Something else I wanted to point out: if I take the ID and ItemCode into a new table as the only columns, either as a heap or a clustered index, it's super fast with the wildcard. There are ~800k rows, but we anticipate this growing to 2-5 million. The original table has a lot of columns.

    Could a view help solve this problem or maybe just use a table with less columns as a lookup table if we have to have the wildcard prefix? Just throwing out a few ideas.

    "lots of columns"

    This means you're scanning a very wide table. Taking it down to two columns, the ID & ItemCode means you're scanning a much smaller table. More rows per page, fewer pages, faster behavior.
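
    To put rough numbers on that (row sizes assumed purely for illustration): if a ~100-column row averages ~2,000 bytes, an 8 KB data page holds about 4 rows, so 800k rows is roughly 200,000 pages to scan. ID plus a short ItemCode might be ~50 bytes per row, about 160 rows per page, or roughly 5,000 pages. Same rows, about a fortieth of the I/O.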

    A view is just a query. So will a query on a query run faster? Probably not. If you materialize the view, effectively creating a new table, and you don't have to use the wild card to search, yeah, that could be better. Of course, you're taking on the overhead of maintaining the materialized view.

    Why a wild card? If it's because the ItemCode is overloaded and you need to find a subset, something like 'xxx-12345-yyy' where what you're interested in is '12345', then you might be better off with a calculated column. You can index that and possibly get great performance instead of wild card searches.
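
    A sketch of that idea, assuming ItemCode really is shaped like 'xxx-12345-yyy' with at most three segments (the column and index names are made up):

    -- Pull out the middle segment as a persisted computed column...
    ALTER TABLE dbo.TestTable
        ADD ItemCodeCore AS PARSENAME(REPLACE(ItemCode, '-', '.'), 2) PERSISTED;

    -- ...and index it so the search becomes a seek:
    CREATE NONCLUSTERED INDEX IX_TestTable_ItemCodeCore
        ON dbo.TestTable (ItemCodeCore);

    -- WHERE ItemCodeCore = '12345'   -- or LIKE '12345%'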

    Performance comes down to picking and choosing correct structures too. If this is a third party app, your choices are extremely limited. If you're building it, from the sound of things, lots of columns, wild card searches, I'd say you need to rearchitect the storage. It doesn't sound properly normalized at all.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • josh-1127203 - Tuesday, May 16, 2017 1:18 PM

    I tried a view with only the needed columns and that didn't seem to help. The query-specific NCI seems to work pretty well, returning results on the first go in 2-4 seconds. Is that a scalable solution as we grow to millions of rows?

    One thing that I could consider doing is to not offer the wildcard prefix on its own, but offer it when other search criteria are provided to help narrow down the rows to scan. E.g.

    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE 'something%'
    vs.
    SELECT ID, ItemCode, Descr, etc. FROM TestTable WHERE ItemCode LIKE '%something%' AND BrandID=1

    I could add/remove that functionality in our software and present the user with a GUI message that makes that clear. That is super fast currently because of the 800k rows, and there is an index on BrandID (there are around 600 brands, which may grow to 3-5k).

    How do others handle providing wildcard searching with a lot of data!?

    If you add additional filtering criteria, that can possibly eliminate the scans caused by the wild card entirely, or at least limit them to range scans. However, that means you'll have to index properly, probably on that BrandID column first, then on the ItemCode column as a second key on the index. It's just hard to say.
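
    Something along these lines, as a sketch (the index name is assumed):

    -- Equality on BrandID first, then ItemCode; the optimizer can seek to
    -- the BrandID = 1 range and apply LIKE '%something%' within that range
    -- instead of across the whole table.
    CREATE NONCLUSTERED INDEX IX_TestTable_BrandID_ItemCode
        ON dbo.TestTable (BrandID, ItemCode)
        INCLUDE (Descr);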

    Wild card searches of that type don't work well on lots of data. Your structure should support searches that don't require it. Or, if you are looking at a primarily text-only database, you architect differently and bring other tools to bear (yeah, possibly a fulltext index, but also other tool sets entirely that are made for indexing text and finding anything; Google and Microsoft (Bing) sell search tools based on their tech). That would be what you want if you're looking at very large text-based searches.

    However, I don't think you are. It seems like you have some database design issues to deal with. Just based on the name alone, ItemCode should be a very straightforward field to search against. If it's been overloaded with lots of extra meaning, break that stuff down into multiple columns. Also, no one ever believes me when I say this, but normalization helps performance. Having lots and lots of columns in a single table does not work better than a properly normalized database with good primary & foreign key constraints in place along with proper indexing (and yeah, there are exceptions to this, but exceptions are just that, exceptional).

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks Grant for reviewing and providing great feedback and suggestions. I try to keep table columns as minimal as possible, implementing a star approach; however, this table holds a multitude of data points for parts, so most of the columns relate directly to the individual part and would be difficult to break off into other tables (several price-related columns, dims, on/off toggles, etc.). There are several relationships for Brand, Category, etc. that correlate through a BrandID, CategoryID, etc. to other tables. This is a custom application, and most tables use a primary key with an auto-increment int value. In the application itself, these ID columns are used for most of the joins, page loads, etc.

    The ItemCode column is a part number. Internally in the app, we will call the part by the ID column (e.g. WHERE ID=1); We have several items with similar part numbers so we have the need to search wildcard...e.g. (ABC1234, DEF1234, ZZZ1234 > search for "1234" or "234"). There are close to 100 columns in this table. The only way I could reduce the columns is to split the table out from one table (ITEM) to multiple: ITEM, ITEM_DIMS, ITEM_PRICING, ITEM_TOGGLES, ITEM_ETC. Would it be better to have 8 tables with 10 columns and 2 million rows each or one table with 80 columns and 2 million rows?

    Or, I could maintain the current ITEM table and simply have a lookup table with two columns, search that for wildcard, return the ID to the app and do the actual real table lookup and return results or see how it performs on a join.
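
    That last idea would look roughly like this (the ITEM_SEARCH name is assumed; it duplicates data much the same way a covering nonclustered index would, except you maintain it yourself):

    CREATE TABLE dbo.ITEM_SEARCH
    (
        ID       int          NOT NULL PRIMARY KEY,
        ItemCode varchar(250) NOT NULL
    );

    -- Scan the narrow table, then join back for the wide row:
    SELECT i.*
    FROM dbo.ITEM_SEARCH AS s
    JOIN dbo.ITEM AS i ON i.ID = s.ID
    WHERE s.ItemCode LIKE '%1234%';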

    So far I have as possible solutions:

    - rework table to reduce columns by splitting to multiple tables, thus reducing the scan time
    - discontinue wildcard prefix altogether (or reserve for special circumstances; probably only needed 10% of the time)
    - require add'l query to execute wildcard (e.g. require BrandID=) to help reduce the scan
    - lookup table with two columns
    - non clustered index for this specific query
    - indexed view (materialized view)

  • Grant Fritchey - Tuesday, May 16, 2017 10:55 AM

    Since you have a search like that, it can't possibly use any indexes, so adding one is unlikely to speed things up. At least one contributing factor is that a clustered index has to walk the tree as well as scan all the pages, while a heap just scans the pages. In terms of just reading everything, every time, a heap is likely to be faster. If you can't modify that code to eliminate the need to scan everything, the only way to speed things up will be to get more memory and buy faster disks.

     At least one contributing factor is that a clustered index has to walk the tree as well as scan all the pages, while a heap just scans the pages
    Didn't know that. Every day is a school day with SQL Server. Thanks Grant.

  • josh-1127203 - Tuesday, May 16, 2017 8:19 PM

    - rework table to reduce columns by splitting to multiple tables, thus reducing the scan time
    - lookup table with two columns
    - non clustered index for this specific query
    - indexed view (materialized view)

    I edited out some of the stuff. 

    So, in a star schema you're right, you won't be able to reduce the number of columns. However, that part number thing you have is horrifically overloaded. I would, if anything, break it down into its component parts (in short, add more columns) so that you can search based on the parts rather than have to use the wild card.
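
    For instance, if the part numbers really are a fixed three-letter prefix plus a numeric core ('ABC1234'), something like this (that format is an assumption on my part, as are the column names):

    ALTER TABLE dbo.TestTable
        ADD PartPrefix AS LEFT(ItemCode, 3)           PERSISTED,
            PartNumber AS SUBSTRING(ItemCode, 4, 250) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_TestTable_PartNumber
        ON dbo.TestTable (PartNumber);
    -- "search for 1234" becomes a seekable WHERE PartNumber LIKE '1234%'
    -- (a mid-string search like "234" would still need the fragment-table
    -- idea linked earlier in the thread).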

    All the proposed solutions above are just ways to more or less keep the status quo and simply reduce the size of the scan. I strongly recommend you toss those ideas and focus on mechanisms that can help you eliminate the scans entirely (or reduce them to a range scan only). Remember, the performance of a scan across the entire table is completely dependent on the speed of your disks and the amount of memory available. You will always see degrading performance as the number of rows to be scanned increases. For large (and growing) tables such as yours, mitigating the scan isn't solving the problem.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
