Heap vs Clustered Wildcard Search

  • Ed Wagner - Thursday, May 25, 2017 10:45 AM

    ScottPletcher - Thursday, May 25, 2017 10:33 AM

    Just adding a clustered index shouldn't slow it down that much, unless perhaps the fillfactor is (way) too low.  Be sure to explicitly specify something like 95+% for the fillfactor when creating the clustered index.

    If you have a more natural key that is used for searches, such as BrandID, cluster on it first, then on identity (just to make the clus key unique).

    Or, leave the identity in place and create an NCI on BrandID and whatever other always-used, highly-selective columns being the leading columns.  Creating the clustering key on (BrandID, identity) would result in CI key values that are not ever-increasing.  It also means that each NCI would inherit more bytes from the clustering key than necessary, increasing the number of pages to store each NCI.

    Possibly, but it's more complicated than that.  The best clustering key is the most important performance factor, and an extremely general rule about "short, ever-increasing, etc" is cute but it's not the be-all and end-all of proper key selection.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • It sounds as if the design of your table may not be correct.  These are items, possibly an inventory system????

    The combination of the manufacturers code, which you may need to create, and the manufacturers part number should be unique.
    The relevant columns, as I see it, having designed a few inventory systems are as follows:
    Item_ID = identity works.  
    Manufacturers_Code = Probably an int referring to the manufacturers table
    Manufacturers_Part_Number = probably a character field, this value is provided by the manufacturer
    These two fields should be unique, and should be a unique index. I would make the clustered primary key the item_id, unless you have very few inserts of items occurring.  That will help with fragmentation. 
    Manufacturers_Description = this value is provided by the manufacturer
    Our_Description = Your description.  The manufacturers description may be 1.5x6 nip.  Users may need to see 1 1/2 x 6 Black Pipe Nipple

    There are also probably industry or company specific columns that may need to be added, and things like MSDS, maybe a link to the picture(s). 

    Pricing information should not be in this table. That should be a separate table.  You need to maintain a history of this.
    Packing info should also be in another table. Such as each, case of 6, case of 144 and so forth.  That is usually tied to pricing.  
    Instructions and documentation should also be in a separate table.  Typically, things like this are the same for multiple items from a given manufacturer.  Think of the instructions you get in a package.  
    You may need column(s) for location.  BUT, this may need to be in a separate table if items can be stored in multiple places. 

    Do you have situations such as related items?  
    A few examples. You can't get a diamond core bit without the shank.  That relationship needs to be created. 
    Cross-related items?  A garden hose, and the hose nozzles. 
    Substitute items? Eveready D cell can be substituted for Duracell D Cell.  
    To further muddy it up, the individual clients may define these things. 

    If there are commodity items such as dimensional lumber, you may now have to handle things like a white wood 2x4x8. This may come from many different manufacturers, but the item needs to be the same item number.  And the cost information may need averaged, the highest used, or something else. 

    I'm also willing to bet that this query you are struggling with is for a "Parts Search".  I suggest you find out what the users are searching on.  In the system we designed, there were 150+ different criteria that the users could search on.  Digging in and gathering statistics, we figured out that they used the manufacturers code almost every time. 
    We tuned the code and tables to search on that field.  Ta Da.  Searches now took milliseconds.  If they used one of the other criteria, we implemented dynamic SQL.  Slower, but still faster that the old code that tried to handle every possible situation every time it ran. 

    Ed Wagner, you saw this presentation at SQL Saturday a few years ago in Cleveland.  The whole thing was based upon an item table design and the subsequent search.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John - Thursday, May 25, 2017 11:38 AM

    It sounds as if the design of your table may not be correct.  These are items, possibly an inventory system????

    The combination of the manufacturers code, which you may need to create, and the manufacturers part number should be unique.
    The relevant columns, as I see it, having designed a few inventory systems are as follows:
    Item_ID = identity works.  
    Manufacturers_Code = Probably an int referring to the manufacturers table
    Manufacturers_Part_Number = probably a character field, this value is provided by the manufacturer
    These two fields should be unique, and should be a unique index. I would make the clustered primary key the item_id, unless you have very few inserts of items occurring.  That will help with fragmentation. 
    Manufacturers_Description = this value is provided by the manufacturer
    Our_Description = Your description.  The manufacturers description may be 1.5x6 nip.  Users may need to see 1 1/2 x 6 Black Pipe Nipple

    There are also probably industry or company specific columns that may need to be added, and things like MSDS, maybe a link to the picture(s). 

    Pricing information should not be in this table. That should be a separate table.  You need to maintain a history of this.
    Packing info should also be in another table. Such as each, case of 6, case of 144 and so forth.  That is usually tied to pricing.  
    Instructions and documentation should also be in a separate table.  Typically, things like this are the same for multiple items from a given manufacturer.  Think of the instructions you get in a package.  
    You may need column(s) for location.  BUT, this may need to be in a separate table if items can be stored in multiple places. 

    Do you have situations such as related items?  
    A few examples. You can't get a diamond core bit without the shank.  That relationship needs to be created. 
    Cross-related items?  A garden hose, and the hose nozzles. 
    Substitute items? Eveready D cell can be substituted for Duracell D Cell.  
    To further muddy it up, the individual clients may define these things. 

    If there are commodity items such as dimensional lumber, you may now have to handle things like a white wood 2x4x8. This may come from many different manufacturers, but the item needs to be the same item number.  And the cost information may need averaged, the highest used, or something else. 

    I'm also willing to bet that this query you are struggling with is for a "Parts Search".  I suggest you find out what the users are searching on.  In the system we designed, there were 150+ different criteria that the users could search on.  Digging in and gathering statistics, we figured out that they used the manufacturers code almost every time. 
    We tuned the code and tables to search on that field.  Ta Da.  Searches now took milliseconds.  If they used one of the other criteria, we implemented dynamic SQL.  Slower, but still faster that the old code that tried to handle every possible situation every time it ran. 

    Ed Wagner, you saw this presentation at SQL Saturday a few years ago in Cleveland.  The whole thing was based upon an item table design and the subsequent search.

    Yes, I remember that.  IIRC, you had an absolutely dizzying search form where the user could enter everything under the sun and then add two more search criteria.

  • Michael L John - Thursday, May 25, 2017 11:38 AM

    It sounds as if the design of your table may not be correct.  These are items, possibly an inventory system????

    The combination of the manufacturers code, which you may need to create, and the manufacturers part number should be unique.
    The relevant columns, as I see it, having designed a few inventory systems are as follows:
    Item_ID = identity works.  
    Manufacturers_Code = Probably an int referring to the manufacturers table
    Manufacturers_Part_Number = probably a character field, this value is provided by the manufacturer
    These two fields should be unique, and should be a unique index. I would make the clustered primary key the item_id, unless you have very few inserts of items occurring.  That will help with fragmentation. 
    Manufacturers_Description = this value is provided by the manufacturer
    Our_Description = Your description.  The manufacturers description may be 1.5x6 nip.  Users may need to see 1 1/2 x 6 Black Pipe Nipple

    There are also probably industry or company specific columns that may need to be added, and things like MSDS, maybe a link to the picture(s). 

    Pricing information should not be in this table. That should be a separate table.  You need to maintain a history of this.
    Packing info should also be in another table. Such as each, case of 6, case of 144 and so forth.  That is usually tied to pricing.  
    Instructions and documentation should also be in a separate table.  Typically, things like this are the same for multiple items from a given manufacturer.  Think of the instructions you get in a package.  
    You may need column(s) for location.  BUT, this may need to be in a separate table if items can be stored in multiple places. 

    Do you have situations such as related items?  
    A few examples. You can't get a diamond core bit without the shank.  That relationship needs to be created. 
    Cross-related items?  A garden hose, and the hose nozzles. 
    Substitute items? Eveready D cell can be substituted for Duracell D Cell.  
    To further muddy it up, the individual clients may define these things. 

    If there are commodity items such as dimensional lumber, you may now have to handle things like a white wood 2x4x8. This may come from many different manufacturers, but the item needs to be the same item number.  And the cost information may need averaged, the highest used, or something else. 

    I'm also willing to bet that this query you are struggling with is for a "Parts Search".  I suggest you find out what the users are searching on.  In the system we designed, there were 150+ different criteria that the users could search on.  Digging in and gathering statistics, we figured out that they used the manufacturers code almost every time. 
    We tuned the code and tables to search on that field.  Ta Da.  Searches now took milliseconds.  If they used one of the other criteria, we implemented dynamic SQL.  Slower, but still faster that the old code that tried to handle every possible situation every time it ran. 

    Ed Wagner, you saw this presentation at SQL Saturday a few years ago in Cleveland.  The whole thing was based upon an item table design and the subsequent search.

    Yes, indeed an inventory system. All of the internal lookups by the application are on the "ID" column for each table. E.G. ItemID, BrandID, etc. I have been clustering on those however this particular search is for an ItemCode (aka sku). We create our own ItemCode based on the brand part number and a brand prefix code. So part number 123 will end up being ABC123 (ABC is the brand prefix code). Still the system uses the integer "ID" column to do the joins and lookups for page loads (e.g. viewItem.aspx?ItemID=1).

    I've got something worked out for "containers" by having a separate table and a container indicator in case something needs to be sold together. Pricing and packing dims ARE in this table which is why we have such a large column count. I am considering moving some of that info out as suggested in the near future. E.G.
    ITEM
    ITEM_DIMS
    ITEM_PRICE
    ITEM_ETC

    Locations are kept in a separate table with the ItemID, LocationID and Qty.

    Thanks for all of the great info and suggestion!

  • josh-1127203 - Tuesday, May 16, 2017 12:42 PM

    OK, thanks for the replies. Something else I wanted to point out is that if I take the ID and ItemCode into a new table as the only columns, either as a heap or clustered index, it's super fast with the wildcard. There are ~ 800k rows, but we anticipate this to grow to between 2-5 million. The original table has a lot of columns.

    Go for it.

    I was about suggesting this approach, but I see you've got there by yourself.

    The reason why the heap was superfast is that the columns used in the query (the text one and ID) were placed in memory after the 1st search, and all subsequent searches were going against the buffered data.

    A separate lookup table will be emulating your "heap" behavior, and the join by a PK will retrieve the rest of the relevant data in no time.

    A view won't help there, imho.

    _____________
    Code for TallyGenerator

  • josh-1127203 - Wednesday, May 17, 2017 9:47 AM

    I wouldn't say that the part number thing is horribly overloaded, but I understand where you are going with that. We take the manufacturer part number (123) and add a prefix (ABC or DEF, etc) to create a unique ItemCode (ABC123) in our system. I am looking for a long term solution here, not a band aide so I certainly appreciate you pushing to look at the underlying issue. With that said, I could add and add'l column for MfrPartNumber and put the second part of the ItemCode in there, then index both columns and remove the wildcard prefix.

    E.G.
    ItemCode: ABC123
    MfrPN: 123
    ItemCode: DEF123
    MfrPN: 123

    Would that be the optimal approach that you would go after for a long term solution?

    And to complicate matters, on the same subject, we also have a Descr column that holds a product description (e.g. "Red Widget with Triangle Pieces"); we also have a search for this column (optional) whereas we may need to lookup all things with Triangle. Is this column best served with Full Text Index?

    Again, thanks in advance for your feedback. It's much appreciated.

    You should not be merging different codes into one.
    You're losing data by doing this.
    If you're searching for a manufacturer code 456 you don't want to see a part CDE23456.
    To find what exactly you're looking for you need to know where is the separation point between 2 merged codes, and for that - you need to split them back. Which is way more complicated than to have them stored separately in the first place.

    Not to mention - data normalization significantly reduce the number of records you need to search in.

    _____________
    Code for TallyGenerator

  • Sergiy - Wednesday, June 7, 2017 5:51 AM

    You should not be merging different codes into one.
    You're losing data by doing this.
    If you're searching for a manufacturer code 456 you don't want to see a part CDE23456.
    To find what exactly you're looking for you need to know where is the separation point between 2 merged codes, and for that - you need to split them back. Which is way more complicated than to have them stored separately in the first place.

    Not to mention - data normalization significantly reduce the number of records you need to search in.

    Agreed.  Keep them separate.  You can always combine them later in the application code or as a calculated column.  If you try to split them later instead, you risk data quality problems.

    I would also opine that your manufacturer code should be stored as a character string.  Manufacturer Code 0456 is not the same as 456, but they will be if stored as an integer.  This takes some extra storage, but can prevent logic bugs when working with the data later.

    Wes
    (A solid design is always preferable to a creative workaround)

Viewing 7 posts - 31 through 36 (of 36 total)

You must be logged in to reply to this topic. Login to reply