Use of a VARCHAR(MAX) in a Non-Clustered Index?

  • I'm a DBA responsible for both operations and engineering development for our databases. I rarely design tables, only approve or disapprove of what coders think should be in a table. I do, however, have the final say on all index implementations which I create myself.

    Recently, a developer wanted to create a stored procedure that included a VARCHAR(MAX) column in a WHERE clause (without a LIKE reserved word). All of my knowledge and experience tells me that this is going to be very inefficient and resource intensive query. I believe that indexing a VARCHAR(MAX) column is a big mistake.

    Can anyone share explicit experience they have with the implementation of this scenario and the consequences of it? I need a qualified second opinion to give an absolute "NO" to this proposed implementation.

    LC

  • I wouldn't do it.

    As an alternative, maybe you could do a Full-text index on that column.

    It will be an expensive Index (Generally speaking) if you do decide to create a NC for that column.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • You cannot create an index that has a varchar(max) as a key column.

    CREATE TABLE Test (

    id INT,

    StringCol VARCHAR(MAX)

    )

    CREATE INDEX idx_wontwork ON Test (StringCol)

    Msg 1919, Level 16, State 1, Line 2

    Column 'StringCol' in table 'Test' is of a type that is invalid for use as a key column in an index.

    You can add a column of type varchar(max) as an include column, but that isn't seekable, plus will duplicate the column completely (twice the storage)

    Why would someone be doing a direct string comparison (without like) on a varchar(max)? Unless the column has the incorrect data type, that could be thousands of characters of string.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for your reply, Jason.

    And thanks for your indisputable reply, Gail. It will end all debate on the subject. In answer to your "Why?" question, it's because this particular programmer has no concept of the need for efficiency.

    LC

  • crainlee2 (1/15/2010)


    Thanks for your reply, Jason.

    And thanks for your indisputable reply, Gail. It will end all debate on the subject. In answer to your "Why?" question, it's because this particular programmer has no concept of the need for efficiency.

    LC

    Heh... apparently no concept of design, either. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • crainlee2 (1/15/2010)


    Thanks for your reply, Jason.

    And thanks for your indisputable reply, Gail. It will end all debate on the subject. In answer to your "Why?" question, it's because this particular programmer has no concept of the need for efficiency.

    LC

    You're welcome

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • If you really need to do an exact match lookup on a varchar(max) column, you can add a computed column that contains a checksum of the varchar(max) column, and index that. This is a very fast way to do lookups on large strings, but will only work for exact matches. Note that different values can have the same checksum, so you must still do an equality check on the actual column.

    The idea is that a lookup using an index on an integer checksum value will be extremely fast and then the equality check will only have to be made on a small number of duplicate checksums.

    You could also implement this using an indexed view with a checksum or by adding an actual checksum column to the table that has a constraint that forces the value to match the checksum of the varchar(max) column.

    Sample lookup code:

    declare @MyLookupStringvarchar(max)

    set @MyLookupString ='My Lookup value'

    select

    *

    from

    Mytable

    where

    -- Find candidate exact matches using checksum index

    MyVarcharMaxChecksum = checksum(@MyLookupString)and

    -- Verify exact match to lookup value

    MyVarcharMax = @MyLookupString

  • crainlee2 (1/15/2010)


    In answer to your "Why?" question, it's because this particular programmer has no concept of the need for efficiency.

    Nor appropriate data types, I would assume.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Michael Valentine Jones (1/16/2010)


    If you really need to do an exact match lookup on a varchar(max) column, you can add a computed column that contains a checksum of the varchar(max) column, and index that. This is a very fast way to do lookups on large strings, but will only work for exact matches. Note that different values can have the same checksum, so you must still do an equality check on the actual column.

    The idea is that a lookup using an index on an integer checksum value will be extremely fast and then the equality check will only have to be made on a small number of duplicate checksums.

    You could also implement this using an indexed view with a checksum or by adding an actual checksum column to the table that has a constraint that forces the value to match the checksum of the varchar(max) column.

    Michael brings up an excellent method. I just want to emphasize a caution that he also mentioned...

    Because the built in CHECKSUM functions are nothing more than a simple character-by-character "Exclusive OR", be [font="Arial Black"]VERY [/font]aware that two different values can, in fact, have the same checksum. Matching checksums (using the built in CHECKSUM functions) are no guarantee of a perfect match. Non matching checksums are, however, a guarantee that they are not a perfect match. If there is a match (again, using the built in CHECKSUM functions), additional checking [font="Arial Black"]must[/font] still be done to guarantee the match. Of course, using CHECKSUM helps you narrow down the field very quickly as Micheal says.

    If you're using certain other methods to do the hashing (I forget what they're called... MD5???), then the additional checking will not be necessary.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • GilaMonster (1/16/2010)


    crainlee2 (1/15/2010)


    In answer to your "Why?" question, it's because this particular programmer has no concept of the need for efficiency.

    Nor appropriate data types, I would assume.

    Exactly. There is a strong tendency in our Engineering department to use VARCHAR(MAX) datatypes instead of VARCHAR(N) datatypes, which is generally in the category of a long-standing bad habit rather than a design requirement deliberately thought through.

    I'm currently involved the on-going process of educating our developers on T-SQL, SQL Server, and database best practices, explaining why different implementations that appear to produce identical results, are, in execution, more or less efficient than others, and why they should care about it. They're a good group of engineers but like many software developers, they specialize in software algorithms, not database technologies. The process of education on a foreign technology takes time. And old habits die hard.

    It's taken a while but I've finally been allowed to become involved in certain aspects of database and table design. I'm asking developers to defend their design choices, among other things. It's having the intended effect, of educating them on the need to design minimally and carefully, with a lot of fore-thought on exactly how the data is to be accessed and utilized.

    LC

  • Michael and Jeff,

    Thank you for your contributions on using CHECKSUM as a way of searching for equalities on VARCHAR(MAX) datatypes.

    I think there are a few places in our databases where we can put that to use.

    LC

  • You bet... thanks for the feedback, Lee.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • crainlee2 (1/16/2010)


    Michael and Jeff,

    Thank you for your contributions on using CHECKSUM as a way of searching for equalities on VARCHAR(MAX) datatypes.

    I think there are a few places in our databases where we can put that to use.

    LC

    I didn't mention it before, but this is a good way to handle this for long string columns, say varchar(50), instead of putting an index on the column.

  • I really think that it is appropriate to understand what the developer is trying to acomplish.

    Understanding the use-case will point you in the correct direction and allow you to better council the developer and provide better support.

    Without fully understanding the use-case you are really just shooting in the dark.

    As someone already pointed out, a full-text index may be the appropriate soltuion, but without understanding the application there is no way to determine the answer.

  • Jeff Moden (1/16/2010)

    If you're using certain other methods to do the hashing (I forget what they're called... MD5???), then the additional checking will not be necessary.

    Even with MD5 or other hashing methods there's still a possibility of a checksum clash--it might not be as likely as with a simple XOR checksum but it's still possible. The possibility increases the more characters you have in your string, too.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply