String as primary key

  • I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.

    I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.

    Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.

  • Is there a problem just letting them query the column directly?

  • There is no problem letting them query the string directly. I'm mostly interested on the performance, though.

  • Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

    CREATE TABLE dbo.strings (

    string_id int IDENTITY(1, 1),

    string varchar(50)

    )

    In the main table:

    string_id int

    You might also store the string, denormalized, in the main table as well.

    For lookups:

    SELECT ...

    FROM main_table

    WHERE

    string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • I have done this in order to avoid having to create an index on the string column.

    Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.

  • N_Muller (2/26/2015)


    There is no problem letting them query the string directly. I'm mostly interested on the performance, though.

    Well... what's the current performance. Can you post the execution plan? What's the T-SQL query. How many records do you have in that column, and how selective are they.

  • Also, let's look at SQL's missing index and index usage stats for the table:

    USE [<your_db_name_here>] --change to desired db if not already there

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @list_missing_indexes bit

    DECLARE @table_name_pattern sysname

    --NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.

    SET @list_missing_indexes = 1

    SET @table_name_pattern = '<your_table_name_goes_here>' --'%'=all tables.

    --SET @table_name_pattern = '%'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    IF @list_missing_indexes = 1

    BEGIN

    SELECT

    GETDATE() AS capture_date,

    DB_NAME(mid.database_id) AS Db_Name,

    dps.row_count,

    OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,

    mid.equality_columns, mid.inequality_columns, mid.included_columns,

    user_seeks, user_scans, ca1.max_days_active, unique_compiles,

    last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,

    system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,

    mid.statement, mid.object_id, mid.index_handle

    FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)

    CROSS APPLY (

    SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON

    mig.index_handle = mid.index_handle

    LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON

    migs.group_handle = mig.index_group_handle

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = mid.object_id AND

    dps.index_id IN (0, 1)

    --order by

    --DB_NAME, Table_Name, equality_columns

    WHERE

    1 = 1

    AND mid.database_id = DB_ID() --only current db

    AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern

    --AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))

    ORDER BY

    --avg_total_user_cost * (user_seeks + user_scans) DESC,

    Db_Name, Table_Name, equality_columns, inequality_columns

    END --IF

    PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)

    -- list index usage stats (seeks, scans, etc.)

    SELECT

    ius2.row_num, DB_NAME() AS db_name,

    CASE WHEN i.name LIKE ca2.table_name + '%'

    THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)

    ELSE i.name END AS index_name,

    CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +

    CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],

    ca2.table_name,

    i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,

    dps.row_count,

    SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,

    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,

    ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,

    fk.Reference_Count AS fk_ref_count,

    DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date

    ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,

    FILEGROUP_NAME(i.data_space_id) AS filegroup_name,

    ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,

    ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update

    FROM sys.indexes i WITH (NOLOCK)

    INNER JOIN sys.objects o WITH (NOLOCK) ON

    o.object_id = i.object_id

    CROSS JOIN (

    SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'

    ) AS ca1

    CROSS APPLY (

    SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name

    ) AS ca2

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal > 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    ic.key_ordinal

    FOR XML PATH('')

    ) AS key_cols (key_cols)

    OUTER APPLY (

    SELECT

    ', ' + COL_NAME(object_id, ic.column_id)

    FROM sys.index_columns ic

    WHERE

    ic.key_ordinal = 0 AND

    ic.object_id = i.object_id AND

    ic.index_id = i.index_id

    ORDER BY

    COL_NAME(object_id, ic.column_id)

    FOR XML PATH('')

    ) AS nonkey_cols (nonkey_cols)

    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON

    dps.object_id = i.object_id AND

    dps.index_id = i.index_id

    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON

    ius.database_id = DB_ID() AND

    ius.object_id = i.object_id AND

    ius.index_id = i.index_id

    LEFT OUTER JOIN (

    SELECT

    database_id, object_id, MAX(user_scans) AS user_scans,

    ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans

    FROM sys.dm_db_index_usage_stats WITH (NOLOCK)

    WHERE

    database_id = DB_ID()

    --AND index_id > 0

    GROUP BY

    database_id, object_id

    ) AS ius2 ON

    ius2.database_id = DB_ID() AND

    ius2.object_id = i.object_id

    LEFT OUTER JOIN (

    SELECT

    referenced_object_id, COUNT(*) AS Reference_Count

    FROM sys.foreign_keys

    WHERE

    is_disabled = 0

    GROUP BY

    referenced_object_id

    ) AS fk ON

    fk.referenced_object_id = i.object_id

    WHERE

    i.object_id > 100 AND

    i.is_hypothetical = 0 AND

    i.type IN (0, 1, 2) AND

    o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND

    (

    o.name LIKE @table_name_pattern AND

    o.name NOT LIKE 'dtprop%' AND

    o.name NOT LIKE 'filestream[_]' AND

    o.name NOT LIKE 'MSpeer%' AND

    o.name NOT LIKE 'MSpub%' AND

    --o.name NOT LIKE 'queue[_]%' AND

    o.name NOT LIKE 'sys%'

    )

    --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')

    ORDER BY

    --row_count DESC,

    --ius.user_scans DESC,

    --ius2.row_num, --user_scans&|user_seeks

    db_name, table_name,

    -- list clustered index first, if any, then other index(es)

    CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,

    key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    SET DEADLOCK_PRIORITY NORMAL

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (2/26/2015)


    Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

    CREATE TABLE dbo.strings (

    string_id int IDENTITY(1, 1),

    string varchar(50)

    )

    In the main table:

    string_id int

    You might also store the string, denormalized, in the main table as well.

    For lookups:

    SELECT ...

    FROM main_table

    WHERE

    string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)

    Curious, but where do you get this:

    Clustering on identity is often very damaging to overall performance.

    I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

  • Relatively static data can be clustered on any column that fits your querying criteria. However if you deal with a high transaction level OLTP where new records associated with the primary read-filtering column are added at high rates, you want to make sure you don't degrade write performance. As of today, I've been doing this by making the clustered index an ever-increasing column, be that an int or bigint identity or GUID column. I'm open to suggestions that don't degrade write performance and improves read performance.

  • Michael Valentine Jones (2/26/2015)


    I have done this in order to avoid having to create an index on the string column.

    Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.

    This is what I have done so far. Checksum doesn't guarantee uniqueness, but is quite fast. I added an index on the computed checksum column (thus the need for a persistent computed column). Depending on the scenario, I would add the string itself in the "include" portion of the index.

  • My concern would be how the function is applied to arrive at the checksum. But I'm with everyone else, why not just let them hit the string. Also, I'd consider looking at that as the clustered index since it's unique. You're saying it's the most commonly used access path to the data. It might be the best choice for the cluster. Is the identity column used for searches, or just as a unique value?

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • N_Muller (2/26/2015)


    I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.

    I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.

    Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.

    Let's back the performance cart up a bit here. You say the data in the string column is a "unique identifier". What precisely does that mean? Is it a GUID? If so, then it should absolutely NOT be the clustered index. Is the string ever increasing in value? If not, this it should probably not be the clustered index here, either.

    What kind of data does this string have in it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lynn Pettis (2/26/2015)


    ScottPletcher (2/26/2015)


    Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

    CREATE TABLE dbo.strings (

    string_id int IDENTITY(1, 1),

    string varchar(50)

    )

    In the main table:

    string_id int

    You might also store the string, denormalized, in the main table as well.

    For lookups:

    SELECT ...

    FROM main_table

    WHERE

    string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)

    Curious, but where do you get this:

    Clustering on identity is often very damaging to overall performance.

    I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

    Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • ScottPletcher (2/27/2015)


    Lynn Pettis (2/26/2015)


    ScottPletcher (2/26/2015)


    Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

    CREATE TABLE dbo.strings (

    string_id int IDENTITY(1, 1),

    string varchar(50)

    )

    In the main table:

    string_id int

    You might also store the string, denormalized, in the main table as well.

    For lookups:

    SELECT ...

    FROM main_table

    WHERE

    string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)

    Curious, but where do you get this:

    Clustering on identity is often very damaging to overall performance.

    I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

    Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

    I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.

  • Lynn Pettis (2/27/2015)


    ScottPletcher (2/27/2015)


    Lynn Pettis (2/26/2015)


    ScottPletcher (2/26/2015)


    Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.

    For example:

    CREATE TABLE dbo.strings (

    string_id int IDENTITY(1, 1),

    string varchar(50)

    )

    In the main table:

    string_id int

    You might also store the string, denormalized, in the main table as well.

    For lookups:

    SELECT ...

    FROM main_table

    WHERE

    string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))

    Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:

    WHERE string IN (...list_of_string_values_to_match...)

    Curious, but where do you get this:

    Clustering on identity is often very damaging to overall performance.

    I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.

    Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.

    I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.

    But this is necessary only because it was improperly clustered to begin with. All the more reason to never allow a "default" clustering key of identity, and instead always base the clustering key on the that specific table's actual usage.

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply