February 26, 2015 at 12:07 pm
I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.
I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.
Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.
February 26, 2015 at 12:20 pm
Is there a problem just letting them query the column directly?
February 26, 2015 at 12:29 pm
There is no problem letting them query the string directly. I'm mostly interested on the performance, though.
February 26, 2015 at 1:25 pm
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.
For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 26, 2015 at 1:40 pm
I have done this in order to avoid having to create an index on the string column.
Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.
February 26, 2015 at 1:56 pm
N_Muller (2/26/2015)
There is no problem letting them query the string directly. I'm mostly interested on the performance, though.
Well... what's the current performance. Can you post the execution plan? What's the T-SQL query. How many records do you have in that column, and how selective are they.
February 26, 2015 at 2:00 pm
Also, let's look at SQL's missing index and index usage stats for the table:
USE [<your_db_name_here>] --change to desired db if not already there
SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case
DECLARE @list_missing_indexes bit
DECLARE @table_name_pattern sysname
--NOTE: showing missing indexes can take some time; set to 0 if you don't want to wait.
SET @list_missing_indexes = 1
SET @table_name_pattern = '<your_table_name_goes_here>' --'%'=all tables.
--SET @table_name_pattern = '%'
PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)
IF @list_missing_indexes = 1
BEGIN
SELECT
GETDATE() AS capture_date,
DB_NAME(mid.database_id) AS Db_Name,
dps.row_count,
OBJECT_NAME(mid.object_id /*, mid.database_id*/) AS Table_Name,
mid.equality_columns, mid.inequality_columns, mid.included_columns,
user_seeks, user_scans, ca1.max_days_active, unique_compiles,
last_user_seek, last_user_scan, avg_total_user_cost, avg_user_impact,
system_seeks, system_scans, last_system_seek, last_system_scan, avg_total_system_cost, avg_system_impact,
mid.statement, mid.object_id, mid.index_handle
FROM sys.dm_db_missing_index_details mid WITH (NOLOCK)
CROSS APPLY (
SELECT DATEDIFF(DAY, create_date, GETDATE()) AS max_days_active FROM sys.databases WHERE name = 'tempdb'
) AS ca1
LEFT OUTER JOIN sys.dm_db_missing_index_groups mig WITH (NOLOCK) ON
mig.index_handle = mid.index_handle
LEFT OUTER JOIN sys.dm_db_missing_index_group_stats migs WITH (NOLOCK) ON
migs.group_handle = mig.index_group_handle
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
dps.object_id = mid.object_id AND
dps.index_id IN (0, 1)
--order by
--DB_NAME, Table_Name, equality_columns
WHERE
1 = 1
AND mid.database_id = DB_ID() --only current db
AND OBJECT_NAME(mid.object_id) LIKE @table_name_pattern
--AND mid.object_id IN (OBJECT_ID('<table_name_1>'), OBJECT_ID('<table_name_2>'))
ORDER BY
--avg_total_user_cost * (user_seeks + user_scans) DESC,
Db_Name, Table_Name, equality_columns, inequality_columns
END --IF
PRINT 'Midpoint @ ' + CONVERT(varchar(30), GETDATE(), 120)
-- list index usage stats (seeks, scans, etc.)
SELECT
ius2.row_num, DB_NAME() AS db_name,
CASE WHEN i.name LIKE ca2.table_name + '%'
THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1, 200)
ELSE i.name END AS index_name,
CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
ca2.table_name,
i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
dps.row_count,
SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
fk.Reference_Count AS fk_ref_count,
DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
o.object_id = i.object_id
CROSS JOIN (
SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
) AS ca1
CROSS APPLY (
SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
) AS ca2
OUTER APPLY (
SELECT
', ' + COL_NAME(object_id, ic.column_id)
FROM sys.index_columns ic
WHERE
ic.key_ordinal > 0 AND
ic.object_id = i.object_id AND
ic.index_id = i.index_id
ORDER BY
ic.key_ordinal
FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
SELECT
', ' + COL_NAME(object_id, ic.column_id)
FROM sys.index_columns ic
WHERE
ic.key_ordinal = 0 AND
ic.object_id = i.object_id AND
ic.index_id = i.index_id
ORDER BY
COL_NAME(object_id, ic.column_id)
FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
dps.object_id = i.object_id AND
dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
ius.database_id = DB_ID() AND
ius.object_id = i.object_id AND
ius.index_id = i.index_id
LEFT OUTER JOIN (
SELECT
database_id, object_id, MAX(user_scans) AS user_scans,
ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
WHERE
database_id = DB_ID()
--AND index_id > 0
GROUP BY
database_id, object_id
) AS ius2 ON
ius2.database_id = DB_ID() AND
ius2.object_id = i.object_id
LEFT OUTER JOIN (
SELECT
referenced_object_id, COUNT(*) AS Reference_Count
FROM sys.foreign_keys
WHERE
is_disabled = 0
GROUP BY
referenced_object_id
) AS fk ON
fk.referenced_object_id = i.object_id
WHERE
i.object_id > 100 AND
i.is_hypothetical = 0 AND
i.type IN (0, 1, 2) AND
o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
(
o.name LIKE @table_name_pattern AND
o.name NOT LIKE 'dtprop%' AND
o.name NOT LIKE 'filestream[_]' AND
o.name NOT LIKE 'MSpeer%' AND
o.name NOT LIKE 'MSpub%' AND
--o.name NOT LIKE 'queue[_]%' AND
o.name NOT LIKE 'sys%'
)
--AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
ORDER BY
--row_count DESC,
--ius.user_scans DESC,
--ius2.row_num, --user_scans&|user_seeks
db_name, table_name,
-- list clustered index first, if any, then other index(es)
CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
key_cols
PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)
SET DEADLOCK_PRIORITY NORMAL
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 26, 2015 at 6:10 pm
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
February 27, 2015 at 12:22 am
Relatively static data can be clustered on any column that fits your querying criteria. However if you deal with a high transaction level OLTP where new records associated with the primary read-filtering column are added at high rates, you want to make sure you don't degrade write performance. As of today, I've been doing this by making the clustered index an ever-increasing column, be that an int or bigint identity or GUID column. I'm open to suggestions that don't degrade write performance and improves read performance.
February 27, 2015 at 12:33 am
Michael Valentine Jones (2/26/2015)
I have done this in order to avoid having to create an index on the string column.Since this column is a unique identifier, you would probably still need a unique index or or unique constraint, so I doubt you would see much benefit.
This is what I have done so far. Checksum doesn't guarantee uniqueness, but is quite fast. I added an index on the computed checksum column (thus the need for a persistent computed column). Depending on the scenario, I would add the string itself in the "include" portion of the index.
February 27, 2015 at 6:18 am
My concern would be how the function is applied to arrive at the checksum. But I'm with everyone else, why not just let them hit the string. Also, I'd consider looking at that as the clustered index since it's unique. You're saying it's the most commonly used access path to the data. It might be the best choice for the cluster. Is the identity column used for searches, or just as a unique value?
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
February 27, 2015 at 8:53 am
N_Muller (2/26/2015)
I have tables in database where a VARCHAR(50) string is unique identifier. The database currently has an integer identity column as clustered primary key, and there's a non-clustered index on string column. The customer always queries based on a defined set of the identifier (string) column.I wonder if someone sees an advantage of adding a persisted computed column to the table as the checksum of the string column, and then create a non-clustered index on the checksum and the string. When a customer requests data, we would compute the checksum of the customer provided identifier and add to the where clause or join, that the checksum and string must match.
Will SQL Server perform checksum check (integer) and only if it succeeds, perform the string check, in which case I see an advantage of added the checksum column? Or will SQL Server always check for both the checksum and string, in which case the additional column only adds unnecessary overhead? To note is the fact that the table(s) will have millions of rows, but the customer will request data for at at most, 100 or so identifiers.
Let's back the performance cart up a bit here. You say the data in the string column is a "unique identifier". What precisely does that mean? Is it a GUID? If so, then it should absolutely NOT be the clustered index. Is the string ever increasing in value? If not, this it should probably not be the clustered index here, either.
What kind of data does this string have in it?
--Jeff Moden
Change is inevitable... Change for the better is not.
February 27, 2015 at 9:25 am
Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
February 27, 2015 at 10:40 am
ScottPletcher (2/27/2015)
Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.
I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.
February 27, 2015 at 10:46 am
Lynn Pettis (2/27/2015)
ScottPletcher (2/27/2015)
Lynn Pettis (2/26/2015)
ScottPletcher (2/26/2015)
Either cluster directly on the string itself. Or, if you don't want to do, encode the string into an int (or smallint if you have less than ~20,000 values) and cluster on that. Clustering on identity is often very damaging to overall performance.For example:
CREATE TABLE dbo.strings (
string_id int IDENTITY(1, 1),
string varchar(50)
)
In the main table:
string_id int
You might also store the string, denormalized, in the main table as well.
For lookups:
SELECT ...
FROM main_table
WHERE
string_id IN (SELECT string_id FROM strings WHERE string IN ('<list of string values to lookup'>))
Edit: This code is intended only for use IF you decide to use an encoded number to represent the string value in the lead column of the clustering index. If you use the actual string itself, just code the WHERE condition normally:
WHERE string IN (...list_of_string_values_to_match...)
Curious, but where do you get this:
Clustering on identity is often very damaging to overall performance.
I know this was true with SQL Server 6.x and maybe 7.x. When SQL Server introduce row level locking the issue of a hot spot during inserts was significantly reduced.
Not for the INSERTs, for the subsequent reads. Identity generally works very well for INSERTs, but may force large numbers of "covering" nonclustered indexes and/or many key lookup queries where a clustering key that better matched the data queries would result in much more efficient lookups. And the ratio of SELECTs to INSERTs is often 1000x or more.
I am going to say "it depends." You also have to look at the environment you are running as well. One thing to look at is if you are also mirroring to an HR site over a slow WAN connection. If you have to rebuild a clustered index in that situation you could flood the wan connection.
But this is necessary only because it was improperly clustered to begin with. All the more reason to never allow a "default" clustering key of identity, and instead always base the clustering key on the that specific table's actual usage.
SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".
Viewing 15 posts - 1 through 15 (of 30 total)
You must be logged in to reply to this topic. Login to reply