October 23, 2015 at 9:10 am
Hi all
First, some background.......
We have some data that uses codes of 5 characters.
Some of these codes appear to be identical but have a different description.
For example....
CTV3Code   CTV3Text
XaIUi      MRC Breathlessness Scale: grade 1
XaIui      Other white British ethnic group
XaIUI      Actions
These are different because of the different cases of the letters in the code (we can't change the codes as they are given to us to report on).
I want to find a way to prove/disprove a suggestion that a correctly collated column would improve performance as opposed to using the COLLATE statement in the code.
For example, we could have a query that looks like this:-
select CTV3Code
from tblSRCode
where [CTV3Code] = 'XaIui'
Now, because the column is currently collated as case-insensitive, we would get all 3 codes as above.
If I wanted just XaIui (the middle one), I would have to change the code to
select CTV3Code
from tblSRCode
where [CTV3Code] collate Latin1_General_CS_AS = 'XaIui'
(unless someone can think of another way?).
I want to prove/disprove that this is more expensive in terms of processing (CPU, etc.) than using a correctly collated column but I can't think how to do it.
Any help anyone can give on this would be greatly appreciated.
October 26, 2015 at 10:28 am
The only way to prove it is to test it. Put two copies of your table in a test environment, one with the current collation and the other with the proposed collation, and speed test your queries against both versions of the table.
That gives you the ability to go back to management and say, "Here... This took this long and this took this long..."
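If it helps, a rough harness for that kind of side-by-side test might look something like this (just a sketch; tblSRCode_Current and tblSRCode_Proposed are placeholder names for the two copies):
-- Sketch only: time the same lookup against both copies of the table
DECLARE @t datetime2 = SYSDATETIME();
SELECT CTV3Code
FROM dbo.tblSRCode_Current                               -- current (case-insensitive) collation
WHERE CTV3Code COLLATE Latin1_General_CS_AS = 'XaIui';
SELECT DATEDIFF(MILLISECOND, @t, SYSDATETIME()) AS ms_with_collate_clause;
SET @t = SYSDATETIME();
SELECT CTV3Code
FROM dbo.tblSRCode_Proposed                              -- column collated Latin1_General_CS_AS
WHERE CTV3Code = 'XaIui';
SELECT DATEDIFF(MILLISECOND, @t, SYSDATETIME()) AS ms_with_collated_column;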
October 26, 2015 at 11:39 am
First, I'm guessing that running the query with COLLATE Latin1_General_CS_AS will be about three times faster than the case-insensitive version, simply because the case-insensitive query returns three times as many records. It's also worth noting that this isn't a good data model (but I know we don't always have control over that).
All that said, if "XaIUi" is not the same as "XaIui", then design your table that way. Set the collation on the column and you can avoid that COLLATE clause in your WHERE clause. I put together some sample data and a performance test. I created two tables, one with a case-sensitive column and one with a case-insensitive column (relying on my default collation settings, which are case-insensitive).
USE tempdb
GO
SET NOCOUNT ON
GO
/******************************************************************************
1. Create a case-sensitive and case-insensitive version of the table
******************************************************************************/;
IF OBJECT_ID('dbo.tblSRCode_CI') IS NOT NULL DROP TABLE dbo.tblSRCode_CI;
IF OBJECT_ID('dbo.tblSRCode_CS') IS NOT NULL DROP TABLE dbo.tblSRCode_CS;
CREATE TABLE dbo.tblSRCode_CI
(
tID int identity primary key,
CTV3Code varchar(10) NOT NULL
);
-- Case Insensitive Nonclustered Index
CREATE INDEX nc_CTV3Code_CI ON dbo.tblSRCode_CI(CTV3Code);
CREATE TABLE dbo.tblSRCode_CS
(
tID int identity primary key,
CTV3Code varchar(10) collate Latin1_General_CS_AS NOT NULL
);
-- Case sensitive Nonclustered Index
CREATE INDEX nc_CTV3Code_CS ON dbo.tblSRCode_CS(CTV3Code);
GO
/******************************************************************************
2. Populate both tables with sample data
******************************************************************************/;
-- Populate the CI table
WITH
Nums(N) AS
(
SELECT TOP (1000000) ABS(CHECKSUM(newid())%6)+1
FROM sys.all_columns a, sys.all_columns b
),
CTV3Codes AS
(
SELECT A = ROW_NUMBER() OVER (ORDER BY newid()), CTV3Code
FROM (VALUES ('XaIUi'),('XaIui'),('XaIUI'),('BaCDX'),('bacDX'),('BacDX')) t(CTV3Code)
)
INSERT dbo.tblSRCode_CI
SELECT CTV3Code FROM Nums CROSS APPLY CTV3Codes WHERE N = A;
-- Populate the CS table
WITH
Nums(N) AS
(
SELECT TOP (1000000) ABS(CHECKSUM(newid())%6)+1
FROM sys.all_columns a, sys.all_columns b
),
CTV3Codes AS
(
SELECT A = ROW_NUMBER() OVER (ORDER BY newid()), CTV3Code
FROM (VALUES ('XaIUi'),('XaIui'),('XaIUI'),('BaCDX'),('bacDX'),('BacDX')) t(CTV3Code)
)
INSERT dbo.tblSRCode_CS
SELECT CTV3Code FROM Nums CROSS APPLY CTV3Codes WHERE N = A;
Now let's compare the query plans for these three queries:
SELECT CTV3Code
FROM dbo.tblSRCode_CI
WHERE CTV3Code = 'XaIUi'
SELECT CTV3Code
FROM dbo.tblSRCode_CI
WHERE CTV3Code = 'XaIUi' collate Latin1_General_CS_AS;
SELECT CTV3Code
FROM dbo.tblSRCode_CS
WHERE CTV3Code = 'XaIUi'
Ignore the % estimation, because the optimizer has no idea that the 2nd query returns only 1/3 as many rows. The important thing to note is that the COLLATE clause forces an index scan instead of an index seek; it also causes an implicit conversion. The third query, against the table where we set the collation on the column, gets an index seek and accomplishes the same thing as the 2nd query that uses the COLLATE clause.
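If you want to see that without opening the graphical plan, something along these lines should show the seek vs. scan (and the implicit conversion in the predicate) as plan text - just a sketch, run in its own batch:
SET SHOWPLAN_TEXT ON;
GO
-- CI table with the COLLATE clause: expect an index scan with CONVERT_IMPLICIT in the predicate
SELECT CTV3Code FROM dbo.tblSRCode_CI WHERE CTV3Code = 'XaIUi' COLLATE Latin1_General_CS_AS;
-- CS table, no COLLATE needed: expect an index seek
SELECT CTV3Code FROM dbo.tblSRCode_CS WHERE CTV3Code = 'XaIUi';
GO
SET SHOWPLAN_TEXT OFF;
GO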
Now let's check the performance. Ignore the "elapsed time" and focus on the CPU time (I ran the queries with results turned on). You'll notice that the first query is about 30% faster and generates half as many reads as the one with the COLLATE clause, even though it returns 3X as many rows. The query against the table that we made case sensitive (the 3rd query) uses about 1/4 of the CPU and generates about 1/6 as many reads as the query with the COLLATE clause.
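The output below is what SET STATISTICS IO and SET STATISTICS TIME produce; wrap the three queries like this to reproduce it:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run each of the three SELECT statements here
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;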
CI without collate (will return more records):
Table 'tblSRCode_CI'. Scan count 1, logical reads 1182, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 1414 ms.
--------------------------------------------------
CI with collate:
Table 'tblSRCode_CI'. Scan count 1, logical reads 2359, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 585 ms.
--------------------------------------------------
CS:
Table 'tblSRCode_CS'. Scan count 1, logical reads 396, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms, elapsed time = 508 ms.
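One last note: if you end up changing the collation on the real column rather than building a new table, the change itself is a single ALTER COLUMN, but any index or constraint that references the column has to be dropped first and recreated afterwards. A sketch, assuming the same varchar(10) NOT NULL definition as the test tables:
-- Sketch only: change the collation of the existing column
-- (drop any index/constraint on CTV3Code first, then recreate it afterwards)
ALTER TABLE dbo.tblSRCode ALTER COLUMN CTV3Code varchar(10) COLLATE Latin1_General_CS_AS NOT NULL;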
October 27, 2015 at 3:34 am
Thanks folks.
I've just run the same test Alan.B described and shown it to my boss.
I got (for a million rows) 203ms, 266ms and 47ms CPU time respectively.
He's still of the opinion that case-insensitive is the way to go. Apparently, making things case-sensitive will do more harm than good.
So, on that note, thanks for the help but it looks like it's a non-starter.