April 11, 2010 at 10:25 pm
Hi all,
I am surprised to discover that the default 'Latin1_General_CI_AI' collation for SQL Server 2008 has a sort order that is not what I expected.
I had expected that data would be sorted according to the standard ASCII collating sequence (eg. http://www.asciitable.com), and I now understand that I need to select a binary collation (eg. 'Latin1_General_100_BIN2') in order to get this result.
However, I haven't found any explanation of the sort sequence for the 'Latin1_General_CI_AI' collation.
As an example, if I run this T-SQL on a database with the 'Latin1_General_100_BIN2' collation:
create table tmp ( char_name char(1) not null, ascii_val smallint null )
insert into tmp ( char_name ) values ( '~' )
insert into tmp ( char_name ) values ( '!' )
insert into tmp ( char_name ) values ( '@' )
insert into tmp ( char_name ) values ( '#' )
insert into tmp ( char_name ) values ( '$' )
insert into tmp ( char_name ) values ( '%' )
insert into tmp ( char_name ) values ( '^' )
insert into tmp ( char_name ) values ( '&' )
insert into tmp ( char_name ) values ( '*' )
insert into tmp ( char_name ) values ( '(' )
insert into tmp ( char_name ) values ( ')' )
insert into tmp ( char_name ) values ( '_' )
insert into tmp ( char_name ) values ( '-' )
insert into tmp ( char_name ) values ( '+' )
insert into tmp ( char_name ) values ( '=' )
insert into tmp ( char_name ) values ( '{' )
insert into tmp ( char_name ) values ( '[' )
insert into tmp ( char_name ) values ( '}' )
insert into tmp ( char_name ) values ( ']' )
insert into tmp ( char_name ) values ( ':' )
insert into tmp ( char_name ) values ( ';' )
insert into tmp ( char_name ) values ( '"' )
insert into tmp ( char_name ) values ( '''' )
insert into tmp ( char_name ) values ( '|' )
insert into tmp ( char_name ) values ( '\' )
insert into tmp ( char_name ) values ( '<' )
insert into tmp ( char_name ) values ( ',' )
insert into tmp ( char_name ) values ( '>' )
insert into tmp ( char_name ) values ( '.' )
insert into tmp ( char_name ) values ( '?' )
insert into tmp ( char_name ) values ( '/' )
update tmp set ascii_val = ASCII(char_name)
select * from tmp order by char_name ASC
drop table tmp
I get the expected result:
!33
"34
#35
$36
%37
&38
'39
(40
)41
*42
+43
,44
-45
.46
/47
:58
;59
<60
=61
>62
?63
@64
[91
\92
]93
^94
_95
{123
|124
}125
~126
However, if I execute the same T-SQL on a database with the 'Latin1_General_CI_AI' collation, I get this result:
'39
-45
!33
"34
#35
$36
%37
&38
(40
)41
*42
,44
.46
/47
:58
;59
?63
@64
[91
\92
]93
^94
_95
{123
|124
}125
~126
+43
<60
=61
>62
Where does this strange, non-ASCII ordering come from?
April 12, 2010 at 12:27 am
Latin1_General_CI_AI is a Windows collation - SQL Server just uses a copy of the rules used by Windows. Latin1_General_CI_AI corresponds to US English (en-us) LCID 1033. You can use the COLLATIONPROPERTY built-in function to return information about a collation - such as its corresponding LCID.
The rules are, I believe, based around traditions established in particular languages and cultures over the years. Each collation is probably defined in some technical document or other, somewhere. The point is that using a given collation compares and sorts string data in the way expected by people - not according to binary values (or 'code point').
You might be more used to the way something like SQL_LATIN1_GENERAL_CP1_CI_AI sorts non-Unicode data. SQL collations are odd in that they sort non-Unicode data according to a code-page (1252 in this case) and the related Windows collation (Latin1_General_CI_AI) for Unicode data.
If you think all this is confusing, that's because it is.
CREATE TABLE #temp
(
u_char NCHAR(1) COLLATE LATIN1_GENERAL_BIN2,
a_char CHAR(1) COLLATE LATIN1_GENERAL_BIN2,
unicode_value AS UNICODE(u_char),
ascii_value AS ASCII(a_char)
);
GO
SET NOCOUNT ON;
DECLARE @index INTEGER;
SET @index = 0;
BEGIN TRANSACTION;
WHILE @index <= 255
BEGIN
INSERT #temp (u_char, a_char)
VALUES (NCHAR(@Index), CHAR(@Index));
END;
COMMIT TRANSACTION;
SELECT * FROM #temp ORDER BY a_char; -- Binary code point sort (collation of the column)
SELECT * FROM #temp ORDER BY u_char; -- Same binary code point sort (collation of the column)
-- Non-unicode characters sorted using a SQL collation
SELECT * FROM #temp ORDER BY a_char COLLATE SQL_LATIN1_GENERAL_CP1_CI_AI ASC;
-- Same characters (but unicode) sorted using a Windows collation
SELECT * FROM #temp ORDER BY u_char COLLATE LATIN1_GENERAL_CI_AI ASC;
GO
DROP TABLE #temp;
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
April 13, 2010 at 4:26 pm
Thanks Paul. Given that, as you way, "using a given collation compares and sorts string data in the way expected by people", I'm still not sure what "people" would expect these non-letter printable characters to be sorted in the strange sequence of the Latin1_General_CI_AI collation. But I'm happy to accept that it is what it is!
Does anyone else in the community know where to find the document that describes the sort sequence of the Latin1_General_CI_AI collation? I'm surprised that it's not referenced from BOL.
Regards,
David.
April 13, 2010 at 5:11 pm
SQLDS (4/13/2010)
Thanks Paul. Given that, as you way, "using a given collation compares and sorts string data in the way expected by people", I'm still not sure what "people" would expect these non-letter printable characters to be sorted in the strange sequence of the Latin1_General_CI_AI collation. But I'm happy to accept that it is what it is!
The best explanation I could find was that thing about the rules having been built up over centuries. I do realise that is not a hugely satisfying outcome 🙁
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi
Viewing 4 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply