(last updated: 2019-03-15 @ 02:45 EST / 2019-03-15 @ 06:45 UTC )
Binary collations are, in many ways, simpler to understand than non-binary collations because they don’t have any complex, culture-based linguistic rules. They just work with the character values and there is little room for mystery: all characters have underlying values; if two characters have the same underlying value then they are equal, else they are not; and the underlying value is the order, right?
Still, there are some complexities related to binary collations that you might not be aware of. To figure out what they are, we need to look at why there are so many binary collations in the first place. I mean, binary collations work on the underlying values of the characters, and comparing numbers doesn’t change between cultures or versions: 12 = 12, 12 > 11, and 12 < 13, always. So, then what is the difference between:
- Latin1_General_100_BIN2 and Hebrew_100_BIN2 (only the culture is different), or
- Latin1_General_100_BIN2 and Latin1_General_BIN2 (only the version is different), or
- Latin1_General_100_BIN2 and Latin1_General_100_BIN (only the binary comparison type is different)?
Let’s take a closer look at each of those three cases.
Different Cultures
From the names alone we can see that Latin1_General_100_BIN2 and Hebrew_100_BIN2 are both “_100_BIN2” collations, and the difference between them is just “Latin1_General” vs “Hebrew” (i.e. the culture). For NVARCHAR data this difference is entirely meaningless: the behavior of all binary collations of the same version and same binary comparison type is exactly the same.
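To make that concrete, here is a minimal sketch (the specific characters are just my examples; any SQL Server 2008 or newer instance should behave the same) showing that an NVARCHAR comparison comes out identically under the two cultures:

```sql
-- Sketch: the same NVARCHAR comparison under both cultures' _100_BIN2 collations.
-- U+05D1 (Hebrew Bet) vs U+00E1 (á): the result depends only on the underlying
-- values, so both columns return the same answer.
DECLARE @Bet    NVARCHAR(10) = NCHAR(0x05D1),
        @AAcute NVARCHAR(10) = NCHAR(0x00E1);

SELECT CASE WHEN @Bet > @AAcute COLLATE Latin1_General_100_BIN2
            THEN 1 ELSE 0 END AS [Latin1_General],
       CASE WHEN @Bet > @AAcute COLLATE Hebrew_100_BIN2
            THEN 1 ELSE 0 END AS [Hebrew];
-- Both columns return 1 (0x05D1 > 0x00E1).
```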
For VARCHAR data, however, there is an important difference. All collations have an associated code page, which is the character set used for 8-bit (i.e. VARCHAR) data. Even if that code page is 0, meaning “no code page”, it still determines what characters can be stored in 8-bit datatypes. And in that case, it would be “none”, which is why these collations are called “Unicode-only”, and why they don’t work with VARCHAR data:
```sql
SELECT COLLATIONPROPERTY(N'Syriac_100_BIN2', 'CodePage'); -- 0

SELECT 'not gonna work, trust me ;-)' COLLATE Syriac_100_BIN2;
/*
Msg 459, Level 16, State 1, Line XXXXX
Collation 'Syriac_100_BIN2' is supported on Unicode data types only
and cannot be applied to char, varchar or text data types.
*/
```
No code page means that there is no 8-bit (i.e. VARCHAR) character set. Of course, the difference in culture will be more obvious if we look at a few non-zero code pages:
```sql
-- Which 8-bit code page is used for each of these collations:
SELECT COLLATIONPROPERTY(N'Latin1_General_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Hebrew_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Korean_100_BIN2', 'CodePage');
-- 1252    1255    949
```
Now we will inject the same underlying values into each of those three code pages. Please note that we need to use values in the range of 128 – 255 (0x80 – 0xFF) since the 0 – 127 range (0x00 – 0x7F) is the same across all of the code pages that can be represented in SQL Server, including Unicode data (both UTF-16 via NVARCHAR and, starting in SQL Server 2019, UTF-8 via VARCHAR).
```sql
DECLARE @Data TABLE
(
  [CodePage-1252] VARCHAR(50) COLLATE Latin1_General_100_BIN2,
  [CodePage-1255] VARCHAR(50) COLLATE Hebrew_100_BIN2,
  [CodePage-949]  VARCHAR(50) COLLATE Korean_100_BIN2
);

INSERT INTO @Data VALUES (0xE1FA, 0xE1FA, 0xE1FA);
INSERT INTO @Data VALUES (0xE4E8, 0xE4E8, 0xE4E8);

SELECT * FROM @Data;
/*
CodePage-1252    CodePage-1255    CodePage-949
áú               בת               守
äè               הט               怏
*/
```
As you can see in the results, the exact same two bytes — 0xE1 and 0xFA — produce very different characters due to the code pages being different (code page 949 is a Double-Byte Character Set (DBCS), which is why only a single character is returned for the Korean collation). And the same is true for the 0xE4 and 0xE8 bytes. This is the only reason for having more than a single “_100_BIN2” collation. And this is also why Microsoft could have simply provided a single UnicodeOnly_90_BIN2 collation and a single UnicodeOnly_100_BIN2 collation instead of the 15 that exist across both of those versions:
```sql
SELECT ROW_NUMBER() OVER
       (
         PARTITION BY CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END
         ORDER BY     CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END,
                      col.[name]
       ),
       col.[name]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]BIN2'
AND    COLLATIONPROPERTY(col.[name], 'CodePage') = 0;
/*
1    Divehi_90_BIN2
2    Indic_General_90_BIN2
3    Syriac_90_BIN2
1    Assamese_100_BIN2
2    Bengali_100_BIN2
3    Divehi_100_BIN2
4    Indic_General_100_BIN2
5    Khmer_100_BIN2
6    Lao_100_BIN2
7    Maltese_100_BIN2
8    Maori_100_BIN2
9    Nepali_100_BIN2
10   Pashto_100_BIN2
11   Syriac_100_BIN2
12   Tibetan_100_BIN2
*/
```
Different Versions
From the names alone we can see that Latin1_General_100_BIN2 and Latin1_General_BIN2 are both “Latin1_General_*_BIN2” collations, and the difference between them is just “{no number; implied 80}” vs “100” (i.e. the version). For VARCHAR data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.
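A quick way to confirm this is to check the code page of each collation; a small sketch (the culture chosen here is just an example):

```sql
-- Same culture = same code page, regardless of collation version, which is why
-- VARCHAR behavior does not change between these two collations:
SELECT COLLATIONPROPERTY(N'Latin1_General_BIN2', 'CodePage')     AS [v80],
       COLLATIONPROPERTY(N'Latin1_General_100_BIN2', 'CodePage') AS [v100];
-- Both return: 1252
```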
For Unicode data (mainly NVARCHAR, but starting in SQL Server 2019 CTP 2.3, also VARCHAR if using the new UTF8_BIN2 collation), however, there is an important difference. All collations map code points to various properties such as: sort weights (used for sorting and comparisons), numerical values for “numbers”, uppercase and lowercase versions, etc. For example:
```sql
SELECT 1 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[3-5]';
-- {no rows returned}

SELECT 2 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[0-1]';
-- 2
```
But, it is those uppercase and lowercase mappings that are important here. There are quite a few code points that are missing these mappings, but as each new version of collations is introduced, more and more of the code points have their mappings added. For example, the following shows a code point that is missing its uppercase mapping in both version 80 and version 90 collations (if a mapping is missing in any particular version, then it’s missing in all prior versions as well, which is why I don’t need a field for the version 80 collation):
```sql
SELECT NCHAR(0x1FF3) AS [TheChar],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_90_BIN2)       AS [v90],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ῳ          ῳ      ῼ       ῼ
*/
```
And, the following shows a code point that is missing its uppercase mapping in version 80, 90, and 100 collations:
```sql
SELECT NCHAR(0x0250) AS [TheChar],
       UPPER(NCHAR(0x0250) COLLATE Japanese_90_BIN2)       AS [v90],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ɐ          ɐ      ɐ       Ɐ
*/
```
Ok, so how many of these mappings are missing from each version? I’m glad you asked 😉
The only way we have to determine which mappings are missing is by looking at the differences between the oldest version (highest number of missing mappings) and newest version (least number of missing mappings) collations. Logically, this means that since there are four versions — 80, 90, 100, and 140 — the version 80 data will naturally be “all missing” while the version 140 data will naturally be “none missing”. There might be additional mappings missing from all versions, but a) we have no way to determine that now as that requires cross-referencing the official Unicode Character Database, which is possible but way out of scope, and b) any potential missing mappings aren’t relevant to the question of “what are the differences between the existing collations available in SQL Server”.
The query below finds all code points for which the LOWER function returns the same character that was passed in when using a version 80 collation, but returns a different character when using a version 140 collation. For the version 90 and 100 collations, if the LOWER function returns a character that is different from what was passed in, the lowercase character is displayed, else it will be NULL. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no lowercase mapping for those characters.
```sql
;WITH nums AS
(
  SELECT TOP (65534) ROW_NUMBER() OVER (ORDER BY @@SPID) AS [num]
  FROM   master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2
), chars AS
(
  SELECT nums.[num] AS [CodePoint],
         CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
         NCHAR(nums.[num]) AS [ TheChar ],
         LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
         ' ' AS [----],
         IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2) <> NCHAR(nums.[num]),
             LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2), NULL) AS [v90],
         IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2) <> NCHAR(nums.[num]),
             LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2), NULL) AS [v100],
         IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2) <> NCHAR(nums.[num]),
             LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2), NULL) AS [v140]
  FROM   nums
  WHERE  LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
             = NCHAR(nums.[num])  -- mapping not in version 80 collations
  AND    LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
             <> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:',
       CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
       ' ' AS [----],
       CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;
/*
CodePoint   HexValue   TheChar   v80   --   v90    v100   v140
...
502         0x01F6     Ƕ         Ƕ          NULL   ƕ      ƕ
...
8498        0x2132     Ⅎ         Ⅎ          NULL   ⅎ      ⅎ
...
11374       0x2C6E     Ɱ         Ɱ          NULL   NULL   ɱ
...
42820       0xA744     Ꝅ         Ꝅ          NULL   NULL   ꝅ
...
NULL        NULL       TOTALS:   305        0      200    305
*/
```
I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 305 characters missing their lowercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 200 mappings, and the version 140 collations added another 105 mappings, giving it 305 mappings that are missing from the version 80 and 90 collations.
Interesting, but we aren’t done yet. The lowercase mappings are just one direction. We need to include the uppercase mappings as well. And, while it’s tempting, we should not assume that the missing uppercase mappings are merely the uppercase versions of the characters that are missing their lowercase mappings. Each code point can have an uppercase mapping, a lowercase mapping, both, or neither, and these mappings are completely independent, so just because one was added doesn’t mean that any other has been, even if it seems like it should have been.
The query below finds all code points for which the UPPER function returns the same character that was passed in when using a version 80 collation, but returns a different character when using a version 140 collation. For the version 90 and 100 collations, if the UPPER function returns a character that is different from what was passed in, the uppercase character is displayed, else it will be NULL. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no uppercase mapping for those characters.
```sql
;WITH nums AS
(
  SELECT TOP (65534) ROW_NUMBER() OVER (ORDER BY @@SPID) AS [num]
  FROM   master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2
), chars AS
(
  SELECT nums.[num] AS [CodePoint],
         CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
         NCHAR(nums.[num]) AS [ TheChar ],
         UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
         ' ' AS [----],
         IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2) <> NCHAR(nums.[num]),
             UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2), NULL) AS [v90],
         IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2) <> NCHAR(nums.[num]),
             UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2), NULL) AS [v100],
         IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2) <> NCHAR(nums.[num]),
             UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2), NULL) AS [v140]
  FROM   nums
  WHERE  UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
             = NCHAR(nums.[num])  -- mapping not in version 80 collations
  AND    UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
             <> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:',
       CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
       ' ' AS [----],
       CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;
/*
CodePoint   HexValue   TheChar   v80   --   v90    v100   v140
...
1165        0x048D     ҍ         ҍ          NULL   Ҍ      Ҍ
...
8097        0x1FA1     ᾡ         ᾡ          NULL   ᾩ      ᾩ
...
42805       0xA735     ꜵ         ꜵ          NULL   NULL   Ꜵ
...
42849       0xA761     ꝡ         ꝡ          NULL   NULL   Ꝡ
...
NULL        NULL       TOTALS:   344        0      238    344
*/
```
I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 344 characters missing their uppercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 238 mappings, and the version 140 collations added another 106 mappings, giving it 344 mappings that are missing from the version 80 and 90 collations.
Putting the data from both queries together, the total totals are:
- Version 80: missing 649 mappings (305 + 344)
- Version 90: missing 649 mappings (305 + 344)
- Version 100: added 438 mappings (200 + 238) over versions 80 and 90 ; only missing 211 mappings (105 + 106, or 649 – 438)
- Version 140: added 211 mappings (105 + 106) over version 100 ; added all 649 over versions 80 and 90
This data shows why:
- if possible, it is important to use the newest version available (version 100 was introduced in SQL Server 2008, 140 was introduced in SQL Server 2017)! Please keep in mind that these mappings are just one difference. There are also many thousands of sort weights missing from the earlier versions (almost 20k missing from version 80, almost 13k missing from version 90, and almost 5k missing from version 100).
- the UTF8_BIN2 collation, new in CTP 2.3 of SQL Server 2019, is better than having no binary UTF-8 collation (which was the case prior to CTP 2.3); however, being a version 80 collation, it’s missing 649 uppercase / lowercase mappings 😿 (see the quick check below)
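As a quick check of that last point (a sketch that assumes the CTP 2.3 UTF8_BIN2 collation is available; U+0250 is the same character used earlier, which only received its uppercase mapping in the version 140 collations):

```sql
-- U+0250 ("ɐ") gained its uppercase mapping only in the version 140 collations,
-- so a version 80 collation such as UTF8_BIN2 should return it unchanged:
SELECT NCHAR(0x0250) AS [TheChar],
       UPPER(NCHAR(0x0250) COLLATE UTF8_BIN2)              AS [UTF8_BIN2], -- ɐ
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_140_BIN2) AS [v140];      -- Ɐ
```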
Different Binary Comparison Types
From the names alone we can see that Latin1_General_100_BIN2 and Latin1_General_100_BIN are both “Latin1_General_100” collations, and the difference between them is just “_BIN” vs “_BIN2” (i.e. the binary comparison type). For non-UTF-8 VARCHAR data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.
For Unicode data (mainly NVARCHAR, but starting in SQL Server 2019 CTP 2.3, also VARCHAR if using the new UTF8_BIN2 collation), however, there is an important difference. When comparing values, the behavior of all binary collations is exactly the same. So, what we are interested in is sorting.
According to the Microsoft documentation page for Collation and Unicode Support:
In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)
That means the behavior is exactly the same for the first character only. So, if you are testing the difference between “_BIN” and “_BIN2” by sorting a list of single characters, you will never see a difference. In order to see the difference in behavior, you need to skip the first character. The following test shows this by providing both a field with a single character and a field with a prefix character that is the same for both rows, cancelling out the effect of the first character on the sorting:
```sql
SELECT tab.[ID],
       tab.[col] AS [TheChar],
       N'_' + tab.[col] AS [PrefixedChar],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE Latin1_General_100_BIN)
           AS [TheCharBIN],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col] COLLATE Latin1_General_100_BIN)
           AS [PrefixedBIN],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE Latin1_General_100_BIN2)
           AS [TheCharBIN2],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col] COLLATE Latin1_General_100_BIN2)
           AS [PrefixedBIN2]
FROM  (VALUES (1, NCHAR(0x1225)), (2, NCHAR(0x2113))) tab([ID], [col])
ORDER BY tab.[ID];
/*
ID   Char   Prefixed   --   Char   Prefixed   --   Char   Prefixed
                            BIN    BIN             BIN2   BIN2
1    ሥ      _ሥ         --   1      2          --   1      1
2    ℓ      _ℓ         --   2      1          --   2      2
*/
```
The “PrefixedBIN” field shows a difference because, starting at byte #3, the “_BIN” collation sees 0x25 and 0x13 (due to the byte-swapping of the Little Endian encoding), while the “_BIN2” collation sees 0x12 and 0x21. If that is not clear, maybe the following chart will help. And keep in mind that we are working with the “Prefixed” value, which effectively skips the first character since there is no sorting difference with that first position. The chart below shows each character and its underlying bytes starting at character position / index #2.
| CodePoint | Character | UTF-16 LE | BIN ordering   | BIN2 ordering  |
|-----------|-----------|-----------|----------------|----------------|
| U+1225    | ሥ         | 0x2512    | 0x25 then 0x12 | 0x12 then 0x25 |
| U+2113    | ℓ         | 0x1321    | 0x13 then 0x21 | 0x21 then 0x13 |
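You can verify the byte-swapping shown in the “UTF-16 LE” column directly; a quick sketch (any version of SQL Server should do):

```sql
-- Converting NVARCHAR to VARBINARY exposes the raw UTF-16 Little Endian bytes,
-- i.e. the byte-swapped form that the "_BIN" collations compare:
SELECT CONVERT(VARBINARY(2), NCHAR(0x1225)) AS [U+1225],  -- 0x2512
       CONVERT(VARBINARY(2), NCHAR(0x2113)) AS [U+2113];  -- 0x1321
```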
Supplementary Characters
This next query shows how Supplementary Characters are handled across “_BIN” and “_BIN2” collations, as well as non-binary collations of version 90 or higher:
```sql
SELECT tab.[ID],
       tab.[col] AS [TheChar],
       N'_' + tab.[col] AS [PrefixedChar],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE Latin1_General_100_BIN)
           AS [Char_BIN],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col] COLLATE Latin1_General_100_BIN)
           AS [Prefixed_BIN],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE Latin1_General_100_BIN2)
           AS [Char_BIN2],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col] COLLATE Latin1_General_100_BIN2)
           AS [Prefixed_BIN2],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE Japanese_90_CI_AS)
           AS [Char_CI_AS],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col] COLLATE Japanese_90_CI_AS)
           AS [Prefixed_CI_AS]
FROM  (VALUES (1, NCHAR(0xFB06)),
              (2, NCHAR(0xD802) + NCHAR(0xDC2C)),
              (3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];
```
The query above returns the following 3 rows:
| ID | Char | Prefixed | BIN: Char | BIN: Prefixed | BIN2: Char | BIN2: Prefixed | 90_CI_AS: Char | 90_CI_AS: Prefixed |
|----|------|----------|-----------|---------------|------------|----------------|----------------|--------------------|
| 1  | ﬆ    | _ﬆ       | 3         | 2             | 3          | 3              | 1              | 1                  |
| 2  | 𐠬   | _𐠬      | 1         | 1             | 1          | 1              | 2              | 2                  |
| 3  | 🙃   | _🙃      | 2         | 3             | 2          | 2              | 3              | 3                  |
The reason for including this query is to highlight two related nuances of working with UTF-16 (i.e. the Unicode encoding used by NVARCHAR data): one of terminology, and the other of behavior.
Looking at the order number columns (the six right-most columns), we can observe the following:
- Both the prefixed and non-prefixed columns for “_BIN2” sorted in the same order.
- The non-prefixed characters for both “_BIN” and “_BIN2” sorted in the same order.
- The non-prefixed and prefixed characters for “_BIN” did not sort in the same order.
So far, all of those behaviors are expected and consistent with what we have seen previously in this post.
But, the non-binary, version 90 collation columns (the right-most two columns) show an ordering that is both:
- the same between the non-prefixed and prefixed columns, yet
- different than either of the two other sort orders seen in the columns to the left
This difference in sort order exposes a technicality that is not often mentioned, probably because most of the time there is little practical benefit in being this technical. By now you are probably aware that the underlying values used in Unicode to uniquely identify each character are called “code points”. A code point can be encoded in various ways. The Unicode specification provides five options for encoding: UTF-8, UTF-16 Big Endian, UTF-16 Little Endian, UTF-32 Big Endian, and UTF-32 Little Endian (I am leaving out the non-Endian variations of UTF-16 and UTF-32 since they merely imply Big Endian). UTF-8 encodes each code point as a single entity of between 1 and 4 bytes. UTF-32 encodes each code point as a single entity of 4 bytes.
In contrast, UTF-16 encodes each code point as either one or two 2-byte entities. For the first 65,536 code points (the total number of combinations of 2 bytes, 0x0000 – 0xFFFF), everything said so far still holds true. But things get a little confusing when dealing with Supplementary Characters (i.e. all characters outside of the initial 65,536). Supplementary Characters, like all other characters, each have a unique code point. In UTF-8 and UTF-32 there is no difference, but in UTF-16 each Supplementary Character is made up of two 2-byte entities (i.e. a “surrogate pair”). So what does the term “code point” refer to? Does it refer to the actual Supplementary Character, or does it refer to each of the two-byte pieces that make up the Supplementary Character? It can’t be both. And it isn’t.
In UTF-16, the two-byte entities are actually known as “code units”. And, it just so happens that the first 65,536 code points are each made up of a single code unit having the same value (e.g. code point U+FB06 is code unit FB06). Supplementary Characters are code points comprised of two code units (e.g. code point U+1F643 is comprised of code units D83D and DE43).
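A quick way to see the code point vs code unit relationship for yourself; a small sketch (the surrogate pair is assembled manually, so this works in any version of SQL Server):

```sql
-- U+1F643 is a single code point, but in UTF-16 it is stored as the two code
-- units D83D and DE43 (a "surrogate pair"), which are byte-swapped on disk:
SELECT NCHAR(0xD83D) + NCHAR(0xDE43) AS [TheChar],                        -- 🙃
       CONVERT(VARBINARY(4), NCHAR(0xD83D) + NCHAR(0xDE43)) AS [UTF16LE]; -- 0x3DD843DE
```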
Looking back at the Microsoft documentation quote at the beginning of this section, we can now see that the statement, “In a BIN2 collation all characters are sorted according to their code points”, is not entirely accurate. Sure, for non-Supplementary Characters that is true, but dealing with Supplementary Characters shows us that the definition of BIN2 collations is technically incorrect. The BIN2 collations, when dealing with NVARCHAR data, sort by code unit, not by code point.
Am I being unfair, or overly nit-picky? No, because the query and results above show that there are three ways of sorting these values, and it is not possible to explain this behavior without understanding the difference between code unit and code point. Hopefully the chart below will help visualize the difference in these three sorting methods. This chart shows the same three characters in the same progression as the results above: BIN, then BIN2, and then non-binary-version-90-or-newer. But this time I’m showing how the sorting algorithm saw each character, which will help explain why each approach did something different.
- Binary (_BIN): this algorithm goes byte-by-byte (after the first character, so after the first two bytes). And, since we are working on a Little Endian platform, the bytes of each base entity are reversed. A UTF-16 entity (i.e. code unit) is always 2 bytes. The value in parentheses in the “Binary” column is the reversed first code unit, which is why the first byte in the first row is 0x02 instead of 0xD8, and the first byte in the second row is 0x06 instead of 0xFB.
- Code Unit (_BIN2): this algorithm goes code unit-by-code unit. A code unit’s value stays the same no matter how it is encoded, so the number of bytes, and the physical order they are in, does not matter. There is no need to show a different representation in parentheses in the “CodeUnit” column because D802 is D802, even if that code unit is stored as 0x02D8 on disk.
- Code Point (non-binary-version-90-or-newer): this algorithm goes code point-by-code point (for Supplementary Characters only; otherwise non-binary collations use culture-specific linguistic rules). The value in parentheses in the “CodePoint” column is the actual code point, and that is the value used for sorting (not the bytes or the code units).
| Binary (_BIN)     | CodeUnit (_BIN2) | CodePoint (_90_CI_AS) |
|-------------------|------------------|-----------------------|
| 0xD802DC2C (02D8) | 0xD802DC2C       | 0xFB06 (U+FB06)       |
| 0xFB06 (06FB)     | 0xD83DDE43       | 0xD802DC2C (U+1082C)  |
| 0xD83DDE43 (3DD8) | 0xFB06           | 0xD83DDE43 (U+1F643)  |
Given that the “_BIN2” collations are described as being “code point” ordered yet are actually “code unit” ordered, which one is correct? According to the following quote from the official Unicode standard (i.e. the specification), in Chapter 2: General Structure, Section 2.5: Encoding Forms (page 39):
All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order.
it would appear that the behavior is correct, and the documentation should probably be updated.
UTF-8 Bonus Round
Starting in SQL Server 2019 it’s possible to use the UTF-8 encoding.
The Unicode specification (same section / page as quoted directly above) states:
A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings.
We can test with the following query:
```sql
SELECT tab.[ID],
       tab.[col] AS [TheCharNVC],
       CONVERT(VARCHAR(10), tab.[col] COLLATE UTF8_BIN2) AS [TheCharVC],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col] COLLATE UTF8_BIN2)
           AS [CharNVC_UTF8_BIN2],
       ROW_NUMBER() OVER (ORDER BY CONVERT(VARCHAR(10), tab.[col] COLLATE UTF8_BIN2))
           AS [CharVC_UTF8_BIN2]
FROM  (VALUES (1, NCHAR(0xFB06)),
              (2, NCHAR(0xD802) + NCHAR(0xDC2C)),
              (3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];
```
which returns:
| ID | TheCharNVC (UTF-16) | TheCharVC (UTF-8) | CharNVC_UTF8_BIN2 (UTF-16) | CharVC_UTF8_BIN2 (UTF-8) |
|----|---------------------|-------------------|----------------------------|--------------------------|
| 1  | ﬆ                   | ﬆ                 | 3                          | 1                        |
| 2  | 𐠬                  | 𐠬                | 1                          | 2                        |
| 3  | 🙃                  | 🙃                | 2                          | 3                        |
The NVARCHAR columns show us that we are dealing with the same three characters between the UTF-8 and UTF-16 columns, that the ordering is consistent with what we saw previously for NVARCHAR data, and that the ordering is different than the UTF-8 column. We are now interested in UTF-8 behavior, and that can only be found when working with VARCHAR data (and using a UTF-8 collation, of course). The chart above shows that yes, the UTF8_BIN2 collation (new in SQL Server 2019 CTP 2.3) does indeed use the code point value for ordering.
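As one final illustration (a sketch assuming SQL Server 2019 CTP 2.3 or newer for the UTF8_BIN2 collation), comparing the raw bytes of the two encodings for U+1082C shows why: the UTF-16 LE bytes begin with the byte-swapped high surrogate, while the UTF-8 byte sequence preserves code point order:

```sql
-- U+1082C as raw bytes in each encoding. The UTF-16 LE form starts with the
-- byte-swapped high surrogate (0x02D8...), while the UTF-8 byte sequence
-- sorts in the same relative order as the code points themselves:
SELECT CONVERT(VARBINARY(8), NCHAR(0xD802) + NCHAR(0xDC2C)) AS [UTF16LE],
       CONVERT(VARBINARY(8),
               CONVERT(VARCHAR(10),
                       (NCHAR(0xD802) + NCHAR(0xDC2C)) COLLATE UTF8_BIN2)
              ) AS [UTF8];
-- UTF16LE = 0x02D82CDC ; UTF8 = 0xF090A0AC
```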