(last updated: 2019-03-26 @ 23:00 EST / 2019-03-27 @ 03:00 UTC)
The Problem
I do a fair bit of research and writing about collations, encodings, Unicode, etc., and have found that in order to do thorough research, I often need to make use of non-standard-ASCII characters. Sometimes this is as easy as doing:

SELECT NCHAR(0xHHHH);

where HHHH is 4 hex digits (i.e. 0-9 and A-F). Of course, while easy, it does take some playing around with different combinations before finding something interesting to work with.
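For example (these are arbitrary picks, just to show the pattern):

SELECT NCHAR(0x0042) AS [U+0042],  -- "B" (Latin Capital Letter B)
       NCHAR(0x00E9) AS [U+00E9],  -- "é" (Latin Small Letter E with Acute)
       NCHAR(0x263A) AS [U+263A];  -- "☺" (White Smiling Face)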
This gets even more difficult with Supplementary Characters. These are characters with code points above U+FFFF (i.e. values of 65,536 or higher, meaning they are beyond the max 2-byte range). Ideally, you would just pass that higher value into the NCHAR() function. Unicode supports mapping up to 1,114,112 code points (17 “planes” of 65,536 each), though only around 200k have been allocated so far. So, just pick a value of 65,536 (the first code point is 0, not 1) or higher, right? Well, there’s a catch: SQL Server was initially built around UCS-2, which maps the exact same initial 65,536 code points as UTF-16. The difference is that UTF-16 has supplementary characters (the remaining 1,048,576 code points), which are composed of two code points from the initial / lower 65,536 code points. These combinations are known as surrogate pairs, and there are 2,048 code points reserved for this purpose. Both UCS-2 and UTF-16 contain the same set of 2,048 code points, but UTF-16 actually uses them, while in UCS-2 they are merely reserved for future use.
Getting back to the question… Why can’t we just do:
SELECT NCHAR(0x12345);
? It all depends on collation. The first several versions of collations operate from a UCS-2 perspective, where there are no supplementary characters or surrogate pairs. Thankfully, collations were added in SQL Server 2012 that support supplementary characters, and their names end with “_SC” (“SC” for “Supplementary Characters”, if that was not obvious). More collations were added in SQL Server 2017, and all of them except for the binary collations support supplementary characters. But their names don’t end in “_SC” since it is no longer a separate option that needs to be selected; starting with these newest collations, support for supplementary characters is assumed. Now you just look for “_140_” in the name (again, excluding the binary collations).
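If you want to see which collations on your instance are supplementary character-aware, something along these lines should work (sys.fn_helpcollations() is the built-in catalog function; the LIKE patterns are just one way to express the naming rules described above):

-- Collations that support supplementary characters: names containing "_SC",
-- plus the version-140 collations, excluding the binary ones.
SELECT [name], [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'%[_]SC%'
   OR  ([name] LIKE N'%[_]140[_]%' AND [name] NOT LIKE N'%[_]BIN%');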
Problem solved? Not quite. For most of the built-in string functions you can use the COLLATE clause to force a supplementary character-aware collation:

DECLARE @Thingy NVARCHAR(10) = N'𒅑';

SELECT @Thingy AS [U+12151],
       LEN(@Thingy COLLATE Latin1_General_100_CI_AS) AS [Latin1_non-SC],
       LEN(@Thingy COLLATE Latin1_General_100_CI_AS_SC) AS [Latin1_SC],
       LEN(@Thingy COLLATE Japanese_XJIS_140_CI_AS) AS [140_non-binary],
       LEN(@Thingy COLLATE Japanese_XJIS_140_BIN2) AS [140_binary];
/*
U+12151    Latin1_non-SC    Latin1_SC    140_non-binary    140_binary
𒅑          2                1            1                 2
*/
As you can see in the output shown above, the supplementary character-aware collations return a length of 1, while the other collations return a length of 2 because they are only able to see the surrogate pair as two separate code points. The character itself is stored correctly and displays correctly regardless of the collation.
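In every case the character is stored as the same 4 bytes (a surrogate pair is two 2-byte code units); only how LEN interprets those bytes changes. If you want to confirm that, DATALENGTH counts bytes and is not affected by collation:

DECLARE @Thingy NVARCHAR(10) = N'𒅑'; -- same U+12151 character as above

SELECT DATALENGTH(@Thingy) AS [Bytes],                                          -- 4
       DATALENGTH(@Thingy COLLATE Latin1_General_100_CI_AS) AS [Bytes_non-SC],  -- 4
       DATALENGTH(@Thingy COLLATE Latin1_General_100_CI_AS_SC) AS [Bytes_SC];   -- 4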
But, you can only attach the COLLATE clause to strings, and you don’t pass a string into the NCHAR() function. So, the only way to get the NCHAR() function to create a supplementary character is to execute it in a database that has a supplementary character-aware default collation. Well, that’s not always possible.
Fortunately there is another way to do this. Since supplementary characters are made up of surrogate pairs, we can just create the two surrogate code points. Or, we can just convert the UTF-16LE byte sequence into NVARCHAR. For example:

SELECT NCHAR(0xD808) + NCHAR(0xDD51) AS [SurrogatePair],
       CONVERT(NVARCHAR(10), 0x08D851DD) AS [UTF-16LE_bytes];
/*
SurrogatePair    UTF-16LE_bytes
𒅑                𒅑
*/
Since neither of those methods requires a supplementary character-aware collation, both can be used on versions of SQL Server prior to 2012!!
Well, we solved one problem but immediately ran into another: how to find either the surrogate pair or the UTF-16LE byte sequence for any given supplementary character?
Solutions: Old and New
I used to go to the Emoticons (Emoji) 1F600—1F64F page of unicode-table.com to copy and paste characters, code points, or check the encoding chart at the bottom of each character page (the “hex” column of both the “UTF-16BE” and “UTF-16LE” rows has proven most useful).
But not anymore. Now, I just hit: Ctrl + 0. When I do that, I get a list of 188,657 code points. Each row contains the official code point value (“U+HHHH”), the integer value, the hex value (“0xHHHH”), the character itself, the UTF-16 Little Endian byte sequence (how it is actually stored, and what you get if you convert an NVARCHAR value to VARBINARY), the surrogate pair values, the T-SQL notation (which does not require using an _SC or _140_ collation), the HTML notation (“&#xHHHH;”), and finally the C-style notation (“xHHHH”; used for C / C++ / C# / Java / etc.). I can copy and paste any of those values and use them in queries, emails, blog posts, .NET code, and so on.
Here are a few rows of what I get back:
Code Point | CdPnt INT | CdPnt BIN | Char | UTF-16LE | HighSrgt INT | LowSrgt INT | HighSrgt BIN | LowSrgt BIN | T-SQL | HTML | C/C++/C#/Java/? |
---|---|---|---|---|---|---|---|---|---|---|---|
U+006D | 109 | 0x006D | m | 0x6D00 | NULL | NULL | NULL | NULL | NCHAR(0x006D) | &#x006D; | x006D |
U+0959 | 2393 | 0x0959 | ख़ | 0x5909 | NULL | NULL | NULL | NULL | NCHAR(0x0959) | &#x0959; | x0959 |
U+10A26 | 68134 | 0x10A26 | 𐨦 | 0x02D826DE | 55298 | 56870 | 0xD802 | 0xDE26 | NCHAR(0xD802) + NCHAR(0xDE26) | &#x10A26; | xD802xDE26 |
The Code
For Viewing All Code Points / Filtering / Sorting on Something Besides the Code Point
The code below creates an inline table-valued function (iTVF) that returns a list of Unicode code points along with various ways of looking at the value, and notations for creating the code points in various languages.
The nums CTE simply generates the maximum addressable number of code points, starting at 0. I used @@MICROSOFTVERSION, a system global variable, for the ORDER BY since it returns an INT and never changes. Some people use @@SPID, which is shorter, but changes with every session. That probably doesn’t matter, but I just wanted something as close to a constant as possible (constants are not allowed), for reasons that could very well be irrational (since I don’t know for certain that a differing value would cause any additional optimization time).
The chars CTE passes along the generated number and calculates the two code points used to create supplementary characters in UTF-16, known as a surrogate pair. Surrogate pairs can only be created from combinations of a specific 2,048 code points, and those 2,048 code points have no other purpose (meaning, they are not valid in any way if used individually, or even out of sequence, as in a reversed pair, since some can only be first in the pair, while the others can only be second in the pair). This is why the two “…SurrogateINT” expressions return NULL for code points under 65,536 / U+10000.
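To make the math concrete, here is the same calculation worked out for U+10A26 (decimal 68134), the Kharoshthi letter shown in the sample rows earlier; these are just the expressions from the chars CTE with the value plugged in:

-- High surrogate: 55232 + (68134 / 1024) = 55232 + 66  = 55298 = 0xD802
-- Low surrogate:  56320 + (68134 % 1024) = 56320 + 550 = 56870 = 0xDE26
SELECT 55232 + (68134 / 1024) AS [HighSurrogateINT],
       56320 + (68134 % 1024) AS [LowSurrogateINT],
       NCHAR(55232 + (68134 / 1024)) + NCHAR(56320 + (68134 % 1024)) AS [Char];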
Also, the chars CTE filters out most of the unallocated code points since there is no reason to return 925,455 rows of code points that can’t possibly be used. And yes, there are some / several code points in the 0x00000 – 0x14700 and 0x16800 – 0x30000 ranges that are unallocated, but it didn’t seem worth adding lots of AND conditions to filter out several small ranges. Still, I might come back later and do that if I find that there are one or two ranges that have a significant number of unallocated rows.
The main query just puts the pieces from the chars CTE together in different ways. The reason for having the CASE WHEN n.[num] > 65535 THEN ... pattern in several places is for formatting. Code points are expressed in hex digits (e.g. “U+1234”) and are often created using hex digits: “x1234” in C-style, “NCHAR(0x1234)” in T-SQL, “&#x1234;” in HTML (yes, T-SQL and HTML can also be expressed as integer values, but I find it easier to use the hex as it is more consistent, and hence makes it more likely that I will start to remember some of them as I see the same pattern of hex digits; although, I have yet to remember the code point or surrogate pair for 💩). Anywho, it’s easy enough to get the hex representation in T-SQL from the original row number coming from the nums CTE, simply by doing CONVERT(BINARY(n), ...). And when dealing with the main BMP range (i.e. the lower 65,536 code points that are the most commonly used anyway), there are no formatting issues since code points are expressed with a minimum of 4 hex digits (e.g. “U+0042”). The problem can be seen when going above that range and working with supplementary characters. The issue is that when doing the conversion in T-SQL, any leading 0 will be included. And with the highest value for the third byte (i.e. the “xy” in “U+xy1234”) being 10, most of the time that “x” of the “xy” will be 0. That would render the code point for the “pile of poo” emoji as “01F4A9” instead of “1F4A9”. There is no functional problem here (the leading 0 doesn’t hurt anything), but it looks much nicer (and conforms to how code points are expressed everywhere else) to remove the leading 0.
For the “UTF-16LE”, “T-SQL”, and “HTML” column aliases: the extra spaces to the right of the word (e.g. [HTML{extra spaces}]) are there for a reason. Those extra spaces enforce a minimum width for the column in the results so that, as you scroll down, you won’t have to resize any of them just because the first 65,536 rows for the “T-SQL” column are “NCHAR(0xHHHH)” while the remaining 123,121 rows are “NCHAR(0xHHHH) + NCHAR(0xHHHH)”, which is just over twice as wide as those initial rows. You may need to adjust (add or remove spaces) since the width is controlled by the font used for Grid Results.
I added extra newlines after each field in the main select so that it would be easier to read.
GO
CREATE FUNCTION dbo.ListAllUnicodeCodePoints()
RETURNS TABLE
AS RETURN

WITH nums AS
(
    SELECT TOP (1114111) (ROW_NUMBER() OVER
                          (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
    FROM   [master].[sys].[all_columns] ac1
    CROSS JOIN [master].[sys].[all_columns] ac2
), chars AS
(
    SELECT n.[num], -- pass-through

           RIGHT(CONVERT(CHAR(6), CONVERT(BINARY(3), n.[num]), 2),
                 CASE WHEN n.[num] > 65535 THEN 5 ELSE 4 END) AS [CodePointHex],

           CASE WHEN n.[num] > 65535
                THEN 55232 + (n.[num] / 1024)
                ELSE NULL
           END AS [HighSurrogateINT],

           CASE WHEN n.[num] > 65535
                THEN 56320 + (n.[num] % 1024)
                ELSE NULL
           END AS [LowSurrogateINT]
    FROM   nums n
    WHERE  n.[num] BETWEEN 0x000000 AND 0x014700 -- filter out 925,455
       OR  n.[num] BETWEEN 0x016800 AND 0x030000 -- unmapped code
       OR  n.[num] BETWEEN 0x0E0001 AND 0x0E01EF -- points
)
SELECT
    'U+' + c.[CodePointHex] AS [CodePoint],

    c.[num] AS [CdPntINT],

    '0x' + c.[CodePointHex] AS [CdPntBIN],

    CASE WHEN c.[num] > 65535
         THEN NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])
         ELSE NCHAR(c.[num])
    END AS [Char],

    CASE WHEN c.[num] > 65535
         THEN CONVERT(CHAR(10), CONVERT(BINARY(4),
                NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])), 1)
         ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), NCHAR(c.[num])), 1)
    END AS [UTF-16LE      ],

    c.[HighSurrogateINT] AS [HighSrgtINT],

    c.[LowSurrogateINT] AS [LowSrgtINT],

    CONVERT(BINARY(2), c.[HighSurrogateINT]) AS [HighSrgtBIN],

    CONVERT(BINARY(2), c.[LowSurrogateINT]) AS [LowSrgtBIN],

    'NCHAR(' +
    CASE WHEN c.[num] > 65535
         THEN CONVERT(CHAR(6), CONVERT(BINARY(2), c.[HighSurrogateINT]), 1) +
              ') + NCHAR(' +
              CONVERT(CHAR(6), CONVERT(BINARY(2), c.[LowSurrogateINT]), 1)
         ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), c.[num]), 1)
    END + ')' AS [T-SQL                           ],

    '&#x' + c.[CodePointHex] + ';' AS [HTML          ],

    'x' +
    CASE WHEN c.[num] > 65535
         THEN CONVERT(CHAR(4), CONVERT(BINARY(2), c.[HighSurrogateINT]), 2) +
              'x' +
              CONVERT(CHAR(4), CONVERT(BINARY(2), c.[LowSurrogateINT]), 2)
         ELSE CONVERT(CHAR(4), CONVERT(BINARY(2), c.[num]), 2)
    END AS [C/C++/C#/Java/?]
FROM chars c;
GO
Test
List all 188,657 code points:
SELECT cp.* FROM dbo.ListAllUnicodeCodePoints() cp;
List some emoji:
SELECT cp.* FROM dbo.ListAllUnicodeCodePoints() cp WHERE cp.[CdPntINT] BETWEEN 0x1F000 AND 0x1F9FF;
List the Tibetan characters, sorted naturally for that language:
SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints() cp
WHERE  cp.[CdPntINT] BETWEEN 0x0F00 AND 0x0FFF -- Tibetan
ORDER BY cp.[Char] COLLATE Nepali_100_CS_AS;
Use directly as a Keyboard Query shortcut
A query shortcut allows you to execute code simply by hitting Ctrl and a number key. Cool. But first, there are a few things to know about query shortcuts:
- The window in SSMS where you configure the query shortcuts has the text field for the shortcut labeled as “Stored Procedure”, which is misleading because you can specify a query. You can even specify multiple queries.
- Whatever you specify needs to be a single line: no newlines / CRLFs. Any text past the first return will be truncated.
- When using the keyboard query shortcuts, if nothing is highlighted then only the code in the shortcut is executed. But, if any T-SQL is highlighted when you execute the shortcut, then the highlighted code is executed after the code stored in the shortcut finishes. Adding a RETURN; at the end of the shortcut simply stops the processing after the code stored in the shortcut finishes. BUT, if anything is highlighted when you execute a query shortcut, it is still parsed, even if not executed. Hence, you can still get parse errors even with the RETURN; added at the end.
Calling a Function
You can create the iTVF shown above in a utility / admin / common database, and then reference that in the shortcut as follows:
SELECT cp.* FROM [your_Utility_DB].dbo.ListAllUnicodeCodePoints() cp; RETURN;
You can add filtering and/or sorting as well. Just remember to keep everything on a single line:
SELECT cp.* FROM [your_Utility_DB].dbo.ListAllUnicodeCodePoints() cp WHERE cp.[CdPntINT] BETWEEN 0x1F000 AND 0x1F9FF; RETURN;
To store any variation of that in a query shortcut, do the following (in SSMS, of course):
- Select the entire line from one of the boxes above, or from the query editor in SSMS
- Hit Ctrl + C to copy the query to the clipboard
- Switch to SSMS (if not already there)
- Go to the “Tools” menu
- Go to “Options…”
- Go to “Environment”
- Go to “Keyboard”
- Go to “Query shortcuts”
- Scroll to the end of the “Query shortcuts:” list
- Click the cursor in the “Stored Procedure” column of the “Ctrl+0” row (or any row you want, really)
- Hit Ctrl + V to paste the query into the text field
- Change “your_Utility_DB” to the name of the database where you created the function
- Click the “OK” button
- PLEASE NOTE: the change will not affect any currently open tab. You will be able to use the shortcut starting with the next tab you open. To be clear: you do not need to restart SSMS for this change to take effect (like you do when changing the Grid Results font, for example)
No Function Needed
There might be situations in which you cannot guarantee the existence of a particular function or even database (such as your utility database). In that case, the query to return this list can be reduced to a single line so that it will work by itself as a query shortcut. By doing this, the query will not be dependent on the server you are connected to in order to function. Of course, this also means that if a change is to be made to the query and there are several people at the same place, all with this stand-alone query, then that change needs to be made in each person’s SSMS, as opposed to making the change one time to one function.
The following is a one-line representation of the query we placed into the ListAllUnicodeCodePoints function, plus the RETURN;:
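(This is simply the body of the function shown earlier collapsed onto a single line, with the inline comments dropped, since on one line they would comment out everything after them, and with RETURN; appended at the end.)

WITH nums AS (SELECT TOP (1114111) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num] FROM [master].[sys].[all_columns] ac1 CROSS JOIN [master].[sys].[all_columns] ac2), chars AS (SELECT n.[num], RIGHT(CONVERT(CHAR(6), CONVERT(BINARY(3), n.[num]), 2), CASE WHEN n.[num] > 65535 THEN 5 ELSE 4 END) AS [CodePointHex], CASE WHEN n.[num] > 65535 THEN 55232 + (n.[num] / 1024) ELSE NULL END AS [HighSurrogateINT], CASE WHEN n.[num] > 65535 THEN 56320 + (n.[num] % 1024) ELSE NULL END AS [LowSurrogateINT] FROM nums n WHERE n.[num] BETWEEN 0x000000 AND 0x014700 OR n.[num] BETWEEN 0x016800 AND 0x030000 OR n.[num] BETWEEN 0x0E0001 AND 0x0E01EF) SELECT 'U+' + c.[CodePointHex] AS [CodePoint], c.[num] AS [CdPntINT], '0x' + c.[CodePointHex] AS [CdPntBIN], CASE WHEN c.[num] > 65535 THEN NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT]) ELSE NCHAR(c.[num]) END AS [Char], CASE WHEN c.[num] > 65535 THEN CONVERT(CHAR(10), CONVERT(BINARY(4), NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])), 1) ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), NCHAR(c.[num])), 1) END AS [UTF-16LE      ], c.[HighSurrogateINT] AS [HighSrgtINT], c.[LowSurrogateINT] AS [LowSrgtINT], CONVERT(BINARY(2), c.[HighSurrogateINT]) AS [HighSrgtBIN], CONVERT(BINARY(2), c.[LowSurrogateINT]) AS [LowSrgtBIN], 'NCHAR(' + CASE WHEN c.[num] > 65535 THEN CONVERT(CHAR(6), CONVERT(BINARY(2), c.[HighSurrogateINT]), 1) + ') + NCHAR(' + CONVERT(CHAR(6), CONVERT(BINARY(2), c.[LowSurrogateINT]), 1) ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), c.[num]), 1) END + ')' AS [T-SQL                           ], '&#x' + c.[CodePointHex] + ';' AS [HTML          ], 'x' + CASE WHEN c.[num] > 65535 THEN CONVERT(CHAR(4), CONVERT(BINARY(2), c.[HighSurrogateINT]), 2) + 'x' + CONVERT(CHAR(4), CONVERT(BINARY(2), c.[LowSurrogateINT]), 2) ELSE CONVERT(CHAR(4), CONVERT(BINARY(2), c.[num]), 2) END AS [C/C++/C#/Java/?] FROM chars c; RETURN;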
To store that in a query shortcut, do the following (in SSMS, of course):
- Select the entire query in the box above
- Hit Ctrl + C to copy the query to the clipboard
- Switch to SSMS
- Go to the “Tools” menu
- Go to “Options…”
- Go to “Environment”
- Go to “Keyboard”
- Go to “Query shortcuts”
- Scroll to the end of the “Query shortcuts:” list
- Click the cursor in the “Stored Procedure” column of the “Ctrl+0” row (or any row you want, really)
- Hit Ctrl + V to paste the query into the text field
- Click the “OK” button
- PLEASE NOTE: the change will not affect any currently open tab. You will be able to use the shortcut starting with the next tab you open. To be clear: you do not need to restart SSMS for this change to take effect (like you do when changing the Grid Results font, for example)
Viewing Unicode Characters
Working with Unicode can sometimes be a little confusing because it is not all-or-none. Unicode is a specification, not software, so how it behaves depends on how a vendor implemented (or did not implement) any part of the specification. On top of that, there are different versions, and newer versions not only include new characters, but they sometimes make changes / correct sort weights and categorizations from previous versions. So, here are some things to keep in mind when looking through the list generated by ListAllUnicodeCodePoints().
How a code point behaves and how it appears are controlled by different systems. Behavior — comparison, sorting, uppercase / lowercase mappings — is controlled by SQL Server, in most cases by a collation. Newer versions of collations generally contain more definitions to guide those behaviors. The categorization of characters that are and are not valid for identifiers (i.e. names of any object in the database) is based on Unicode 3.2 and does not change across versions of SQL Server, or collation, or anything. (please see: “The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers”, Part 1 and Part 2).
However, how a code point appears is controlled by fonts, and has nothing to do with SQL Server. Fonts get updated, and new fonts get created, much more frequently than new collation versions get released. Many code points have been added, and some updated, since any collation version was released, and there is no way to access any of those new or updated definitions until a new collation version is released. But, you are still able to see newly defined code points as long as the font you are using has incorporated the new definition(s). Or, at least there is the possibility of seeing new code points.
The other tricky part of this puzzle is that fonts don’t contain definitions for all code points. I don’t believe it is even possible that they could. Fonts, I believe, are limited to 65,536 characters, and there are at least twice as many code points defined now. Some fonts have very few characters defined. To get a better sense of this, take a look at this page on fontspace.com that lists Unicode characters (page 313). On the right side it shows how many fonts include each particular character. The range on that page appears to be 2 – 19.
What all of this means is that, for any code point, if you see a character and not just a square box, then it is clearly defined. For any code point that appears as a square box, it might be undefined (by Unicode), or it might simply be missing from the font that you are using. And keep in mind that sometimes fonts don’t define certain characters correctly, or sometimes they might appear correct, but they might have combining properties that do not behave correctly, and thus do not render as expected when used with other characters.
That being said, I have had the best luck with the Code2003 font by James Kass. That font claims to map over 65,000 characters, and they look a lot better than the GNU Unifont characters. I use that font in my Grid Results:
- Download the font
- Install the font
- Switch to SSMS
- Go to the “Tools” menu
- Go to “Options…”
- Go to “Environment”
- Go to “Fonts and Colors”
- In the “Show settings for:” drop-down, select “Grid Results”
- In the “Font” drop-down, select “Code2003”
- Click the “OK” button
- This change does require exiting and restarting SSMS before you will see the change.
Keep in mind, if you see a character that you want to use, and then copy it to the clipboard and paste it somewhere (such as the query editor, or an email, or Notepad++, etc.) and it shows up as a square box, that only means that the character isn’t in the font used by that other program. But the character is still the character that you had copied, even if you can’t see it; the byte sequence is correct (assuming that you are pasting into a place that can handle Unicode and isn’t 8-bit / Extended ASCII only, although there you would probably just get a “?” instead).
UTF-8 / SQL Server 2019
If you are using SQL Server 2019, then you have the option of adding the following block of code to the function to get an extra column showing the UTF-8 encoded bytes. Just paste it in directly after the UTF-16LE column:

CASE -- SQL Server 2019 and newer only!
  WHEN c.[num] > 65535
    THEN CONVERT(CHAR(10), CONVERT(BINARY(4), CONVERT(CHAR(4),
           NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])
             COLLATE Latin1_General_100_CI_AS_SC_UTF8)), 1)
  ELSE CONVERT(CHAR(10), CONVERT(VARBINARY(3), CONVERT(VARCHAR(3),
           NCHAR(c.[num]) COLLATE Latin1_General_100_CI_AS_SC_UTF8)), 1)
END AS [UTF-8     ],
Here are a few rows of what I get back:
Code Point | CdPnt INT | CdPnt BIN | Char | UTF-16LE | UTF-8 | HighSrgt INT | LowSrgt INT | HighSrgt BIN | LowSrgt BIN | T-SQL | HTML | C/C++/C#/Java/? |
---|---|---|---|---|---|---|---|---|---|---|---|---|
U+006D | 109 | 0x006D | m | 0x6D00 | 0x6D | NULL | NULL | NULL | NULL | NCHAR(0x006D) | &#x006D; | x006D |
U+0478 | 1144 | 0x0478 | Ѹ | 0x7804 | 0xD1B8 | NULL | NULL | NULL | NULL | NCHAR(0x0478) | &#x0478; | x0478 |
U+0959 | 2393 | 0x0959 | ख़ | 0x5909 | 0xE0A599 | NULL | NULL | NULL | NULL | NCHAR(0x0959) | &#x0959; | x0959 |
U+10A26 | 68134 | 0x10A26 | 𐨦 | 0x02D826DE | 0xF090A8A6 | 55298 | 56870 | 0xD802 | 0xDE26 | NCHAR(0xD802) + NCHAR(0xDE26) | &#x10A26; | xD802xDE26 |
End Notes
The script to create this function, including a commented-out UTF-8 column that you should uncomment if you’re using SQL Server 2019 (or newer, someday), can be found on PasteBin.com: T-SQL Function (iTVF) to List All Unicode Code Points.
This is just a temporary Phase 1 of a much larger project that I am working on. This function will eventually be replaced by something even better.
But for now, enjoy…. ?? ??