(last updated: 2018-04-09)
In answering a recent question on DBA.StackExchange related to why some characters work for parameter names and others do not (How to create Unicode stored procedure parameter names), I pointed out that the documentation for Database Identifiers states (slightly edited for readability):
Rules for Regular Identifiers
The first character must be one of the following:
- A letter as defined by the Unicode Standard 3.2. The Unicode definition of letters includes Latin characters from a through z and from A through Z, as well as letter characters from other languages.
- The underscore (_), at sign (@), or number sign (#).

Subsequent characters can include the following:
- Letters as defined in the Unicode Standard 3.2.
- Decimal numbers from either Basic Latin or other national scripts.
- The at sign, dollar sign ($), number sign, or underscore.
Ok. So what are the valid “letters” and “decimal numbers” from other national scripts? There are 65,536 Unicode code points (often referred to as “characters”) that are not supplementary characters. While characters in blocks such as Arrows, Mathematical Operators, Box Drawing, Dingbats, etc. clearly do not qualify, that still leaves a lot of characters that probably are valid. How can we find the exact list of valid characters?
Step Numero ௧ (that’s a “TAMIL DIGIT ONE” U+0BE7)
The Unicode website has an online research tool, so I started there. Looking through the list of available character properties to filter on, I didn’t see anything for “letter”, but I did find two that looked promising: “Alphabetic” (a boolean) and “Numeric_Type” (an enumeration which has “Decimal” as an option). Those two searches are shown in the following two links, each one indicating the total number of code points / characters returned:
[:Age=3.2:] & [:BMP=Yes:] & [:Alphabetic=Yes:] (46,153 Code Points)
[:Age=3.2:] & [:BMP=Yes:] & [:Numeric_Type=Decimal:] (189 Code Points)
Please note:
- “Age=3.2” includes all characters that were defined as of Unicode 3.2 (meaning it also includes characters added prior to version 3.2)
- “BMP=Yes” filters out supplementary characters
I created several variables using a variety of letters from the first search and a variety of numbers from the second search, and everything worked. I then tried a few letters from a search that was mostly the same as the first search shown above, but without the “& [:BMP=Yes:]” clause, so it also returned supplementary characters that existed in Unicode 3.2 and are classified as “alphabetic”. The supplementary characters, even though they were alphabetic, did not work.
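To give a concrete (and purely illustrative) idea of those spot checks, the snippet below works and fails along the lines just described. The specific characters are arbitrary examples, not necessarily the ones used in the test:

-- A BMP letter from another script works, both right after the @ of a variable name
-- and as the first character of a column name:
DECLARE @א INT;              -- U+05D0 HEBREW LETTER ALEF
DECLARE @Test๒ INT;          -- U+0E52 THAI DIGIT TWO as a "subsequent" character
DECLARE @T TABLE (א INT);    -- the same letter as a column name

-- A supplementary character that existed in Unicode 3.2 and is classified as
-- "Alphabetic" (e.g. U+10400 DESERET CAPITAL LETTER LONG I) is not accepted:
--DECLARE @𐐀 INT;            -- does not work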
Everything appears to be working as described in the documentation.
So are we done? Can we call it a day and go home?
Well, are these things ever that easy and/or straightforward?
Step Numero ๒ (that’s a “THAI DIGIT TWO” U+0E52)
Quite often these types of things are not that easy. Yes, it is very tempting to assume that the limited test is good enough and that we did indeed find the exact list of characters (plus we would need to add in the four extra characters: at sign (@), dollar sign ($), number sign (#), and underscore (_)). However, based on my experiences, it seems that more often than not, doing an exhaustive test results in a slightly different answer that invalidates the previous conclusion (which was based on the limited test). So, while it does take more time to do more extensive testing, it seems like we have little choice if we truly want to know how these things actually work.
What that means is, at the very least, we need to get the complete list of characters accepted by SQL Server for non-delimited identifiers to make sure that the totals match the number of code points returned by the searches done in Step 1.
Of course, getting this list of accepted characters is a bit tricky because:
- The only way to determine if a character is valid is to not get an error when attempting to use it. So we need to test all 65,535 characters using Dynamic SQL within a TRY...CATCH construct, and record the ones that work in the TRY block.
- Testing by declaring a variable would work for “subsequent” characters (of an identifier), but it won’t work for determining “first” characters since the first character of a variable name is always going to be the @ sign.
- When testing for “subsequent” characters, there needs to be a valid character that follows the test character, since whitespace characters won’t cause an error when placed at the end of an identifier (they won’t be seen as being part of the identifier).
With all of that in mind, I tested for “first” characters by declaring a table variable and using the test character for a column name, as shown below:
USE [UnicodeTesting];
SET NOCOUNT ON;

-- DROP TABLE dbo.FirstCharacter;
CREATE TABLE dbo.FirstCharacter
(
  [CodePoint] BINARY(2) NOT NULL
    CONSTRAINT [PK_FirstChar] PRIMARY KEY CLUSTERED WITH (FILLFACTOR = 100),
  [CodePointInt] AS (ISNULL(CONVERT(INT, [CodePoint]), -1)),
  [TheChar] AS (ISNULL(NCHAR([CodePoint]), N''))
);

DECLARE @Index INT = 1,
        @SQL NVARCHAR(MAX),
        @Counter INT = 0,
        @TheChar NCHAR(1),
        @CodePoint NVARCHAR(10);

WHILE (@Index < 65536)
BEGIN
  SET @TheChar = NCHAR(@Index);
  SET @SQL = N'DECLARE @CharTest TABLE (' + @TheChar + N' INT);';
  SET @CodePoint = CONVERT(NVARCHAR(10), CONVERT(BINARY(2), @Index), 1);

  BEGIN TRY
    -- PRINT @SQL;
    EXEC (@SQL);
    SET @Counter += 1;
    INSERT INTO dbo.FirstCharacter ([CodePoint]) VALUES (CONVERT(VARBINARY(2), @Index));
    RAISERROR(N'%5d: Code Point %s ( %s )', 10, 1, @Counter, @CodePoint, @TheChar) WITH NOWAIT;
  END TRY
  BEGIN CATCH
    SET @TheChar = N''; -- swallow (i.e. ignore) the error
  END CATCH;

  SET @Index += 1;
END;
GO
That ran for almost 4 minutes and captured 45,695 characters. Now, the search for “alphabetic” + “BMP” (i.e. non-supplementary) characters as of Unicode 3.2 returned 46,153 code points. And if we add in the two extra characters noted in the documentation as valid for the first character, # and _ (@ wouldn’t be returned by the T-SQL test since it isn’t a valid first character for anything but a variable name), that gives us a target value of 46,155 characters that the T-SQL test should have returned. Yet the T-SQL test found 460 fewer characters than expected (46,155 − 45,695 = 460). That’s not good, but we need to test the other half before drawing any conclusions.
Then I tested for valid “subsequent” characters by declaring a variable, placing the test character between two US English letters, as shown below:
-- DROP TABLE dbo.SubsequentCharacter;
SET NOCOUNT ON;

CREATE TABLE dbo.SubsequentCharacter
(
  [CodePoint] BINARY(2) NOT NULL
    CONSTRAINT [PK_SubsequentChar] PRIMARY KEY CLUSTERED WITH (FILLFACTOR = 100),
  [CodePointInt] AS (ISNULL(CONVERT(INT, [CodePoint]), -1)),
  [TheChar] AS (ISNULL(NCHAR([CodePoint]), N''))
);

DECLARE @Index INT = 1,
        @SQL NVARCHAR(MAX),
        @Counter INT = 0,
        @TheChar NCHAR(1),
        @CodePoint NVARCHAR(10);

WHILE (@Index < 65536)
BEGIN
  SET @TheChar = NCHAR(@Index);
  SET @SQL = N'DECLARE @Char' + @TheChar + N'Test INT;';
  SET @CodePoint = CONVERT(NVARCHAR(10), CONVERT(BINARY(2), @Index), 1);

  BEGIN TRY
    --PRINT @SQL;
    EXEC (@SQL);
    SET @Counter += 1;
    INSERT INTO dbo.SubsequentCharacter ([CodePoint]) VALUES (CONVERT(VARBINARY(2), @Index));
    RAISERROR(N'%5d: Code Point %s ( %s )', 10, 1, @Counter, @CodePoint, @TheChar) WITH NOWAIT;
  END TRY
  BEGIN CATCH
    SET @TheChar = N''; -- swallow (i.e. ignore) the error
  END CATCH;

  SET @Index += 1;
END;
GO
That ran for 40 – 50 seconds and captured 46,543 characters. The search for “numeric_type = decimal” + “BMP” (i.e. non-supplementary) characters as of Unicode 3.2 returned 189 code points. Combining that with the total for valid “first” characters (i.e. 46,155), plus the two extra characters @ and $, gives us a target value of 46,346 characters that the T-SQL test should have returned. Yet the T-SQL test found 197 more characters than expected (46,543 − 46,346 = 197). That’s also not good.
Clearly something is amiss. It could be the T-SQL tests, but I have gone over that process several times and it does indeed capture the correct characters. I have tested it on a SQL Server 2017 Developer instance with a Hebrew_100_BIN2 Collation, both SQL Server 2017 and 2012 Express LocalDB instances with a SQL_Latin1_General_CP1_CI_AS Collation, and a SQL Server 2017 Express instance with a Korean_100_CS_AS_KS_WS_SC Collation. The T-SQL results were identical in all cases.
So the problem is more likely to be either that:
- I used the incorrect categorization(s) in my searches on the Unicode website, or
- the categorization of certain characters has changed between Unicode versions 3.2 and 10.0, or
- both of the above
Step Numero ໓ (that’s a “LAO DIGIT THREE” U+0ED3)
In order to check the categorizations of the characters, we need to get the exact list of Unicode 3.2 characters. To get the original Unicode 3.2 data files, I went to https://www.unicode.org/Public/3.2-Update/ and grabbed the following two files:
- UnicodeData-3.2.0.txt (the main data)
- UnicodeData-3.2.0.html (describes the format of the UnicodeData.txt file)
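For reference, each line of the main data file describes one code point as 15 semicolon-delimited fields (many of which can be empty). The entry for U+0041, for example, looks roughly like this:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;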
I used Excel to import the data (the SSMS import wizard isn’t very flexible). Using Excel, I was able to first import the data file and split on the ; delimiters on the way in. Then I created a formula that uses CONCATENATE to format an INSERT INTO statement, along with some other manipulation such as prefixing the hex values with “0x” and specifying NULL for empty fields that should be INT.
- I specified “delimited”, to start on line 1, and that the file was encoded as “65001: Unicode (UTF-8)” because the newer documentation states that UTF-8 is the proper encoding of the data files. It mentions that “non-ASCII characters only appear in comments” though I don’t see any non-ASCII characters. So, it might be fine to accept the default encoding of “437 : OEM United States”.
- On the next step I specified that the delimiter is ;
- On the next / final step I specified that all fields are of type “text”.
Once it imported, I entered a formula along the lines of the sketch shown below in cell Q2. I started with row 2 because I had added a header row. I used column Q because the data is 15 columns, A through O, and I left column P empty to make it easier to see where the actual data ended:
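The full formula covered all 15 fields; a simplified approximation that handles just the code point, the character name, and the three numeric fields (the column names in the INSERT list here are placeholders, not necessarily the ones used in the actual script) looks like this:

=CONCATENATE("INSERT INTO [v3-2].[UnicodeCharacterDatabase] ([CodePoint], [CharacterName], [DecimalDigitValue], [DigitValue], [NumericValue]) VALUES (0x", A2, ", N'", B2, "', ", IF(G2="", "NULL", G2), ", ", IF(H2="", "NULL", H2), ", ", IF(I2="", "NULL", "N'" & I2 & "'"), ");")

The IF checks are what turn empty numeric fields into NULL, and the “0x” prefix turns the hexadecimal code point text into a binary literal.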
Then I dragged the fill handle at the bottom-right corner of cell Q2 down to cell Q11636 so that the formula would be applied to all of the rows. Then I simply copied and pasted column Q into a SQL script that already included statements to create the Schema and Table. That script (on Pastebin.com) is:
Unicode 3.2, Part 1: Unicode Character Database (UCD)
The script is 1.68 MB and 11,784 lines, so give it a moment to load.
With that data alone we can do some pretty interesting queries, such as:
SELECT *
FROM   [v3-2].UnicodeCharacterDatabase ucd
WHERE  ucd.[TheChar] = N'R' COLLATE Latin1_General_100_CI_AI;
-- 43 rows
That’s right, folks: there are 43 variations of the letter “R” (various accents, cases, directions, fonts, etc.).
However, that data alone is not enough for the properties that we are looking for.
Step Numero ൪ (that’s a “MALAYALAM DIGIT FOUR” U+0D6A)
I went back to https://www.unicode.org/Public/3.2-Update/ and grabbed the following two files:
- DerivedProperties-3.2.0.html (describes where to find, or how to determine, various properties)
- DerivedCoreProperties-3.2.0.txt (lists of code points per various properties)
The “DerivedProperties” HTML file tells us two important things:
- we can find the “alphabetic” property in the “DerivedCoreProperties.txt” file
- we can determine “numeric type = decimal” from the data that we already imported: just find rows where “DecimalDigitValue”, “DigitValue”, and “NumericValue” all have a value.
Just like with the main UnicodeData file in Step 3, I imported the “DerivedCoreProperties.txt” file into Excel. There are only two columns, so I placed a formula along the lines of the sketch below in cell D1 and applied it to all 4,216 lines.
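The exact formula isn’t reproduced here, but assuming the code point (or code point range) landed in column A and the property name in column B, something along these lines would produce one OR-able predicate per line:

=IF(OR(TRIM(A1)="", LEFT(TRIM(A1),1)="#"), "", IF(ISNUMBER(FIND("..", A1)), CONCATENATE("OR ucd.[CodePoint] BETWEEN 0x", LEFT(TRIM(A1), FIND("..", A1)-1), " AND 0x", MID(TRIM(A1), FIND("..", A1)+2, 6)), CONCATENATE("OR ucd.[CodePoint] = 0x", TRIM(A1))))

The generated lines could then be pasted under an UPDATE whose WHERE clause starts with 1 = 0, so that each OR line simply extends the predicate; that is just one possible shape for the output.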
This formula ignores empty lines and comment lines (i.e. lines starting with #), and then formats the remaining lines to match the specified code points. I then copied and pasted column D (just rows 165 – 573, as they are for the “Alphabetic” property and are not supplementary characters) into a script that added a BIT column for IsAlphabetic to the main UnicodeCharacterDatabase table that was created in Step 3, and did a simple UPDATE to set the matching code points to 1.
The script then adds a BIT column for IsDecimal (we don’t need to worry about the other numeric types, so a boolean will suffice) and does a simple UPDATE to set IsDecimal to 1 for rows where the “DecimalDigitValue”, “DigitValue”, and “NumericValue” columns are all NOT NULL.
That script (on Pastebin.com, and only 22 KB) is:
Unicode 3.2, Part 2: Derived Props (Alpha & Dec)
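In rough terms, that part of the script does something like the following (the A through Z range in the first UPDATE is just one example of the many ranges generated from the DerivedCoreProperties file):

ALTER TABLE [v3-2].[UnicodeCharacterDatabase]
  ADD [IsAlphabetic] BIT NULL,
      [IsDecimal]    BIT NULL;
GO

-- One range per "Alphabetic" line of DerivedCoreProperties-3.2.0.txt
-- (Basic Latin A-Z shown here as a single example):
UPDATE ucd
SET    ucd.[IsAlphabetic] = 1
FROM   [v3-2].[UnicodeCharacterDatabase] ucd
WHERE  ucd.[CodePoint] BETWEEN 0x0041 AND 0x005A;

-- "Numeric_Type = Decimal" means all three numeric fields are populated:
UPDATE ucd
SET    ucd.[IsDecimal] = 1
FROM   [v3-2].[UnicodeCharacterDatabase] ucd
WHERE  ucd.[DecimalDigitValue] IS NOT NULL
AND    ucd.[DigitValue]        IS NOT NULL
AND    ucd.[NumericValue]      IS NOT NULL;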
We now have all of the Unicode 3.2 characters loaded, and have the two additional properties that we were looking for. Woo hoo! We have finally gathered enough data to do some meaningful tests and comparisons.
Step Numero ౫ (that’s a “TELUGU DIGIT FIVE” U+0C6B)
The following two queries compare the T-SQL “first” character list to the “alphabetic” characters, and the T-SQL “subsequent” character list to both the “alphabetic” and “numeric_type = decimal” characters, to find the ones that are on only one side. Hopefully, seeing which characters aren’t matching up will point us in the right direction.
SELECT CASE WHEN fc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(fc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(fc.[TheChar], ucd.[TheChar]) AS [TheChar],
       fc.[CodePoint] AS [fc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.FirstCharacter fc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = fc.[CodePoint]
WHERE  ( -- find extra
         fc.[CodePoint] IS NOT NULL
         AND ucd.[IsAlphabetic] IS NULL
       )
OR     ( -- find missing
         fc.[CodePoint] IS NULL
         AND ucd.[IsAlphabetic] = 1
       )
ORDER BY [␦], COALESCE(fc.[CodePoint], ucd.[CodePoint]);
-- Off by 361: 3 extra (expected 2), and 358 missing (unexpected)
-- Extra: 0x0023 (#), 0x005F (_),
--        0xFF3F (_ ; FULLWIDTH LOW LINE ; unexpected)


SELECT CASE WHEN sc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(sc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(sc.[TheChar], ucd.[TheChar]) AS [TheChar],
       sc.[CodePoint] AS [sc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.SubsequentCharacter sc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = sc.[CodePoint]
WHERE  ( -- find extra
         sc.[CodePoint] IS NOT NULL
         AND ucd.[IsAlphabetic] IS NULL
         AND ucd.[IsDecimal] IS NULL
       )
OR     ( -- find missing
         sc.[CodePoint] IS NULL
         AND ( ucd.[IsAlphabetic] = 1 OR ucd.[IsDecimal] = 1 )
       )
ORDER BY [␦], COALESCE(sc.[CodePoint], ucd.[CodePoint]);
-- Off by 315: 295 extra (expected 4), and 20 missing (unexpected)
-- Expected extra: 0x0023 (#), 0x0024 ($), 0x0040 (@), 0x005F (_)
Um, ok. So now we have two problems: not only are the different sources (T-SQL vs Unicode 3.2 data) still not matching up, but now they are off by different amounts than they were based on the online searches. We should probably verify the number of characters in both the “Alphabetic” and “Decimal” categories to see how they match up to what we found online.
SELECT SUM(CONVERT(INT, ucd.[IsAlphabetic])) AS [Alphabetic],
       SUM(CONVERT(INT, ucd.[IsDecimal])) AS [Numeric_Type=Decimal]
FROM   [v3-2].UnicodeCharacterDatabase ucd;
-- Alphabetic    Numeric_Type=Decimal
-- 46050         218
Well, this confirms one of the possible explanations noted at the end of Step 2: properties can change across versions of Unicode. When Unicode version 3.2 was published, there were 46,050 characters classified as “Alphabetic”. As of version 10.0 of Unicode (the version used by the online tool), out of that same pool of characters that were available as of version 3.2, now 46,153 of them are classified as “Alphabetic”. That is a net increase of 103 characters in that classification (I say “net” because some characters may have been removed from the classification).
Similarly, when Unicode version 3.2 was published, there were 218 characters classified as “Numeric_Type=Decimal”. In Unicode version 10.0, out of that same pool of characters that were available as of version 3.2, now only 198 of them are classified as “Numeric_Type=Decimal”. That is a decrease of 20 characters in that classification.
We will take a closer look at various differences in classifications of the same character across versions of Unicode later. For now, it is clear that specifying [:Age=3.2:] in the online search is merely a filter to get the characters available as of that version. It does not imply that any other property being filtered on will use values as they were back in version 3.2; the property values being used are the current 10.0 values (or whatever version is indicated at the bottom of the “UnicodeSet” page). Meaning, the online search cannot be used for any historical research outside of simply seeing which characters were available as of a particular version of Unicode. All historical research related to behavior can only be done via the original data files.
Intermission
Please join us next time for the exciting conclusion (including a link to the complete list of valid characters)…