(last updated: 2019-01-10 @ 16:00 EST / 2019-01-10 @ 21:00 UTC )
(NOTE: For recent update, please see “Update for CTP 2.2” section)
For some time now, many of us have struggled with data that is mostly standard US-English / ASCII characters but also needs to allow for the occasional non-ASCII character. VARCHAR
is typically one byte per character but can’t represent a wide range of accented characters all at the same time (more than would be found on a single 8-bit Code Page). NVARCHAR
, being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes. When we have data that is most often standard ASCII, but has the potential (whether it ever happens or not) to have non-ASCII characters (names, URLs, etc), we have no choice but to use NVARCHAR
. However, then we are wasting space for all of the data that could fit into VARCHAR
and take up half as much space. While some claim that “disk is cheap“, wasting space negatively impacts query performance, backup size, backup and restore times, etc.
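For a quick sense of that difference, DATALENGTH shows the byte counts of the same 13-character string in each datatype:

SELECT DATALENGTH('International') AS [VarcharBytes],   -- 13 bytes (1 byte per character)
       DATALENGTH(N'International') AS [NvarcharBytes]; -- 26 bytes (2 bytes per character)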
One approach is to have two columns for the data, one for each datatype. The idea is to store values that are 100% ASCII in the VARCHAR
column, and anything else in the NVARCHAR
. Then you can have a computed column return the non-NULL column for that data. While this approach does work, it is not something you are going to implement in 50 or more columns.
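Here is a minimal sketch of that two-column approach (the table and column names below are made up purely for illustration):

CREATE TABLE dbo.Customer
(
    [CustomerID]   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    [Name_Ascii]   VARCHAR(100) NULL,  -- populated only when the value is 100% ASCII
    [Name_Unicode] NVARCHAR(100) NULL, -- populated only when the value contains non-ASCII characters
    -- computed column returns whichever of the two is non-NULL
    [Name] AS (COALESCE([Name_Unicode], CONVERT(NVARCHAR(100), [Name_Ascii])))
);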
So, thankfully, Microsoft provided a solution to this problem.
And it is…
(drum-roll please)…
SQLCLR (introduced in SQL Server 2005). This allows one to use .NET to access GZip/Deflate functionality such that you can compress data into VARBINARY(MAX)
on the way in, and uncompress back to NVARCHAR(MAX)
on the way out. And you can even get pre-done versions of these — Util_GZip and Util_GUnzip — in the Free version of SQL# :-). Of course, some folks don’t like SQLCLR (for some reason), and you need to decompress in order to view / search / filter on the value, and this might be a bit much to do on many columns (though not as bad as having both VARCHAR
and NVARCHAR
versions of the column).
So, thankfully, Microsoft provided a solution to this problem.
And it is…
Data Compression (introduced in SQL Server 2008). But this was a bit simplistic, only available in Enterprise Edition, and didn’t work for off-row values (i.e. row overflow and MAX values that can’t fit in-row).
So, thankfully, Microsoft provided a solution to this problem.
And it is…
Unicode Compression (introduced in SQL Server 2008 R2, and implemented as part of Data Compression). But this was still only available in Enterprise Edition, and still didn’t work for off-row values, and in-row NVARCHAR(MAX)
values only get the simplistic compression.
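For reference, Data Compression (and, with it, Unicode Compression for eligible in-row NCHAR / NVARCHAR values) is enabled with a single DDL statement; here is a sketch against a hypothetical table:

-- ROW (or PAGE) compression automatically applies Unicode Compression
-- to in-row NCHAR / NVARCHAR (non-MAX) values
ALTER TABLE dbo.Customer REBUILD PARTITION = ALL
    WITH (DATA_COMPRESSION = ROW); -- or PAGE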
So, thankfully, Microsoft provided a solution to this problem.
And it is…
Compression for all editions, not just Enterprise (as of SQL Server 2016, SP1). But, this still didn’t work for off-row values, and in-row NVARCHAR(MAX)
values only get the simplistic compression.
So, thankfully, Microsoft provided a solution to this problem.
And it is…
COMPRESS
and DECOMPRESS
built-in functions (introduced in SQL Server 2016). You can use these to Gzip and Ungzip values (without resorting to SQLCLR), and store the much smaller VARBINARY value. Do this in stored procedures to be transparent to the caller. But this is not for everyone, and this might be a bit much to do on many columns.
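A rough sketch of that pattern using the built-in functions (the table, procedure, and parameter names are hypothetical):

-- Store the GZipped bytes; callers only ever see NVARCHAR
CREATE TABLE dbo.Article ([ArticleID] INT NOT NULL PRIMARY KEY, [Body] VARBINARY(MAX) NULL);
GO
CREATE PROCEDURE dbo.Article_Save (@ArticleID INT, @Body NVARCHAR(MAX))
AS
    INSERT INTO dbo.Article ([ArticleID], [Body])
    VALUES (@ArticleID, COMPRESS(@Body)); -- GZip on the way in
GO
CREATE PROCEDURE dbo.Article_Get (@ArticleID INT)
AS
    SELECT [ArticleID],
           CAST(DECOMPRESS([Body]) AS NVARCHAR(MAX)) AS [Body] -- un-GZip on the way out
    FROM   dbo.Article
    WHERE  [ArticleID] = @ArticleID;
GO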
So, thankfully, Microsoft provided a solution to this problem.
And it is…
Support for MAX
types in Clustered Columnstore Indexes (as of SQL Server 2017). But, this is not for every scenario, especially because it affects the whole table.
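For reference, converting a table is a single statement (hypothetical table name, and assuming the table does not already have a clustered index):

-- As of SQL Server 2017, the table can include VARCHAR(MAX) / NVARCHAR(MAX) / VARBINARY(MAX) columns
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_LargeText] ON dbo.LargeText;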
So, thankfully, Microsoft provided a solution to this problem.
And it is…
UTF-8 Encodings (introduced in the upcoming SQL Server 2019). UTF-8 appears to be the ideal solution: standard ASCII characters take up only 1 byte, all Unicode characters are supported, the value is still a string (not binary) so it can be searched / filtered on / indexed / etc, and it is the preferred encoding for HTML and XML files. No need for an extra column plus a computed column, no need for SQLCLR or COMPRESS
, no binary values, and no getting frustrated that Unicode Compression doesn’t work with NVARCHAR(MAX)
. What’s not to love?
Let’s explore…
TL;DR: While interesting, the new UTF-8 Collations only truly solve a rather narrow problem, and are currently too buggy to use with confidence, especially as a database’s default Collation. These encodings really only make sense to use with NVARCHAR(MAX)
data, if the values are mostly ASCII characters, and especially if the values are stored off-row. Otherwise, you are better off using Data Compression, or possibly Clustered Columnstore Indexes. Unfortunately, this feature provides much more benefit to marketing than it does to users.
The problem is: why does this feature even exist? Why devote development resources to it? What problem(s) does it solve? It wasn’t needed for importing or exporting data since that was resolved two versions ago (in SQL Server 2016). And I certainly hope it wasn’t to save space (compared to NVARCHAR
/ UTF-16) since that isn’t what UTF-8 was designed for (or does). The reason why UTF-8 (as an encoding, irrespective of SQL Server) was created was to address compatibility (with existing ASCII-based systems) 1 , not efficiency (of space or speed) 2 . In fact, compatibility was achieved at the expense of efficiency, given that most characters take up more space in UTF-8 than they do in UTF-16. But, that was an acceptable trade-off for its intended use: allowing existing OSes (primarily Unix but soon also Windows) to be Unicode-capable while not breaking any existing functionality. UTF-8 sometimes saving space over UTF-16 (but only for 128 characters, mind you) is a side-effect, not a design goal. Which means that it’s good to know of the option, but it shouldn’t be used as a selling point for this feature (yet so far it’s the only thing mentioned regarding this feature).
What would have been a huge benefit to more users (better compression and not limited to only helping ASCII characters) is to get Unicode Compression working for NVARCHAR(MAX). I don’t know what technical issues are getting in the way of implementing Unicode Compression for NVARCHAR(MAX)
, but given how much will break (or will at least be clunky) in making UTF-8 work, I have a hard time believing that this far better alternative would have been any worse.
Basics
Documentation
Let’s start with what we are told about this new feature. According to the documentation, the new UTF-8 Collations:
- can be used …
  - as a database-level default Collation
  - as a column-level Collation
  - by appending “_UTF8” to the end of any Supplementary Character-Aware Collation (i.e. either having “_SC” in their name, OR being of level 140 or newer and not being binary)
- with only the CHAR and VARCHAR datatypes
- (implied) have no effect on NCHAR and NVARCHAR data (meaning: for these types, the UTF-8 Collations behave the same as their non-UTF-8 equivalents)
- “This feature may provide significant storage savings, depending on the character set in use.” (emphasis mine)
Installing
When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is still “SQL_Latin1_General_CP1_CI_AS
“. Why, why, WHY???? Why not use “Latin1_General_100_CI_AS_SC
“, which has the same sensitivities but is newer, has sort weights for more characters, supports Supplementary Characters, and is a Windows Collation (hence you don’t prevent seeks on indexed VARCHAR columns when compared to Unicode data)? Sure, changing the default Collation after all these years would result in some headaches when needing to work with existing systems (that didn’t bother to change the default Collation), but leaving it ensures that new projects started by folks who don’t know to change it will spread this outdated Collation farther and farther throughout the galaxy, tightening its stranglehold on us (i.e. making it even harder to move away from). But that has nothing to do with UTF-8, so we can ignore it for now ( <grrrrr> , but please vote for Make Latin1_General_(100)_CI_AS the default collation for US English).
On the “Customize Database Engine Collation” screen, when selecting a Windows Collation:
- Collations without a version number do not allow checking either the “Supplementary characters” or “Variation selector-sensitive” options.
- Collations with version numbers “90” or “100” allow for checking the “Supplementary characters” option, but not the “Variation selector-sensitive” option.
- Collations with version number “140” have the “Supplementary characters” option checked (and cannot be unchecked), and allow for checking the “Variation selector-sensitive” option.
- (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) There is no means of selecting a UTF-8 Collation, even if the “Supplementary characters” option is checked.
Using
By executing the queries below, we can confirm that adding _UTF8
to the end of Supplementary Character-Aware Collations does produce valid Collation names. I executed the following in [master]
(which is guaranteed to not be using a UTF-8 Collation as its default Collation since there was no way to install SQL Server with one):
-- Version 90 (SQL 2005; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Japanese_90_CS_AS_SC_UTF8;

-- Version 100 (SQL 2008; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Version 140 (SQL 2017; no _SC version since it is implicitly supported)
SELECT 'a' COLLATE Japanese_XJIS_140_CS_AS_UTF8;
All of the above queries returned successfully.
The following query will list all of the UTF-8 Collations:
SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
ORDER BY col.[name];
-- 1552 rows, all with CodePage = 65001
There are 5507 total Collations in SQL Server 2019, which is 1552 more than the 3955 that exist in SQL Server 2017. Hence the only new Collations are these “_UTF8
” Collations.
If you execute the query above, do you notice anything missing? That’s right: no binary options for the “_UTF8
” Collations (this is a mistake that needs to be addressed).
The following query shows that we can indeed use VARCHAR
to store Unicode characters, but only when using one of the new UTF-8 Collations:
SELECT NCHAR(27581),                                                                -- 殽
       CONVERT(VARCHAR(3), NCHAR(27581) COLLATE Latin1_General_100_CI_AS_SC),      -- ?
       CONVERT(VARCHAR(3), NCHAR(27581) COLLATE Latin1_General_100_CI_AS_SC_UTF8); -- 殽
The following query confirms that the new UTF-8 Collations cannot be used with the TEXT
datatype:
SELECT CONVERT(TEXT, N'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
returns:
Msg 4189, Level 16, State 0, Line XXXXX
Cannot convert to text/ntext or collate to 'Latin1_General_100_CI_AS_SC_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.
The following query shows that the NVARCHAR
value is the same between the UTF-8 and non-UTF-8 Collation, and the same character can be held in VARCHAR
, but only when using a UTF-8 Collation, and as a different underlying byte sequence:
DECLARE @Sample NVARCHAR(1) = NCHAR(0x5010);
SELECT @Sample; -- 倐

SELECT CONVERT(VARBINARY(2), @Sample COLLATE Latin1_General_100_CI_AS_SC),      -- 0x1050
       CONVERT(VARBINARY(2), @Sample COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 0x1050
       CONVERT(VARCHAR(10), @Sample COLLATE Latin1_General_100_CI_AS_SC),       -- ?
       CONVERT(VARCHAR(10), @Sample COLLATE Latin1_General_100_CI_AS_SC_UTF8),  -- 倐
       CONVERT(VARBINARY(2), CONVERT(VARCHAR(10), @Sample COLLATE Latin1_General_100_CI_AS_SC_UTF8)); -- 0xE580
And, the following demonstrates that a UTF-8 Collation can be used at the database level:
CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8; ALTER DATABASE [UTF8] SET RECOVERY SIMPLE; USE [UTF8];
The USE
statement is there because the remaining examples that follow in this post will be executed from within the “UTF8
” database.
General UTF-8 Info
UTF-16 (for context)
UTF-16, which is the Unicode encoding used for NVARCHAR
data, uses 16-bit (i.e. 2-byte) values, known as “code units“, to represent characters. These “code units” are used individually (a 2-byte code unit), or in pairs (2 two-byte code units) to get Supplementary Characters, to map to individual “code points“. UCS-2, which only maps the BMP characters (i.e. the first 65,536 code points), is a fixed-length encoding because it only deals with single code units. UTF-16, which uses the exact same code units as UCS-2, can use certain combinations of two of those code units (called Surrogate Pairs) to map the non-BMP characters (i.e. code points 65,536 and above, known as Supplementary Characters).
Because these values are two bytes, they are subject to “endianness”, which refers to the order in which the individual bytes that make up the value are stored internally (in memory, on disk, etc). The hardware determines the “endianness” of how multi-byte values (including INT
, SMALLINT
, BIGINT
) are stored, and the two possibilities are “Big Endian”, which is the given / natural ordering (i.e. value 0x1234
— where 0x12
is byte #1, and 0x34
is byte #2 — is stored as 0x1234
), or “Little Endian”, which refers to storing the bytes in reverse-order (i.e. value 0x1234
is stored as 0x3412
).
Unicode strings in Windows, “string” in .NET, and NVARCHAR
in SQL Server are all UTF-16 Little Endian.
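A quick way to see the Little Endian byte order for yourself (the code point for “a” is U+0061):

SELECT CONVERT(VARBINARY(2), N'a'); -- 0x6100 : the 2-byte code unit 0x0061 stored low-order byte first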
UTF-8
UTF-8 is another Unicode encoding which uses between one and four 8-bit (i.e. 1-byte) code units to represent characters. Even though it is Unicode, due to working 8-bits at a time, this encoding is considered VARCHAR
data. This is also a variable-length encoding because it requires anywhere from 1 to 4 code units to map to any given code point. Due to these code units being single bytes (even in combinations of 2, 3, or 4), there is no concept of “endianness”.
One main advantage of UTF-8 is that it’s fully compatible with any existing data that’s comprised entirely of 7-bit ASCII characters (values 0 – 127). This is by design, but not really a reason for SQL Server to use it for storing data internally. The other main advantage of UTF-8 is that it can take up half the space while still allowing for the full range of Unicode characters. This would be a reason for SQL Server to use it internally. BUT, it only takes up half the space for a small subset of characters. Of course, if the characters in that small subset (the only ones that take 1 byte) make up the vast majority of the characters being used, then this encoding does provide for significant space savings (in terms of disk space, memory, and reduced network traffic). What people tend to overlook, though, is that most Unicode characters require 3 bytes, which is 1 more byte than UTF-16 for those same characters.
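As a quick spot-check of those 1-, 2-, and 3-byte ranges (using the same CONVERT-with-COLLATE approach shown earlier):

SELECT DATALENGTH(CONVERT(VARCHAR(10), NCHAR(0x0061) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [OneByte],    -- "a" (U+0061) = 1 byte
       DATALENGTH(CONVERT(VARCHAR(10), NCHAR(0x00E9) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [TwoBytes],   -- "é" (U+00E9) = 2 bytes
       DATALENGTH(CONVERT(VARCHAR(10), NCHAR(0x20AC) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [ThreeBytes], -- "€" (U+20AC) = 3 bytes
       DATALENGTH(NCHAR(0x20AC)) AS [UTF16_Euro];                                                                -- same "€" in UTF-16 = 2 bytes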
You can execute the following query in SQL Server 2019 to see how many bytes each character requires for both UTF-8 and UTF-16 encodings. It returns all 65,536 BMP (Basic Multilingual Plane) characters (which is also the entire UCS-2 character set), and 3 Supplementary Characters. Since all Supplementary Characters are 4 bytes in both encodings, there is no need to return more of them, but we do need to see a few of them to see that they are a) all 4 bytes, and b) encoded slightly differently. The query will work the same regardless of whether or not the current database uses a “_UTF8
” Collation. It also does not matter if the database’s default Collation is Supplementary Character-Aware or not.
;WITH nums([CodePoint]) AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1)
  FROM [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
), chars AS
(
  SELECT nums.[CodePoint],
         CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint]) COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint]) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM nums
  UNION ALL
  SELECT tmp.val,
         CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM (VALUES (65536,  0x00D800DC), -- Linear B Syllable B008 A (U+10000)
               (67618,  0x02D822DC), -- Cypriot Syllable Pu (U+10822)
               (129384, 0x3ED868DD)  -- Pretzel (U+1F968)
       ) tmp(val, hex)
)
SELECT chr.[CodePoint],
       COALESCE(chr.[TheChar], N'TOTALS:') AS [Character],
       chr.[UTF8] AS [UTF8_Hex],
       DATALENGTH(chr.[UTF8]) AS [UTF8_Bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 1 THEN 'x' END) AS [1-byte],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 2 THEN 'x' END) AS [2-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 3 THEN 'x' END) AS [3-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 4 THEN 'x' END) AS [4-bytes],
       ---
       CONVERT(VARBINARY(4), CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16(LE)_Hex],
       DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16_Bytes],
       ---
       ((DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar]))) - (DATALENGTH(chr.[TheChar]))) AS [UTF8savingsOverUTF16]
FROM chars chr
GROUP BY ROLLUP ((chr.[CodePoint], chr.[TheChar], chr.[UTF8]));
There isn’t enough room here to show all 65,539 rows, so here is a summary of the info:
Code Points | Code Point Quantity | UTF-8 Size (bytes) | UTF-16 Size (bytes) | UTF-8 savings |
---|---|---|---|---|
0 – 127 | 128 | 1 | 2 | 1 |
128 – 2047 | 1920 | 2 | 2 | 0 |
2048 – 65535 | 63488 | 3 | 2 | -1 |
65536 – | lots | 4 | 4 | 0 |
As you can see, only the first 128 Code Points offer any size savings over UTF-16. The characters in that range are (minus the first 32 which are control characters, and the last character):
[space] | ! | " | # | $ | % | & | ' | ( |
) | * | + | , | – | . | / | 0 – 9 | : |
; | < | = | > | ? | @ | A – Z | [ | \ |
] | ^ | _ | ` | a – z | { | | | } | ~ |
Those characters, and their values (32 – 126), are exactly the same as the standard ASCII character set, and are the same across most code pages (certainly all code pages supported by SQL Server). Any characters not in the chart above do not offer any space savings, with most of the BMP characters actually taking up more space than they would if using UTF-16 (i.e. NVARCHAR
).
Setting and Storing UTF-8 Values
An encoding is how the characters are represented internally. Whether the source is a non-Unicode string, a Unicode string, or a binary string (i.e. series of bytes / hex bytes), the end result will be the chosen encoding’s value for that character.
First, let’s try setting a variable, in the “[UTF8]
” database, from a VARBINARY
literal:
DECLARE @HotDog VARCHAR(4) = 0xF09F8CAD; PRINT @HotDog; -- ( U+1F32D )
Returns: 🌭
(in the “Messages” tab)
As you can see, it produces the expected character from the given 4 bytes. However, those same 4 bytes do not produce the correct character when stored as NVARCHAR
, which expects UTF-16 Little Endian byte sequences:
DECLARE @NotHotDog NVARCHAR(4) = 0xF09F8CAD; PRINT @NotHotDog; -- ( U+1F32D ) -- 鿰권
Now let’s look at how the size of the (N)VARCHAR
types impact what can be stored in them:
SELECT CONVERT(VARCHAR(3), 0xF09F8CAD) AS [TooSmallForUTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(3), 0xF09F8CAD)) AS [ByteCountForTooSmall],
       CONVERT(VARCHAR(4), 0xF09F8CAD) AS [UTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(4), 0xF09F8CAD)) AS [ByteCountForHotDog],
       ------
       CONVERT(NVARCHAR(1), 0x3CD82DDF) AS [TooSmallForUTF16-HotDog],
       CONVERT(NVARCHAR(2), 0x3CD82DDF) AS [UTF16-HotDog];
Returns:
TooSmallForUTF8-HotDog | ByteCountForTooSmall | UTF8-HotDog | ByteCountForHotDog | TooSmallForUTF16-HotDog | UTF16-HotDog |
---|---|---|---|---|---|
 | 0 | 🌭 | 4 | � | 🌭 |
The first column appears to return nothing, not even a partial, incorrect / unknown character. The second column confirms that there is absolutely nothing resulting from attempting to store a 4-byte UTF-8 value in a VARCHAR
type that is only 3 bytes. It couldn’t even take just the first 3 bytes of the 4-byte value. The third and fourth columns show that giving the type the required 4 bytes allows everything to work as expected.
On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR
type that is too small, the result is the default Replacement Character ( U+FFFD ). This is because it was able to store the first half of the Surrogate Pair (the 0x3CD8
), which is meaningless on its own, but it is still a valid byte sequence.
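We can confirm that the lone high surrogate really is what got stored by converting the truncated value back to binary:

SELECT CONVERT(VARBINARY(2), CONVERT(NVARCHAR(1), 0x3CD82DDF)); -- 0x3CD8 : only the high surrogate survived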
Size and Performance
{ This section is taking a while as I figure out how to best present the variety of tests and their results, but the general info is all there. }
Data Profiles
- Some ASCII (values 1 – 254), but mostly non-ASCII
  - [N]VARCHAR(500)
  - 10,539 rows
  - Characters per row: MIN = 1; AVG = 14; MAX = 30
- Some ASCII MAX
  - [N]VARCHAR(MAX)
  - 10,539 rows
  - Characters per row: MIN = 1; AVG = 14; MAX = 30
- All ASCII
  - [N]VARCHAR(500)
  - 105,390 rows
  - Characters per row: MIN = 6; AVG = 24; MAX = 60
- All ASCII MAX (all in row)
  - [N]VARCHAR(MAX)
  - 105,390 rows
  - Characters per row: MIN = 6; AVG = 24; MAX = 60
- All ASCII MAX (10k characters per row; all off-row)
  - [N]VARCHAR(MAX)
  - 10,539 rows
  - Characters per row: MIN = 10,007; AVG = 10,025; MAX = 10,061
- All ASCII MAX (5k characters per row; mixed in / off-row; UTF-8 = in-row, NVARCHAR = off-row)
  - [N]VARCHAR(MAX)
  - 21,078 rows
  - Characters per row: MIN = 5,007; AVG = 5,025; MAX = 5,061
Tests:
These were executed in a database which has a “_UTF8
” default Collation.
SELECT REVERSE([Value]) FROM dbo.[{table}];
SELECT SUBSTRING([Value], 5, 13) FROM dbo.[{table}];
SELECT [Value] FROM dbo.[{table}] WHERE [Value] = 'abcdef12345'; -- UTF16 test uses N'abcdef12345'
SELECT [Value] FROM dbo.[{table}] WHERE [Value] LIKE '%123%';    -- UTF16 test uses N'%123%'
SELECT MIN(LEN([Value])), AVG(LEN([Value])), MAX(LEN([Value])) FROM dbo.[{table}];
SELECT [Value] FROM dbo.[{table}];
Each test was wrapped in:
SET STATISTICS TIME ON; ...{test}... SET STATISTICS TIME OFF;
Results
- Some ASCII in non-MAX types
  - NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated better as well, but not tested.
  - Did not test performance given no space savings
- Some ASCII in MAX types (in-row)
  - NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated the same, but not tested.
  - Did not test performance given no space savings
- All ASCII in non-MAX types
  - UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR)
  - UTF-8 is only slightly smaller here when Compression = ROW or PAGE (approx. 1% smaller)
  - NVARCHAR is slightly smaller here when table is COLUMNSTORE (approx. 1% smaller)
  - Performance of NVARCHAR is same or better than UTF-8
- All ASCII in MAX types (in-row)
  - UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR; 40% for ROW and PAGE)
  - UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 9% smaller)
  - Performance is sometimes the same, but NVARCHAR is usually better than UTF-8
- All ASCII in MAX types (10k; off-row)
  - UTF-8 is smaller here when Compression = NONE (approx. 46% smaller)
  - ROW and PAGE compression do not apply to off-row data
  - UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 1.3% smaller). The NVARCHAR table is still 98% smaller than it was originally.
  - With no compression, NVARCHAR is faster for all tests.
  - When table is CLUSTERED COLUMNSTORE, NVARCHAR is faster for all tests except “SUBSTRING” and “=”.
- All ASCII in MAX types (5k; mixed in / off-row: VARCHAR = on, NVARCHAR = off)
  - UTF-8 is smaller here when Compression = NONE, ROW, or PAGE (approx. 25% smaller)
  - UTF-8 is much smaller here when table is CLUSTERED COLUMNSTORE (approx. 92% smaller). BUT, this is typically a non-issue since the NVARCHAR table is still 96% smaller than it was originally. This becomes an issue as the uncompressed data gets closer to 100 GB.
  - With no compression, NVARCHAR is faster for all tests except “LIKE ‘%x%'”.
  - When table is CLUSTERED COLUMNSTORE, UTF-8 is faster for all tests except REVERSE (elapsed time was the same, but CPU was better)
Typically there is at least a slight performance hit when using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know if it is a complete character or if the next byte is a part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are comprised of two 2-byte code units), so everything can be read in 2-byte chunks.
The area of biggest improvement of UTF-8 over NVARCHAR
is when there is between 4001 and 8000 bytes of mostly standard ASCII characters. In this scenario, you would need to use NVARCHAR(MAX)
to get over 4000 bytes, but then you lose the ability to use Unicode Compression. And, being over 4000 bytes, the NVARCHAR
data must be stored off-row (a slight performance hit), while that same data can be stored in-row when using UTF-8. This is the only scenario where UTF-8 is better for both space and performance.
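A rough sketch of how one might measure that particular scenario (the table names are made up; each row holds roughly 5,000 ASCII characters, which is about 10,000 bytes as UTF-16 and therefore off-row, but only about 5,000 bytes as UTF-8 and able to stay in-row):

CREATE TABLE dbo.Test_UTF16 ([Value] NVARCHAR(MAX) NOT NULL);
CREATE TABLE dbo.Test_UTF8  ([Value] VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL);

INSERT INTO dbo.Test_UTF16 ([Value])
  SELECT TOP (10000) REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 5000)
  FROM   master.sys.all_columns ac1 CROSS JOIN master.sys.all_columns ac2;

INSERT INTO dbo.Test_UTF8 ([Value])
  SELECT TOP (10000) REPLICATE(CONVERT(VARCHAR(MAX), 'a'), 5000)
  FROM   master.sys.all_columns ac1 CROSS JOIN master.sys.all_columns ac2;

EXEC sp_spaceused N'dbo.Test_UTF16'; -- expect LOB / row-overflow pages (values too large to stay in-row)
EXEC sp_spaceused N'dbo.Test_UTF8';  -- expect roughly half the data size, stored in-row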
Limitations
Known (i.e. Stated in Documentation)
The UTF8 Collations do not work with:
- (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.2” section) Replication
- Linked Servers
- In-memory OLTP (NOTE: For clarification based on recent testing, please see “Update for CTP 2.1” section)
- External Table for Polybase ( Two collation problems with external tables in SQL 2019 )
- (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations cannot be used as the Instance-level Collation (** please see next section for the “EXCEPT, …”):
- Currently the wording is that they can only be used at the database and column level, but yes, they can also be used inline via the COLLATE keyword.
- Not an option via the installer (as mentioned at the top of this post)
- Cannot be specified via:
SETUP.EXE /QUIET /ACTION=REBUILDDATABASE ^
/SQLCOLLATION=Latin1_General_100_CI_AS_SC_UTF8 ^
...
The result is:
The following error occurred:
The collation Latin1_General_100_CI_AS_SC_UTF8 was not found.
Error result: -2061893630
Result facility code: 1306
Result error code: 2
Unknown (i.e. Oops)
- (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations can be used as the Instance-level Collation. This cannot be set via the installer or by running SETUP.EXE with the /SQLCOLLATION switch, BUT they can be used via the undocumented method (of changing all Collations for an Instance):

  sqlservr -c -m -T4022 -T3659 -s"{instance_name}" ^
      -q"Latin1_General_100_CI_AS_SC_UTF8"

  Yes, this does work. I have tested it. For more details on this method, please see:
  Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
  PLEASE NOTE: Just because you can do this doesn’t mean that you should do this. Given that the only way of setting an Instance to use a UTF8 Collation is via an undocumented method, it is quite possible that this has never been tested internally at Microsoft and is not guaranteed to work. It might work perfectly, or there could be unexpected behaviors. Use at your own risk.
- There are no binary Collations that support the UTF8 encoding!!! Why not? And why do we need them? Here are some reasons:
  - Microsoft’s implementation of Unicode has incomplete, and some incorrect, sort weights. One example: Choosing a binary collation that can differentiate between ‘ss’ and ‘ß’ for nvarchar column in Sql Server
  - When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc; a quick demonstration follows the bug list below). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single-character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from this post provides a working example: No, Binary Collations are not Case-Sensitive
  - When storing alphanumeric codes (e.g. airline confirmation codes, phone numbers, postal codes, etc), binary sorting and comparison is much faster since it ignores all linguistic rules, which is perfect for scenarios such as these where there is no expectation of applying such rules.
  - Finding / removing the “(null)” character U+0000 / CHAR(0)
- Bugs:
- NVARCHAR to VARCHAR(MAX) can get Msg 8152 “String or binary data would be truncated” — FIXED IN CTP 2.1
- Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value — FIXED IN CTP 2.1
- Inconsistent handling of invalid UTF-8 byte sequences
- CHAR() function broken in databases with “_UTF8” default Collation (not a bug after all; just needs better documentation)
- “An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2” (disconnect) — caused by NULL UTF-8 values — FIXED IN CTP 2.1
- Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR
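As promised above, here is a quick demonstration of why binary Collations are needed when checking that two values are 100% identical: a precomposed “é” (U+00E9) compared against “e” plus a combining acute accent (U+0301). A culture-aware Collation treats them as equal, while a binary Collation does not (and there is currently no binary option that could do this for UTF-8 data):

DECLARE @PreComposed NVARCHAR(2) = NCHAR(0x00E9),        -- é as a single code point
        @Combining   NVARCHAR(2) = N'e' + NCHAR(0x0301); -- e + combining acute accent

SELECT CASE WHEN @PreComposed = @Combining COLLATE Latin1_General_100_CI_AS_SC
            THEN 'equal' ELSE 'different' END AS [Linguistic], -- equal
       CASE WHEN @PreComposed = @Combining COLLATE Latin1_General_100_BIN2
            THEN 'equal' ELSE 'different' END AS [Binary];     -- different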
Recommended Uses / Guidance
The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a “_UTF8
” encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:
- Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page)
- Column is currently (or otherwise would be)
NVARCHAR(MAX)
(meaning, data won’t fit intoNVARCHAR(4000)
) - There is a lot of data for this column or set of columns (1 GB or more when stored in
NVARCHAR
) - Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes
- There is no desire to make the column
VARBINARY(MAX)
, useCOMPRESS()
forINSERT
andUPDATE
operations, and useDECOMPRESS()
forSELECT
queries (no need to worry about lack of ability to index theVARBINARY
value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of “=”) or manipulated. - The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.
Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.
Conclusion
False Prophet! Adding support for UTF-8 is certainly interesting. However, being the preferred encoding for the web, and even the default encoding for Linux / Unix (at least some flavors), doesn’t imply that UTF-8 is truly useful within SQL Server. UTF-8 is a great choice for scenarios where there is:
- a need to be fully compatible with standard ASCII (to work with legacy systems / code) and / or a desire to at least save space when most data is standard ASCII anyway,
- a need / desire to support all characters (especially without needing to change Code Pages), and
- no ability to use Unicode Compression.
So, within the context of SQL Server, UTF-8 has the potential for helping a very specific, but not unimportant, use case: NVARCHAR(MAX)
columns, where Unicode Compression doesn’t work. Still, I think the development / testing time spent on this would have been much better applied to something more useful to more users, such as the following (the first two of which would do a better job of accomplishing the space-saving goal of UTF-8, and without sacrificing performance):
- getting Unicode Compression (part of Data Compression) to work on at least in-row values of NVARCHAR(MAX), if not also off-row values (please vote for: “unicode compression nvarchar(max)” — this would save more space than UTF-8)
- getting Data Compression to work for off-row LOB / Row-Overflow values (please vote for: “Page Compression for Out-of-Row LOB Storage” — this would save space for more than just Unicode strings: it would also apply to VARCHAR, VARBINARY, and XML)
- getting mixed alphanumeric values to sort as expected (i.e. “natural sorting” — “21a” sorts after “3a” instead of before — please vote for: “Support "natural sorting" / DIGITSASNUMBERS as a Collation option“)
Given that it is not fun to change Collations for databases, I would strongly recommend against using the new UTF-8 Collations, at least not at the database level, and definitely not in Production. With the obviously missing binary versions, and 5 bugs found in only a couple of days of testing for this post (not testing a full, complex application), this feature is incomplete, and there might yet be more bugs / odd behavior. I would suggest, if it looks like it will take quite a bit of development time to fix these six issues, then perhaps just remove this feature entirely (no big loss as it provides very little, if any, benefit) and spend that time on one of the three items noted directly above (think of how many customers / projects would benefit from Data Compression working for off-row LOB / overflow data).
If you are going to use a UTF-8 Collation on one or more columns, make no assumptions about size reduction or better performance! Be sure to capture current size and performance stats while the column is NVARCHAR
, and then compare both after changing to UTF-8.
Resources
- UTF-8 Support in SQL Server 2019
- Data Compression
- Columnstore Indexes
- Built-in GZIP functions
- General Info on Unicode encodings
UPDATE 2018-11-27 / CTP 2.1
CTP 2.1 was released in early November (updated “What’s new” info), and includes:
- a new UTF-8-related capability (i.e. the ability to select a UTF-8 Collation as the Instance-level Collation in the installer)
- 3 bugs fixed (noted with “FIXED IN CTP 2.1” in the bug list above, under “Limitations” > “Unknown”)
- a new bug: Install fails with “Could not find the Database Engine startup handle.” if a Unicode-only Collation is selected
- newly discovered bug (probably existed in CTP 2.0 as well, but I only found it after installing CTP 2.1): Collation Precedence allows for silent truncation (data loss) implicitly converting Extended ASCII characters into UTF8
Quirks
Here are some issues that you might run into when using a UTF-8 Collation at either the Instance level or Database level. These are most likely not bugs since they have been deprecated for a while now. Still, it is possible that you have legacy systems and/or 3rd party applications (some that you cannot change) using these features.
- TEXT / NTEXT datatypes in temporary tables (when Instance-level Collation is “_UTF8”) and table variables (when Database-level Collation is “_UTF8”):

  CREATE TABLE [#D'oh] (col1 NTEXT);
  /*
  Msg 4188, Level 16, State 1, Line XXXXX
  Column or parameter 'col1' has type 'ntext' and collation 'Japanese_XJIS_140_CI_AI_VSS_UTF8'. The legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.
  */

  Of course, there is an easy work-around to these errors: for columns using either of these datatypes, use the COLLATE keyword to specify a non-UTF8 Collation (e.g. “COLLATE Latin1_General_100_CI_AS“).
- sys.fn_trace_gettable: The “TextData” column of the result set uses the NTEXT datatype. This should only be an issue when using a UTF-8 Collation at the Instance level.
- sys.fn_get_sql: The “text” column of the result set uses the TEXT datatype (which was wrong even before this function was deprecated since the datatype should have been NTEXT 😲 ):

  SELECT * FROM sys.fn_get_sql(0x03);
  /*
  Msg 4189, Level 16, State 0, Procedure fn_get_sql, Line XXXXX [Batch Start Line YYYYY]
  Cannot convert to text/ntext or collate to 'Japanese_XJIS_140_CI_AI_VSS_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.
  Msg 4413, Level 16, State 1, Line XXXXX
  Could not use view or function 'fn_get_sql' because of binding errors.
  */

  This should only be an issue when using a UTF-8 Collation at the Instance level.
You can use the following query to check your system to see if there are any columns using either the TEXT
or NTEXT
datatypes. This does not check stored procedure / trigger / function / SQL Agent T-SQL job step content to see if these datatypes are being used in temporary tables, table variables, or permanent tables being created or altered, nor does it check T-SQL code submitted by anything outside of SQL Server (such as your app code). So be sure to test, test, test (which you were going to do anyway, right?).
Now, to be fair, these same quirks exist when using any Supplementary Character-Aware Collation (i.e. one with either “_SC
” or “_140_
” in its name) at the Instance or Database level.
Please note that the following query only reports objects for the database in which it is being executed (also, the two deprecated functions mentioned above will show up in all databases as they are global).
SELECT OBJECT_SCHEMA_NAME(ac.[object_id]) AS [SchemaName],
       ao.[name] AS [ObjectName],
       ao.[type_desc] AS [ObjectType],
       ac.[name] AS [ColumnName],
       ac.[system_type_id],
       ac.[collation_name]
FROM   sys.all_columns ac
INNER JOIN sys.all_objects ao
        ON ao.[object_id] = ac.[object_id]
WHERE  ac.[system_type_id] IN (35, 99); -- 35 = TEXT ; 99 = NTEXT
In-memory OLTP
The documentation states that the “_UTF8” Collations do not work with In-memory OLTP. We can easily confirm that via the following tests.
This:
CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
) WITH (MEMORY_OPTIMIZED = ON);
and this:
CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC -- NOT UTF8
        NOT NULL
) WITH (MEMORY_OPTIMIZED = ON);
-- Success

ALTER TABLE dbo.[UTF8]
    ALTER COLUMN [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL;
both get the following error:
Msg 12356, Level 16, State 157, Line XXXXX
Comparison, sorting, and manipulation of character strings that use a UTF8 collation is not supported with memory optimized tables.
And this statement (same as the first CREATE
statement directly above, but including an INDEX
on the [Name]
column):
CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
        INDEX [IX_UTF8_Name] NONCLUSTERED
) WITH (MEMORY_OPTIMIZED = ON);
gets the following error:
Msg 12357, Level 16, State 158, Line XXXXX
Indexes on character columns that use a UTF8 collation are not supported with indexes on memory optimized tables.
And, attempting to change a Database’s default Collation to be a “_UTF8
” Collation will error:
ALTER DATABASE [Hekaton] COLLATE Latin1_General_100_CI_AS_SC_UTF8;
receives:
Msg 41335, Level 16, State 3, Line XXXXX
Modifying the collation of a database is not allowed when the database contains memory optimized tables or natively compiled modules.
(that error message is not entirely accurate since you can change the Collation to a non-UTF-8 Collation, though you still might need to drop natively compiled modules first.)
HOWEVER, using the “sqlservr -q
” method of changing all Collations for an Instance allows us to force a UTF-8 Collation on a Database containing memory-optimized tables, and even on columns in those memory-optimized tables:
sqlservr.exe -c -m -T4022 -T3659 -s"{instance_name}" ^
    -q"Latin1_General_100_CI_AS_SC_UTF8"
After this (unsupported) change, the Database generally works, but there are a few definite issues:
- Data corruption in VARCHAR columns for characters with values 128 and above (i.e. outside of the standard ASCII set that ends at value 127). This affects “durable” memory-optimized tables (i.e. DURABILITY = SCHEMA_AND_DATA). For example, if the original Collation is Korean_100_CI_AS, then the byte sequence of “0xB15A” will produce “켣”. If you then use this method to switch to any “_UTF8” Collation, you will end up seeing “�Z” because “0xB1” is an invalid byte sequence in UTF-8, and “0x5A” is valid and is the “Z” character.
- Indexes on memory-optimized tables will not be dropped and recreated, and so in many cases they will be out of order (i.e. in the wrong order). For example, switching from Latin1_General_100_CI_AS_SC to Latin1_General_100_CI_AS_SC_UTF8 (only difference is adding “_UTF8” to the end of the original Collation) would cause indexes to become out of order, but only if they have key columns that are VARCHAR with data containing characters of values 128 or higher. (INCLUDE columns that are VARCHAR will have corrupted data for characters of values 128 or higher, but this does not affect the order.)
- Related to out-of-order indexes: a simple SELECT works, even when adding in ORDER BY or GROUP BY. Adding a WHERE clause will also work, as long as no new rows have been added. Once a new row has been added, using a WHERE clause referencing an indexed string column will result in the following error:

  Msg 9100, Level 21, State 2, Line XXXX
  Possible index corruption detected. Run DBCC CHECKDB.

  And you will be disconnected. OR, after adding a row, you might be able to successfully use a WHERE clause, but the first time you use the ORDER BY clause you will get the following error:

  Msg 701, Level 17, State 153, Line XXXXX
  There is insufficient system memory in resource pool 'default' to run this query.

  After that, you will probably get the “Possible index corruption detected” error, even when using just the WHERE clause.
Keep in mind that indexes on memory-optimized tables cannot be rebuilt or even dropped. Fixing these issues requires dropping, recreating, and reloading any affected tables. Again, this (using UTF-8 Collations with memory-optimized tables) is not a supported configuration.
UPDATE 2018-12-26 / CTP 2.2
CTP 2.2 was released in early December (updated “What’s new” info), and includes:
- a new UTF-8-related capability (i.e. the ability to use UTF-8 Collations with Replication)
- 0 reported bugs fixed
- 2 new bugs found (or at least reported, though they were likely present prior to this CTP) by Erland Sommarskog:
- 6 definite bugs, and 1 potential bug, still to fix (to be fair, I have been told that they are working on a binary UTF-8 collation, so that is at least something)
Current Assessment
As of CTP 2.2, my overall assessment of this new feature has not changed: native UTF-8 support is a waste of time and effort / wasted opportunity considering:
- this really only benefits a small subset of use cases / customers
- this is a storage format; operations requiring linguistic rules and/or normalization are still handled in UTF-16, which requires converting in memory each time
- operations requiring in-memory conversion to UTF-16 are slower (generally) than using NVARCHAR
- there is a very large potential here for customers to hurt their systems by misunderstanding the appropriate uses and drawbacks of UTF-8, and applying it to data that will end up taking more space and/or suffering an unnecessary performance hit.
- expanding data compression to work with
NVARCHAR(MAX)
(ideally including off-row data, but even if only working with in-row data) would:
  - apply to many more use cases
  - do better data compression than the best-case for UTF-8
  - not have the performance hit
  - not require any re-coding of apps
  - not introduce so many bugs
  - not have the potential for customers to accidentally bloat their systems or degrade performance without even getting any space savings
Yet, even if they remove the new UTF-8 feature (which they should), they now don’t have any time left to implement the appropriate feature (especially considering that removing this feature will take development and testing time). That is why this is such an unfortunate wasted opportunity. 😿
- According to Unicode Technical Note #14: A SURVEY OF UNICODE COMPRESSION (the linked PDF in this link):
Because of the 1-byte representation of ASCII, many English-speaking users have come to regard UTF-8 as a compression format. This notion is puzzling to speakers of French and other Latin-script languages that use many non-ASCII characters, and ridiculous to users of almost every other script on Earth, for which UTF-8 uses at least one byte per character more than legacy encodings. Although UTF-8 is indeed “not extravagant” in size, there is a big difference between avoiding wastefulness and achieving compression. The primary reason to use UTF-8 should be to maintain ASCII transparency, not to achieve compression.
↩ - According to Unicode Technical Note #12: UTF-16 FOR PROCESSING, Section 3: “UTF-8: Good for Compatibility”:
UTF-8 was mainly designed to store Unicode filenames in an ASCII-friendly way. It is suitable for processing, but it is significantly more complex to process than UTF-16……UTF-8 stores Latin text in a compact form compared to UTF-16 but does not provide any advantages for other scripts or even uses more space. There are other Unicode charsets (for example, SCSU and BOCU-1) that are more efficient for storage and data exchange.
Please note that “SCSU” is the “Standard Compression Scheme for Unicode”, and “BOCU-1” is the “MIME-compatible application of the Binary Ordered Compression for Unicode (BOCU) algorithm”.