(last updated: 2019-03-18 @ 14:55 EST / 2019-03-18 @ 18:55 UTC )
IntelliSense in SQL Server Management Studio (SSMS) can be quite helpful. It can save time typing by presenting possible object or function names. And, it can even show you the signatures (i.e. input and output parameters / return types) of stored procedures and functions.
Another thing that IntelliSense does, even though it was certainly not intentional, is show you the names of (some) undocumented functions. Many of these undocumented functions won’t let you execute them, so not much to investigate there1 . But, several of these undocumented built-in functions can be executed. One of them is UNCOMPRESS
. This function is not to be confused with DECOMPRESS, the companion to COMPRESS, which are actually GUnzip and GZip, respectively (and were introduced in SQL Server 2016, if you haven’t seen them before).
No, UNCOMPRESS
does something entirely different, but the only documentation is the function signature that appears in SSMS:
built-in function UNCOMPRESS(Param1 varbinary(1)) RETURNS nvarchar(1)
I did a bit of searching around, and all I could find were a couple of references to using it on the ctext
column in sys.syscomments
, but only if the status
column (a bit-masked value) had the “2” bit set (i.e. status & 2 = 2
). The first two entries in the “Mentions of UNCOMPRESS()” section at the end of this post are books that contain this same info. With only that one clue to go on, I used the following query to find some data that was meant to be passed into this function:
SELECT * FROM sys.syscomments WHERE [compressed] = 1 OR [status] & 2 = 2;
I tried in [master]
, [msdb]
, and even [MSSQLSystemResource]
(in single-user mode) but no rows were ever returned (I tested in SQL Server 2017). Between being used for a deprecated (as of SQL Server 2005) system compatibility view, and that view not even returning rows that would make use of this function, it seems safe to conclude that this function is obsolete in addition to being undocumented.
Why Document?
Why document an undocumented, unsupported, and possibly obsolete function? Well:
- because it was a puzzle to solve, and
- because it’s undocumented we don’t have much (or any) info on it, AND because it shows up in SSMS IntelliSense people can find it, AND because it can be executed, people might attempt to use it in their code. Therefore, it’s important to understand how it works and why we shouldn’t use it (aside from it being “undocumented”, which means unsupported, which is enough to convince some folks, but not everyone).
What Now?
With no clear indications of what the UNCOMPRESS
function does, we can at least pass in some simple values to see what comes back, and see if we can make sense of the output. For the following tests, please keep in mind that “8-bit” refers to the VARCHAR
, CHAR
, and TEXT
(deprecated) datatypes. And, “16-bit” refers to the NVARCHAR
, NCHAR
, NTEXT
(deprecated),and XML
datatypes.
Single Character Tests
We will start with the easiest test by passing in one character.
The first query passes a VARCHAR
upper-case “A” (having a value of 0x41
) into UNCOMPRESS
, and gets back the same character, but with an extra byte of 0x00
added on. This should make sense since this function returns NVARCHAR
, which is UTF-16 (characters are either 2 bytes or 4 bytes). The Unicode Code Point is actually U+0041, but SQL Server / Windows / .NET use Little Endian, so the bytes are in reverse order, hence 4100
2 . At this point, the UNCOMPRESS
function is doing just what the CONVERT
function does, so it seems a little redundant.
The second query passes an NVARCHAR
upper-case “B” (having a value of 0x4200
) into UNCOMPRESS
, and gets back the same character, but this time with an extra two bytes of 0x00
added on. This is not what the CONVERT
function does!
-- 8-bit test 1: SELECT '8-bit' AS [Encoding], CONVERT(VARBINARY(10), 'A') AS [Hex"A"], UNCOMPRESS(CONVERT(VARBINARY(10), 'A')) AS [Uncompressed"A"], CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'A'))) AS [HexUncompressed"A"]; /* Encoding Hex"A" Uncompressed"A" HexUncompressed"A" 8-bit 0x41 A 0x4100 */-- 16-bit test 1: SELECT '16-bit' AS [Encoding], CONVERT(VARBINARY(10), N'B') AS [Hex"B"], UNCOMPRESS(CONVERT(VARBINARY(10), N'B')) AS [Uncompressed"B"], CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'B'))) AS [HexUncompressed"B"], CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), CONVERT(VARBINARY(10), N'B'))) AS [HexConverted"B"]; /* Encoding Hex"B" Uncompressed"B" HexUncompressed"B" HexConverted"B" 16-bit 0x4200 B 0x42000000 0x4200 */
Two Character Tests
The next set of tests will pass in two characters.
The first query again shows that each VARCHAR
byte comes back as the same character, but in the expected UTF-16 LE (Little Endian) encoding.
The second query, passing in two NVARCHAR
characters, helps to clarify two things:
- Every single byte going into
UNCOMPRESS
comes back as UTF-16 LE (with the extra0x00
byte added on). Hence, passing in a character that is already in UTF-16 LE encoding (e.g. “D” being the two bytes0x44
and0x00
), will have each of its two bytes converted into UTF-16 LE, leaving us with0x4400
and0x0000
, or0x44000000
(as you can see in the “HexUncompressed"DD"” field). - Even though strings can store character “0” (i.e. U+0000,
(null)
,\0
,0x00
,CHAR(0)
,NCHAR(0)
,\x00
, and so on), they cannot be displayed, nor can any character after them being displayed (as you can see in the “Uncompressed"DD"” field). This is most likely due to character “0” being the string-terminator in many languages.
-- 8-bit test 2: SELECT '8-bit' AS [Encoding], CONVERT(VARBINARY(10), 'CC') AS [Hex"CC"], UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')) AS [Uncompressed"CC"], CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'CC'))) AS [HexUncompressed"CC"]; /* Encoding Hex"CC" Uncompressed"CC" HexUncompressed"CC" 8-bit 0x4343 CC 0x43004300 */-- 16-bit test 2: SELECT '16-bit' AS [Encoding], CONVERT(VARBINARY(10), N'DD') AS [Hex"DD"], UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')) AS [Uncompressed"DD"], CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'DD'))) AS [HexUncompressed"DD"]; /* Encoding Hex"DD" Uncompressed"DD" HexUncompressed"DD" 16-bit 0x44004400 D 0x4400000044000000 */
Confirmation
Just to make sure that UNCOMPRESS
only works with single-byte characters, let’s pass in something besides two “C”s or two “D”s: Unicode Code Point U+4234 (i.e. 䈴).
SELECT NCHAR(0x4234), -- 䈴 UNCOMPRESS(0x4234), -- B4 NCHAR(0x42), -- B NCHAR(0x34); -- 4
Which Bytes Are Converted to UTF-16?
Now that we know that we are dealing with single-byte characters, which single-byte characters specifically are they? Are they VARCHAR
characters of various code pages? Are they VARCHAR
characters from one particular code page? Are they NVARCHAR
/ UTF-16 characters in the U+0000 through U+00FF range that all have a trailing byte of 0x00
? Something else perhaps?
Given that characters with values in the range of 0 – 127 (decimal) / 0x00 – 0x7F (hex) are the same across all code pages / encodings that can be represented in SQL Server, only testing with those (i.e. US English, digits 0 – 9, and some punctuation) often hides / obscures important functional differences. So, we need to test values 128 – 255 / 0x80 – 0xFF across several different code pages / encodings.
Create and Populate Table
The following queries will set up the test data that we need to see (or at least confirm) what is actually happening. Code page 1252 is Latin1 (we are looking at this because it’s used in several collations: anything with “Latin1_General” in the name, French, etc), and code page 1255 is Hebrew (which is distinctly different from 1252, so it will be easy to see differences). Finally, UTF-16 is the encoding used by NVARCHAR
data. For each row, we are inserting a single byte in the range of 0x00
– 0xFF
into each column. We can then easily compare the resulting character of each byte with the output of UNCOMPRESS
.
-- DROP TABLE #Uncompress; CREATE TABLE #Uncompress ( [ValueHex] BINARY(1) NOT NULL PRIMARY KEY, [Value] AS (CONVERT(TINYINT, [ValueHex])), [CP1252] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC, [CP1255] VARCHAR(10) COLLATE Hebrew_100_CI_AS_SC, [UTF16] NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC ); SET NOCOUNT ON; ;WITH cte AS ( SELECT TOP (256) CONVERT(BINARY(1), ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1) AS [num] FROM master.sys.all_columns ) INSERT INTO #Uncompress ([ValueHex], [CP1252], [CP1255], [UTF16]) SELECT num, num, num, num FROM cte;
View the Evidence
The query below will show us the character that each byte represents in each of the three encodings. It also feeds that same byte to the UNCOMPRESS
function, and shows the underlying byte representation of each character after that byte is stored in the NVARCHAR
column and passed into the UNCOMPRESS
function. And, because the characters for each byte in the range of 0x00
– 0x7F
are the same across the encodings, the query only returns the 0x80
– 0xFF
range (you can easily comment out the WHERE
clause to see the boring 0x00
– 0x7F
range).
There are additional notes below the query.
SELECT chr.[Value], chr.[ValueHex], chr.[CP1252], CONVERT(BINARY(2), UNICODE(chr.[CP1252])) AS [CP1252], chr.[CP1255], CONVERT(BINARY(2), UNICODE(chr.[CP1255])) AS [CP1255], chr.[UTF16], CONVERT(BINARY(2), UNICODE(chr.[UTF16])) AS [UTF16], UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS], CONVERT(BINARY(2), chr.[UTF16]) AS [UTF16bytes], CONVERT(BINARY(2), UNCOMPRESS(chr.[ValueHex])) AS [UNCOMPRESSbytes] FROM #Uncompress chr WHERE chr.[Value] > 127 ORDER BY chr.[Value];
In the table shown below, please note:
- The “CP 1252”, “CP 1255”, and “UTF-16” columns are in the form of “Character (Unicode Code Point)”.
- The right-most two columns — “UTF-16 bytes” and “UNCOMPRESS bytes” — show the actual bytes as they exist in the system, in Little Endian. So, Code Point U+0081 would be represented by the bytes
0x8100
. - The Unicode Code Point is shown for each character so it will be easier to see which characters are exactly the same as the “UNCOMPRESS” column, especially for the columns that have non-printable characters.
- Code Points are shown instead of underlying byte value since the underlying byte value is already shown in the two left-most columns. Meaning, the byte
0x0E
is the character “à” on code page 1252, while that same byte is the character “א” on code page 1255.
Value (Hex) | CP 1252 | CP 1255 | UTF-16 | UNCOMPRESS | UTF-16 bytes | UNCOMPRESS bytes | |||
---|---|---|---|---|---|---|---|---|---|
128 (0x80) | € | (20AC) | € | (20AC) | (0080) | 8000 | 8000 | ||
129 (0x81) | (0081) | (0081) | (0081) | 8100 | 8100 | ||||
130 (0x82) | ‚ | (201A) | ‚ | (201A) | (0082) | 8200 | 8200 | ||
131 (0x83) | ƒ | (0192) | ƒ | (0192) | (0083) | 8300 | 8300 | ||
132 (0x84) | „ | (201E) | „ | (201E) | (0084) | 8400 | 8400 | ||
133 (0x85) | … | (2026) | … | (2026) | (0085) | 8500 | 8500 | ||
134 (0x86) | † | (2020) | † | (2020) | (0086) | 8600 | 8600 | ||
135 (0x87) | ‡ | (2021) | ‡ | (2021) | (0087) | 8700 | 8700 | ||
136 (0x88) | ˆ | (02C6) | ˆ | (02C6) | (0088) | 8800 | 8800 | ||
137 (0x89) | ‰ | (2030) | ‰ | (2030) | (0089) | 8900 | 8900 | ||
138 (0x8A) | Š | (0160) | (008A) | (008A) | 8A00 | 8A00 | |||
139 (0x8B) | ‹ | (2039) | ‹ | (2039) | (008B) | 8B00 | 8B00 | ||
140 (0x8C) | Œ | (0152) | (008C) | (008C) | 8C00 | 8C00 | |||
141 (0x8D) | (008D) | (008D) | (008D) | 8D00 | 8D00 | ||||
142 (0x8E) | Ž | (017D) | (008E) | (008E) | 8E00 | 8E00 | |||
143 (0x8F) | (008F) | (008F) | (008F) | 8F00 | 8F00 | ||||
144 (0x90) | (0090) | (0090) | (0090) | 9000 | 9000 | ||||
145 (0x91) | ‘ | (2018) | ‘ | (2018) | (0091) | 9100 | 9100 | ||
146 (0x92) | ’ | (2019) | ’ | (2019) | (0092) | 9200 | 9200 | ||
147 (0x93) | “ | (201C) | “ | (201C) | (0093) | 9300 | 9300 | ||
148 (0x94) | ” | (201D) | ” | (201D) | (0094) | 9400 | 9400 | ||
149 (0x95) | • | (2022) | • | (2022) | (0095) | 9500 | 9500 | ||
150 (0x96) | – | (2013) | – | (2013) | (0096) | 9600 | 9600 | ||
151 (0x97) | — | (2014) | — | (2014) | (0097) | 9700 | 9700 | ||
152 (0x98) | ˜ | (02DC) | ˜ | (02DC) | (0098) | 9800 | 9800 | ||
153 (0x99) | (2122) | (2122) | (0099) | 9900 | 9900 | ||||
154 (0x9A) | š | (0161) | (009A) | (009A) | 9A00 | 9A00 | |||
155 (0x9B) | › | (203A) | › | (203A) | (009B) | 9B00 | 9B00 | ||
156 (0x9C) | œ | (0153) | (009C) | (009C) | 9C00 | 9C00 | |||
157 (0x9D) | (009D) | (009D) | (009D) | 9D00 | 9D00 | ||||
158 (0x9E) | ž | (017E) | (009E) | (009E) | 9E00 | 9E00 | |||
159 (0x9F) | Ÿ | (0178) | (009F) | (009F) | 9F00 | 9F00 | |||
160 (0xA0) | (00A0) | (00A0) | (00A0) | A000 | A000 | ||||
161 (0xA1) | ¡ | (00A1) | ¡ | (00A1) | ¡ | (00A1) | ¡ | A100 | A100 |
162 (0xA2) | ¢ | (00A2) | ¢ | (00A2) | ¢ | (00A2) | ¢ | A200 | A200 |
163 (0xA3) | £ | (00A3) | £ | (00A3) | £ | (00A3) | £ | A300 | A300 |
164 (0xA4) | ¤ | (00A4) | ₪ | (20AA) | ¤ | (00A4) | ¤ | A400 | A400 |
165 (0xA5) | ¥ | (00A5) | ¥ | (00A5) | ¥ | (00A5) | ¥ | A500 | A500 |
166 (0xA6) | ¦ | (00A6) | ¦ | (00A6) | ¦ | (00A6) | ¦ | A600 | A600 |
167 (0xA7) | § | (00A7) | § | (00A7) | § | (00A7) | § | A700 | A700 |
168 (0xA8) | ¨ | (00A8) | ¨ | (00A8) | ¨ | (00A8) | ¨ | A800 | A800 |
169 (0xA9) | © | (00A9) | © | (00A9) | © | (00A9) | © | A900 | A900 |
170 (0xAA) | ª | (00AA) | × | (00D7) | ª | (00AA) | ª | AA00 | AA00 |
171 (0xAB) | « | (00AB) | « | (00AB) | « | (00AB) | « | AB00 | AB00 |
172 (0xAC) | ¬ | (00AC) | ¬ | (00AC) | ¬ | (00AC) | ¬ | AC00 | AC00 |
173 (0xAD) | | (00AD) | | (00AD) | | (00AD) | | AD00 | AD00 |
174 (0xAE) | ® | (00AE) | ® | (00AE) | ® | (00AE) | ® | AE00 | AE00 |
175 (0xAF) | ¯ | (00AF) | ¯ | (00AF) | ¯ | (00AF) | ¯ | AF00 | AF00 |
176 (0xB0) | ° | (00B0) | ° | (00B0) | ° | (00B0) | ° | B000 | B000 |
177 (0xB1) | ± | (00B1) | ± | (00B1) | ± | (00B1) | ± | B100 | B100 |
178 (0xB2) | ² | (00B2) | ² | (00B2) | ² | (00B2) | ² | B200 | B200 |
179 (0xB3) | ³ | (00B3) | ³ | (00B3) | ³ | (00B3) | ³ | B300 | B300 |
180 (0xB4) | ´ | (00B4) | ´ | (00B4) | ´ | (00B4) | ´ | B400 | B400 |
181 (0xB5) | µ | (00B5) | µ | (00B5) | µ | (00B5) | µ | B500 | B500 |
182 (0xB6) | ¶ | (00B6) | ¶ | (00B6) | ¶ | (00B6) | ¶ | B600 | B600 |
183 (0xB7) | · | (00B7) | · | (00B7) | · | (00B7) | · | B700 | B700 |
184 (0xB8) | ¸ | (00B8) | ¸ | (00B8) | ¸ | (00B8) | ¸ | B800 | B800 |
185 (0xB9) | ¹ | (00B9) | ¹ | (00B9) | ¹ | (00B9) | ¹ | B900 | B900 |
186 (0xBA) | º | (00BA) | ÷ | (00F7) | º | (00BA) | º | BA00 | BA00 |
187 (0xBB) | » | (00BB) | » | (00BB) | » | (00BB) | » | BB00 | BB00 |
188 (0xBC) | ¼ | (00BC) | ¼ | (00BC) | ¼ | (00BC) | ¼ | BC00 | BC00 |
189 (0xBD) | ½ | (00BD) | ½ | (00BD) | ½ | (00BD) | ½ | BD00 | BD00 |
190 (0xBE) | ¾ | (00BE) | ¾ | (00BE) | ¾ | (00BE) | ¾ | BE00 | BE00 |
191 (0xBF) | ¿ | (00BF) | ¿ | (00BF) | ¿ | (00BF) | ¿ | BF00 | BF00 |
192 (0xC0) | À | (00C0) | ְ | (05B0) | À | (00C0) | À | C000 | C000 |
193 (0xC1) | Á | (00C1) | ֱ | (05B1) | Á | (00C1) | Á | C100 | C100 |
194 (0xC2) | Â | (00C2) | ֲ | (05B2) | Â | (00C2) | Â | C200 | C200 |
195 (0xC3) | Ã | (00C3) | ֳ | (05B3) | Ã | (00C3) | Ã | C300 | C300 |
196 (0xC4) | Ä | (00C4) | ִ | (05B4) | Ä | (00C4) | Ä | C400 | C400 |
197 (0xC5) | Å | (00C5) | ֵ | (05B5) | Å | (00C5) | Å | C500 | C500 |
198 (0xC6) | Æ | (00C6) | ֶ | (05B6) | Æ | (00C6) | Æ | C600 | C600 |
199 (0xC7) | Ç | (00C7) | ַ | (05B7) | Ç | (00C7) | Ç | C700 | C700 |
200 (0xC8) | È | (00C8) | ָ | (05B8) | È | (00C8) | È | C800 | C800 |
201 (0xC9) | É | (00C9) | ֹ | (05B9) | É | (00C9) | É | C900 | C900 |
202 (0xCA) | Ê | (00CA) | ֺ | (05BA) | Ê | (00CA) | Ê | CA00 | CA00 |
203 (0xCB) | Ë | (00CB) | ֻ | (05BB) | Ë | (00CB) | Ë | CB00 | CB00 |
204 (0xCC) | Ì | (00CC) | ּ | (05BC) | Ì | (00CC) | Ì | CC00 | CC00 |
205 (0xCD) | Í | (00CD) | ֽ | (05BD) | Í | (00CD) | Í | CD00 | CD00 |
206 (0xCE) | Î | (00CE) | ־ | (05BE) | Î | (00CE) | Î | CE00 | CE00 |
207 (0xCF) | Ï | (00CF) | ֿ | (05BF) | Ï | (00CF) | Ï | CF00 | CF00 |
208 (0xD0) | Ð | (00D0) | ׀ | (05C0) | Ð | (00D0) | Ð | D000 | D000 |
209 (0xD1) | Ñ | (00D1) | ׁ | (05C1) | Ñ | (00D1) | Ñ | D100 | D100 |
210 (0xD2) | Ò | (00D2) | ׂ | (05C2) | Ò | (00D2) | Ò | D200 | D200 |
211 (0xD3) | Ó | (00D3) | ׃ | (05C3) | Ó | (00D3) | Ó | D300 | D300 |
212 (0xD4) | Ô | (00D4) | װ | (05F0) | Ô | (00D4) | Ô | D400 | D400 |
213 (0xD5) | Õ | (00D5) | ױ | (05F1) | Õ | (00D5) | Õ | D500 | D500 |
214 (0xD6) | Ö | (00D6) | ײ | (05F2) | Ö | (00D6) | Ö | D600 | D600 |
215 (0xD7) | × | (00D7) | ׳ | (05F3) | × | (00D7) | × | D700 | D700 |
216 (0xD8) | Ø | (00D8) | ״ | (05F4) | Ø | (00D8) | Ø | D800 | D800 |
217 (0xD9) | Ù | (00D9) | | (F88D) | Ù | (00D9) | Ù | D900 | D900 |
218 (0xDA) | Ú | (00DA) | | (F88E) | Ú | (00DA) | Ú | DA00 | DA00 |
219 (0xDB) | Û | (00DB) | | (F88F) | Û | (00DB) | Û | DB00 | DB00 |
220 (0xDC) | Ü | (00DC) | | (F890) | Ü | (00DC) | Ü | DC00 | DC00 |
221 (0xDD) | Ý | (00DD) | | (F891) | Ý | (00DD) | Ý | DD00 | DD00 |
222 (0xDE) | Þ | (00DE) | | (F892) | Þ | (00DE) | Þ | DE00 | DE00 |
223 (0xDF) | ß | (00DF) | | (F893) | ß | (00DF) | ß | DF00 | DF00 |
224 (0xE0) | à | (00E0) | א | (05D0) | à | (00E0) | à | E000 | E000 |
225 (0xE1) | á | (00E1) | ב | (05D1) | á | (00E1) | á | E100 | E100 |
226 (0xE2) | â | (00E2) | ג | (05D2) | â | (00E2) | â | E200 | E200 |
227 (0xE3) | ã | (00E3) | ד | (05D3) | ã | (00E3) | ã | E300 | E300 |
228 (0xE4) | ä | (00E4) | ה | (05D4) | ä | (00E4) | ä | E400 | E400 |
229 (0xE5) | å | (00E5) | ו | (05D5) | å | (00E5) | å | E500 | E500 |
230 (0xE6) | æ | (00E6) | ז | (05D6) | æ | (00E6) | æ | E600 | E600 |
231 (0xE7) | ç | (00E7) | ח | (05D7) | ç | (00E7) | ç | E700 | E700 |
232 (0xE8) | è | (00E8) | ט | (05D8) | è | (00E8) | è | E800 | E800 |
233 (0xE9) | é | (00E9) | י | (05D9) | é | (00E9) | é | E900 | E900 |
234 (0xEA) | ê | (00EA) | ך | (05DA) | ê | (00EA) | ê | EA00 | EA00 |
235 (0xEB) | ë | (00EB) | כ | (05DB) | ë | (00EB) | ë | EB00 | EB00 |
236 (0xEC) | ì | (00EC) | ל | (05DC) | ì | (00EC) | ì | EC00 | EC00 |
237 (0xED) | í | (00ED) | ם | (05DD) | í | (00ED) | í | ED00 | ED00 |
238 (0xEE) | î | (00EE) | מ | (05DE) | î | (00EE) | î | EE00 | EE00 |
239 (0xEF) | ï | (00EF) | ן | (05DF) | ï | (00EF) | ï | EF00 | EF00 |
240 (0xF0) | ð | (00F0) | נ | (05E0) | ð | (00F0) | ð | F000 | F000 |
241 (0xF1) | ñ | (00F1) | ס | (05E1) | ñ | (00F1) | ñ | F100 | F100 |
242 (0xF2) | ò | (00F2) | ע | (05E2) | ò | (00F2) | ò | F200 | F200 |
243 (0xF3) | ó | (00F3) | ף | (05E3) | ó | (00F3) | ó | F300 | F300 |
244 (0xF4) | ô | (00F4) | פ | (05E4) | ô | (00F4) | ô | F400 | F400 |
245 (0xF5) | õ | (00F5) | ץ | (05E5) | õ | (00F5) | õ | F500 | F500 |
246 (0xF6) | ö | (00F6) | צ | (05E6) | ö | (00F6) | ö | F600 | F600 |
247 (0xF7) | ÷ | (00F7) | ק | (05E7) | ÷ | (00F7) | ÷ | F700 | F700 |
248 (0xF8) | ø | (00F8) | ר | (05E8) | ø | (00F8) | ø | F800 | F800 |
249 (0xF9) | ù | (00F9) | ש | (05E9) | ù | (00F9) | ù | F900 | F900 |
250 (0xFA) | ú | (00FA) | ת | (05EA) | ú | (00FA) | ú | FA00 | FA00 |
251 (0xFB) | û | (00FB) | | (F894) | û | (00FB) | û | FB00 | FB00 |
252 (0xFC) | ü | (00FC) | | (F895) | ü | (00FC) | ü | FC00 | FC00 |
253 (0xFD) | ý | (00FD) | | (200E) | ý | (00FD) | ý | FD00 | FD00 |
254 (0xFE) | þ | (00FE) | | (200F) | þ | (00FE) | þ | FE00 | FE00 |
255 (0xFF) | ÿ | (00FF) | | (F896) | ÿ | (00FF) | ÿ | FF00 | FF00 |
Do Large Values / LOBs Work?
We now know what this function does to the bytes that are passed in. But how many bytes are allowed to be passed in? Some functions do not work with the MAX
types. Could this be one of them? That should be easy to find out:
DECLARE @LargeValue VARBINARY(8000); SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4000)); SELECT DATALENGTH(@LargeValue) AS [InputSize], DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize]; -- InputSize OutputSize -- 4000 8000 SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4001)); SELECT DATALENGTH(@LargeValue) AS [InputSize], DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize]; -- InputSize OutputSize -- 4001 NULL
Nope. While the input type is not a MAX
type, the output type would have needed to have been NVARCHAR(MAX)
in order to have held the 8002 bytes taken up by the 4001 characters.
Microsoft Access 2000 “Unicode Compression”
Mitch Schroeter suggested to me that perhaps the UNCOMPRESS
function was intended to work on data coming directly from Access 2000 (or newer) and compressed via the WITH COMPRESSION
option of the CREATE TABLE statement. The documentation for CREATE TABLE
(for Microsoft Access, not SQL Server) states the following towards the end of the Remarks section:
The WITH COMPRESSION attribute was added for CHARACTER columns because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft Jet databases that contain predominately character data, this could mean that the database file would nearly double in size when converted to the Microsoft Access database engine format. However, Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte. If you define a CHARACTER column with this attribute, data will automatically be compressed as it is stored and uncompressed when retrieved from the column.
While this does sound similar, it is not the exact same compression that the UNCOMPRESS
function expects. There is some overlap in the behavior, but the UNCOMPRESS
function is more simplistic than Access’s “Unicode Compression” (that term is in quotes because it is not true Unicode Compression). If Access was doing nothing more than removing the “0x00” bytes, then there would be no way to determine when to add them back in upon uncompressing; very few of the 65,536 two-byte code points have 0x00
bytes, so any algorithm will need to deal with non-compressible code points. There needs to be an indicator of some sort to tell the parser when a byte should be prefixed with a 0x00
, appended with 0x00
, or left alone. For example, if it encounters two bytes — 0xD5E2
— should the next two bytes of output be: 0x00D5
, 0xD500
, or 0xD5E2
? We need more info to figure this out.
I found a technical description of the algorithm used to do the Microsoft Access style “Unicode Compression” and the resulting format. It’s described in this ticket for a Java-based MS Access library:
https://sourceforge.net/p/jackcess/bugs/111/
But that code at the top is from another library, libmdb, whereas the ticket is for Jackcess. The Jackcess implementation can be found here:
Between those two algorithms, it is clear that the 0x00
byte is used to toggle compression on and off. Here are the relevant lines of code in Jackcess (toggling handled by the three highlighted lines):
boolean inCompressedMode = true; while(dataEnd < data.length) { if(data[dataEnd] == (byte)0x00) { // handle current segment decodeTextSegment(data, dataStart, dataEnd, inCompressedMode, textBuf); inCompressedMode = !inCompressedMode; ++dataEnd; dataStart = dataEnd; } else { ++dataEnd; }
That is the indicator I was referring to. Knowing how the algorithm works, we can create a test to see if UNCOMPRESS
can handle that format:
SELECT UNCOMPRESS(0x31323334616263313233); -- 1234abc123 SELECT UNCOMPRESS(0x313233340061626300313233); -- 1234 SELECT UNCOMPRESS(0xFFFE313233340061626300313233); -- ÿþ1234
As you can see, it cannot. Without or without the “is the string compressed or not” indicator set (the 0xFFFE
as the first two bytes in the third test, which is how Access knows if the data is compressed or not, and just happens to be the Byte Order Mark for UTF-16 LE), it interprets the 0x00
byte as the “null” string terminator (\0
in many languages) instead of turning off the compression and interpreting the next 2 bytes as a single UTF-16 code unit.
Side notes about MS Access “Unicode Compression”, not related to UNCOMPRESS
- This Access approach to compression uses
0x00
bytes to toggle compression on and off, and the0x00
bytes have been removed from the original data so this should be a reliable process, right? But what about Unicode code points in which the second byte is0x00
, such as code point U+5000 (i.e. “倀”)? If that0x00
byte remains, then that would greatly complicate the parsing. But Access can store Unicode characters, so this scenario is being handled in some way. To figure out how, I created a test Access DB, created a table with a text field, and entered the following value (containing U+5000):Bob***ŴĤŶ倀健伷倀Œ***
When I viewed the contents of the Access .accdb file, I saw the following bytes stored for that value:
4200 6F00 6200 2A00 2A00 2A00 7401 2401 7601 0050 6550 374F 0050 5001 2A00 2A00 2A00 B o b * * * Ŵ Ĥ Ŷ 倀 健 伷 倀 Œ * * * What’s interesting about that sequence of bytes is that it is not compressed (this field did not start with
0xFFFE
). Yet entering the same string without the two U+5000 characters will show up as compressed in the data file (this field did start with0xFFFE
). This means that the MS Access way of handling strings containing code points in which the second byte is0x00
is to simply not handle them. - The documentation quoted at the beginning of this section mentioned: “Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte“. This is incorrect. Yes, Unicode does represent many character sets. However, there are not multiple character sets that are being compressed into a single byte by this method of compression. As we saw earlier, in the Which Bytes Are Converted to UTF-16? section, only the first 256 code points can be compressed, and those are the same as the ISO-8859-1 Single Byte Character Set (which again, is roughly the same as the Windows-1252 character set). Yet the Hebrew characters, also from a Single Byte Character Set (Windows-1255), all mapped to code points above U+00FF and cannot be compressed (in this method). So, only a single character set, Latin 1, can be compressed (using this algorithm).
Official Unicode Compression
Just to be clear, the compressed format that UNCOMPRESS
is working with is, again, a proprietary compression technique that is not in any way related to actual Unicode Compression. The official Unicode compression is described in “Unicode Technical Standard #6:
A STANDARD COMPRESSION SCHEME FOR UNICODE“. I did test with the Example 9.2 data from that Unicode.org page:
DECLARE @Russian VARBINARY(50) = 0x129CBEC1BAB2B0; SELECT DATALENGTH(@Russian) AS [SourceByteCount], DATALENGTH(UNCOMPRESS(@Russian)) AS [UncompressedByteCount], CONVERT(VARBINARY(50), UNCOMPRESS(@Russian)) AS [UncompressedHex], UNCOMPRESS(@Russian) AS [UncompressedCharacters]; /* SourceByteCount: 7 UncompressedByteCount: 14 UncompressedHex: 0x12009C00BE00C100BA00B200B000 UncompressedCharacters: <DC2>¾Áº²° */
The uncompressed value is supposed to be: Москва
Conclusion
Observations
- For some bytes, the “CP 1255” value is the same as the “UNCOMPRESS” value.
- For most bytes, the “CP 1252” value is the same as the “UNCOMPRESS” value. It’s only characters in the
0x80
–0x9F
range that are different. That range is the C1 Control characters (more info: C1 Controls and Latin-1 Supplement and Latin-1 Supplement 0080—00FF). - For all bytes, the “UTF-16” value is the same as the “UNCOMPRESS” value.
- This only works on the first 256 Unicode Code Points, so it is a very limited, simple form of compression.
- This function is limited to returning at most 4000 characters / 8000 bytes.
- There is no built-in function to produce the “compressed” binary from valid UTF-16 data.
Lessons Learned
The UNCOMPRESS
function does nothing more than add a 0x00
byte to each byte passed in, the result of which is valid UTF-16 Little Endian (i.e. NVARCHAR
) data. Given the various limitations of this function, and the fact that you would have to write your own function to “compress” NVARCHAR
data into this format (a simple CONVERT
won’t work unless you can guarantee that none of the characters found in the 0x80
– 0x9F
range exist in the input data):
DO NOT USE THIS FUNCTION!!
If you need compression because your data is mostly Code Points in the range of 0 – 255 / U+0000 – U+00FF / 0x00
– 0xFF
, then look into Data Compression and Clustered Columnstore Indexes.
Does this function do essentially the same thing as UTF-8? Not really. UTF-8 only has the first 128 characters / Code Points (i.e. 0x00
– 0x7F
) as single-byte characters. And if that is all your data is, then you can simply convert to VARCHAR
. And while UTF-8 is supported natively starting in SQL Server 2019, there are limited scenarios where you should UTF-8 (within SQL Server, that is). And, even when you do save space, you will most likely sacrifice performance (to varying degrees). For more info on UTF-8 support in SQL Server, please see:
Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
The one potentially valid use for UNCOMPRESS
is if you have ISO-8859-1-encoded data, that is already in binary format, and is no more 4000 bytes / characters. This would work due to the first 256 Unicode Code Points being the ISO-8859-1 character set. However, even if you have data that fits this description, it would be better to convert it to Unicode / UTF-16 Little Endian prior to importing it into SQL Server. Or, if the data does not contain any bytes in the range of 0x80
– 0x9F
, then just import it into a VARCHAR
column that is using any of the various collations associated with code page 1252.
Mentions of UNCOMPRESS()
- “The Guru’s Guide to SQL Server Stored Procedures, XML, and HTML” by Ken Henderson. Copyright © 2002 by Ken Henderson.
- “SQL Server 2000 Stored Procedures Handbook” by Robin Dewson, Louis Davidson, Tony Bain, Chuck Hawkins. Copyright © 2003 by Apress.
- “Encrypt all objects” (SQLTeam forum) post by sterobhun on 2002-07-24 at 06:29:09
- “Undocumented uncompress() function behaves strangely” (Azure Feedback)
- “Undoing compression” (SQL Server Central forum)
- Technically, some of the undocumented built-in functions, such as
CLOUD_DATABASEPROPERTYEX
, that return errors similar to: Msg 195, Level 15, State 10, Line XXXXX'CLOUD_DATABASEPROPERTYEX' is not a recognized built-in function name.
can be executed, but only from within a stored procedure marked as being a “system stored procedure”. However, system stored procedures are a topic for another day 😉. ↩ - Just to be clear: the pattern of Unicode Code Point matching the UTF-16 Big Endian encoding only holds true for the first 65,536 Code Points (U+0000 – U+FFFF). ↩