What Does the Undocumented UNCOMPRESS Function Do?

(last updated: 2019-03-18 @ 14:55 EST / 2019-03-18 @ 18:55 UTC )

IntelliSense in SQL Server Management Studio (SSMS) can be quite helpful. It can save time typing by presenting possible object or function names. And, it can even show you the signatures (i.e. input and output parameters / return types) of stored procedures and functions.

Another thing that IntelliSense does, even though it was certainly not intentional, is show you the names of (some) undocumented functions. Many of these undocumented functions won't let you execute them, so there's not much to investigate there 1. But, several of these undocumented built-in functions can be executed. One of them is UNCOMPRESS. This function is not to be confused with DECOMPRESS, the companion to COMPRESS, which are actually GUnzip and GZip, respectively (and were introduced in SQL Server 2016, if you haven't seen them before).
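
For reference, the documented pair is easy to verify (COMPRESS produces GZipped VARBINARY(MAX) data, and DECOMPRESS reverses it):

SELECT CONVERT(NVARCHAR(100), DECOMPRESS(COMPRESS(N'Hello'))) AS [RoundTrip];
-- Hello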

No, UNCOMPRESS does something entirely different, but the only documentation is the function signature that appears in SSMS:

built-in function UNCOMPRESS(Param1 varbinary(1)) RETURNS nvarchar(1)

I did a bit of searching around, and all I could find were a couple of references to using it on the ctext column in sys.syscomments, but only if the status column (a bit-masked value) had the “2” bit set (i.e. status & 2 = 2 ). The first two entries in the “Mentions of UNCOMPRESS()” section at the end of this post are books that contain this same info. With only that one clue to go on, I used the following query to find some data that was meant to be passed into this function:

SELECT *
FROM   sys.syscomments
WHERE  [compressed] = 1
OR     [status] & 2 = 2;

I tried in [master], [msdb], and even [MSSQLSystemResource] (in single-user mode) but no rows were ever returned (I tested in SQL Server 2017). Between being used for a deprecated (as of SQL Server 2005) system compatibility view, and that view not even returning rows that would make use of this function, it seems safe to conclude that this function is obsolete in addition to being undocumented.

Why Document?

Why document an undocumented, unsupported, and possibly obsolete function? Well:

  1. because it was a puzzle to solve, and
  2. because it's undocumented, we don't have much (or any) info on it; AND because it shows up in SSMS IntelliSense, people can find it; AND because it can be executed, people might attempt to use it in their code. Therefore, it's important to understand how it works, and why we shouldn't use it (aside from it being "undocumented", and hence unsupported, which is enough to convince some folks, but not everyone).

What Now?

With no clear indications of what the UNCOMPRESS function does, we can at least pass in some simple values to see what comes back, and see if we can make sense of the output. For the following tests, please keep in mind that "8-bit" refers to the VARCHAR, CHAR, and TEXT (deprecated) datatypes. And, "16-bit" refers to the NVARCHAR, NCHAR, NTEXT (deprecated), and XML datatypes.
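
As a trivial illustration of that terminology (each 8-bit character takes 1 byte, and each 16-bit character takes 2 bytes, ignoring Supplementary Characters):

SELECT DATALENGTH('A') AS [8bitBytes],   -- 1
       DATALENGTH(N'A') AS [16bitBytes]; -- 2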

Single Character Tests

We will start with the easiest test by passing in one character.

The first query passes a VARCHAR upper-case “A” (having a value of 0x41) into UNCOMPRESS , and gets back the same character, but with an extra byte of 0x00 added on. This should make sense since this function returns NVARCHAR, which is UTF-16 (characters are either 2 bytes or 4 bytes). The Unicode Code Point is actually U+0041, but SQL Server / Windows / .NET use Little Endian, so the bytes are in reverse order, hence 4100 2 . At this point, the UNCOMPRESS function is doing just what the CONVERT function does, so it seems a little redundant.

The second query passes an NVARCHAR upper-case “B” (having a value of 0x4200) into UNCOMPRESS , and gets back the same character, but this time with an extra two bytes of 0x00 added on. This is not what the CONVERT function does!

-- 8-bit test 1:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'A') AS [Hex"A"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'A')) AS [Uncompressed"A"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'A')))
           AS [HexUncompressed"A"];
/*
Encoding    Hex"A"    Uncompressed"A"    HexUncompressed"A"
8-bit       0x41      A                  0x4100
*/
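
For comparison, a plain CONVERT of the VARCHAR "A" to NVARCHAR produces those same two bytes:

-- 8-bit CONVERT comparison:
SELECT CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), 'A')) AS [HexConverted"A"];
-- 0x4100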

-- 16-bit test 1:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'B') AS [Hex"B"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'B')) AS [Uncompressed"B"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'B')))
           AS [HexUncompressed"B"],
       CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), CONVERT(VARBINARY(10),
           N'B'))) AS [HexConverted"B"];
/*
Encoding   Hex"B"   Uncompressed"B"   HexUncompressed"B"   HexConverted"B"
16-bit     0x4200   B                 0x42000000           0x4200
*/

Two Character Tests

The next set of tests will pass in two characters.

The first query again shows that each VARCHAR byte comes back as the same character, but in the expected UTF-16 LE (Little Endian) encoding.

The second query, passing in two NVARCHAR characters, helps to clarify two things:

  1. Every single byte going into UNCOMPRESS comes back as UTF-16 LE (with the extra 0x00 byte added on). Hence, passing in a character that is already in UTF-16 LE encoding (e.g. “D” being the two bytes 0x44 and 0x00), will have each of its two bytes converted into UTF-16 LE, leaving us with 0x4400 and 0x0000, or 0x44000000 (as you can see in the “HexUncompressed”DD”” field).
  2. Even though strings can store character "0" (i.e. U+0000, (null), \0, 0x00, CHAR(0), NCHAR(0), \x00, and so on), it cannot be displayed, nor can any character after it be displayed (as you can see in the "Uncompressed"DD"" field; a quick demonstration follows the two test queries below). This is most likely due to character "0" being the string-terminator in many languages.
-- 8-bit test 2:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'CC') AS [Hex"CC"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')) AS [Uncompressed"CC"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')))
           AS [HexUncompressed"CC"];
/*
Encoding    Hex"CC"    Uncompressed"CC"    HexUncompressed"CC"
8-bit       0x4343     CC                  0x43004300
*/

-- 16-bit test 2:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'DD') AS [Hex"DD"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')) AS [Uncompressed"DD"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')))
           AS [HexUncompressed"DD"];
/*
Encoding    Hex"DD"       Uncompressed"DD"    HexUncompressed"DD"
16-bit      0x44004400    D                   0x4400000044000000
*/
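
And here is the quick demonstration promised above: character "0" is stored just fine, but nothing from that point on is displayed:

-- string-terminator demonstration:
SELECT N'A' + NCHAR(0) + N'B' AS [Value],
       DATALENGTH(N'A' + NCHAR(0) + N'B') AS [Bytes];
-- [Value] displays as just "A", yet [Bytes] = 6 (all three characters are stored)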

Confirmation

Just to make sure that UNCOMPRESS only works with single-byte characters, let’s pass in something besides two “C”s or two “D”s: Unicode Code Point U+4234 (i.e. 䈴).

SELECT NCHAR(0x4234),      -- 䈴
       UNCOMPRESS(0x4234), -- B4
       NCHAR(0x42),        -- B
       NCHAR(0x34);        -- 4


Which Bytes Are Converted to UTF-16?

Now that we know that we are dealing with single-byte characters, which single-byte characters specifically are they? Are they VARCHAR characters of various code pages? Are they VARCHAR characters from one particular code page? Are they NVARCHAR / UTF-16 characters in the U+0000 through U+00FF range that all have a trailing byte of 0x00? Something else perhaps?

Given that characters with values in the range of 0 – 127 (decimal) / 0x00 – 0x7F (hex) are the same across all code pages / encodings that can be represented in SQL Server, only testing with those (i.e. US English, digits 0 – 9, and some punctuation) often hides / obscures important functional differences. So, we need to test values 128 – 255 / 0x80 – 0xFF across several different code pages / encodings.

Create and Populate Table

The following queries will set up the test data that we need to see (or at least confirm) what is actually happening. Code page 1252 is Latin1 (we are looking at this because it's used in several collations: anything with "Latin1_General" in the name, French, etc.), and code page 1255 is Hebrew (which is distinctly different from 1252, so it will be easy to see differences). Finally, UTF-16 is the encoding used by NVARCHAR data. For each row, we are inserting a single byte in the range of 0x00 – 0xFF into each column. We can then easily compare the resulting character of each byte with the output of UNCOMPRESS.

-- DROP TABLE #Uncompress;
CREATE TABLE #Uncompress
(
    [ValueHex] BINARY(1) NOT NULL PRIMARY KEY,
    [Value] AS (CONVERT(TINYINT, [ValueHex])),
    [CP1252] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC,
    [CP1255] VARCHAR(10) COLLATE Hebrew_100_CI_AS_SC,
    [UTF16] NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC
);

SET NOCOUNT ON;
;WITH cte AS
(
    SELECT TOP (256) CONVERT(BINARY(1), ROW_NUMBER() OVER
                              (ORDER BY (SELECT 0)) - 1) AS [num]
    FROM   master.sys.all_columns
)
INSERT INTO #Uncompress ([ValueHex], [CP1252], [CP1255], [UTF16])
    SELECT num, num, num, num
    FROM   cte;

View the Evidence

The query below will show us the character that each byte represents in each of the three encodings. It also feeds that same byte to the UNCOMPRESS function, and shows the underlying byte representation of each character after that byte is stored in the NVARCHAR column and passed into the UNCOMPRESS function. And, because the characters for each byte in the range of 0x00 – 0x7F are the same across the encodings, the query only returns the 0x80 – 0xFF range (you can easily comment out the WHERE clause to see the boring 0x00 – 0x7F range).

There are additional notes below the query.

SELECT chr.[Value],
       chr.[ValueHex],
       chr.[CP1252],
       CONVERT(BINARY(2), UNICODE(chr.[CP1252])) AS [CP1252],
       chr.[CP1255],
       CONVERT(BINARY(2), UNICODE(chr.[CP1255])) AS [CP1255],
       chr.[UTF16],
       CONVERT(BINARY(2), UNICODE(chr.[UTF16])) AS [UTF16],
       UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS],
       CONVERT(BINARY(2), chr.[UTF16]) AS [UTF16bytes],
       CONVERT(BINARY(2), UNCOMPRESS(chr.[ValueHex])) AS [UNCOMPRESSbytes]
FROM   #Uncompress chr
WHERE  chr.[Value] > 127
ORDER BY chr.[Value];

In the table shown below, please note:

  • The “CP 1252”, “CP 1255”, and “UTF-16” columns are in the form of “Character (Unicode Code Point)”.
  • The right-most two columns — “UTF-16 bytes” and “UNCOMPRESS bytes” — show the actual bytes as they exist in the system, in Little Endian. So, Code Point U+0081 would be represented by the bytes 0x8100.
  • The Unicode Code Point is shown for each character so it will be easier to see which characters are exactly the same as the “UNCOMPRESS” column, especially for the columns that have non-printable characters.
  • Code Points are shown instead of the underlying byte value since the underlying byte value is already shown in the two left-most columns. Meaning, the byte 0xE0 is the character "à" on code page 1252, while that same byte is the character "א" on code page 1255.
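
Before looking at the full table, here is a quick spot-check of a single byte (0xE0), using the #Uncompress table created above:

SELECT chr.[CP1252], chr.[CP1255], chr.[UTF16],
       UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS]
FROM   #Uncompress chr
WHERE  chr.[Value] = 224; -- 0xE0
-- à    א    à    à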


Value (Hex)   CP 1252   CP 1255   UTF-16   UNCOMPRESS   UTF-16 bytes   UNCOMPRESS bytes
128 (0x80) € (20AC) € (20AC) (0080) 8000 8000
129 (0x81) (0081) (0081) (0081) 8100 8100
130 (0x82) ‚ (201A) ‚ (201A) (0082) 8200 8200
131 (0x83) ƒ (0192) ƒ (0192) (0083) 8300 8300
132 (0x84) „ (201E) „ (201E) (0084) 8400 8400
133 (0x85) … (2026) … (2026) (0085) 8500 8500
134 (0x86) † (2020) † (2020) (0086) 8600 8600
135 (0x87) ‡ (2021) ‡ (2021) (0087) 8700 8700
136 (0x88) ˆ (02C6) ˆ (02C6) (0088) 8800 8800
137 (0x89) ‰ (2030) ‰ (2030) (0089) 8900 8900
138 (0x8A) Š (0160) (008A) (008A) 8A00 8A00
139 (0x8B) ‹ (2039) ‹ (2039) (008B) 8B00 8B00
140 (0x8C) Œ (0152) (008C) (008C) 8C00 8C00
141 (0x8D) (008D) (008D) (008D) 8D00 8D00
142 (0x8E) Ž (017D) (008E) (008E) 8E00 8E00
143 (0x8F) (008F) (008F) (008F) 8F00 8F00
144 (0x90) (0090) (0090) (0090) 9000 9000
145 (0x91) ‘ (2018) ‘ (2018) (0091) 9100 9100
146 (0x92) ’ (2019) ’ (2019) (0092) 9200 9200
147 (0x93) “ (201C) “ (201C) (0093) 9300 9300
148 (0x94) ” (201D) ” (201D) (0094) 9400 9400
149 (0x95) • (2022) • (2022) (0095) 9500 9500
150 (0x96) – (2013) – (2013) (0096) 9600 9600
151 (0x97) — (2014) — (2014) (0097) 9700 9700
152 (0x98) ˜ (02DC) ˜ (02DC) (0098) 9800 9800
153 (0x99) ™ (2122) ™ (2122) (0099) 9900 9900
154 (0x9A) š (0161) (009A) (009A) 9A00 9A00
155 (0x9B) › (203A) › (203A) (009B) 9B00 9B00
156 (0x9C) œ (0153) (009C) (009C) 9C00 9C00
157 (0x9D) (009D) (009D) (009D) 9D00 9D00
158 (0x9E) ž (017E) (009E) (009E) 9E00 9E00
159 (0x9F) Ÿ (0178) (009F) (009F) 9F00 9F00
160 (0xA0)   (00A0)   (00A0)   (00A0)   A000 A000
161 (0xA1) ¡ (00A1) ¡ (00A1) ¡ (00A1) ¡ A100 A100
162 (0xA2) ¢ (00A2) ¢ (00A2) ¢ (00A2) ¢ A200 A200
163 (0xA3) £ (00A3) £ (00A3) £ (00A3) £ A300 A300
164 (0xA4) ¤ (00A4) ₪ (20AA) ¤ (00A4) ¤ A400 A400
165 (0xA5) ¥ (00A5) ¥ (00A5) ¥ (00A5) ¥ A500 A500
166 (0xA6) ¦ (00A6) ¦ (00A6) ¦ (00A6) ¦ A600 A600
167 (0xA7) § (00A7) § (00A7) § (00A7) § A700 A700
168 (0xA8) ¨ (00A8) ¨ (00A8) ¨ (00A8) ¨ A800 A800
169 (0xA9) © (00A9) © (00A9) © (00A9) © A900 A900
170 (0xAA) ª (00AA) × (00D7) ª (00AA) ª AA00 AA00
171 (0xAB) « (00AB) « (00AB) « (00AB) « AB00 AB00
172 (0xAC) ¬ (00AC) ¬ (00AC) ¬ (00AC) ¬ AC00 AC00
173 (0xAD) ­ (00AD) ­ (00AD) ­ (00AD) ­ AD00 AD00
174 (0xAE) ® (00AE) ® (00AE) ® (00AE) ® AE00 AE00
175 (0xAF) ¯ (00AF) ¯ (00AF) ¯ (00AF) ¯ AF00 AF00
176 (0xB0) ° (00B0) ° (00B0) ° (00B0) ° B000 B000
177 (0xB1) ± (00B1) ± (00B1) ± (00B1) ± B100 B100
178 (0xB2) ² (00B2) ² (00B2) ² (00B2) ² B200 B200
179 (0xB3) ³ (00B3) ³ (00B3) ³ (00B3) ³ B300 B300
180 (0xB4) ´ (00B4) ´ (00B4) ´ (00B4) ´ B400 B400
181 (0xB5) µ (00B5) µ (00B5) µ (00B5) µ B500 B500
182 (0xB6) ¶ (00B6) ¶ (00B6) ¶ (00B6) ¶ B600 B600
183 (0xB7) · (00B7) · (00B7) · (00B7) · B700 B700
184 (0xB8) ¸ (00B8) ¸ (00B8) ¸ (00B8) ¸ B800 B800
185 (0xB9) ¹ (00B9) ¹ (00B9) ¹ (00B9) ¹ B900 B900
186 (0xBA) º (00BA) ÷ (00F7) º (00BA) º BA00 BA00
187 (0xBB) » (00BB) » (00BB) » (00BB) » BB00 BB00
188 (0xBC) ¼ (00BC) ¼ (00BC) ¼ (00BC) ¼ BC00 BC00
189 (0xBD) ½ (00BD) ½ (00BD) ½ (00BD) ½ BD00 BD00
190 (0xBE) ¾ (00BE) ¾ (00BE) ¾ (00BE) ¾ BE00 BE00
191 (0xBF) ¿ (00BF) ¿ (00BF) ¿ (00BF) ¿ BF00 BF00
192 (0xC0) À (00C0) ְ (05B0) À (00C0) À C000 C000
193 (0xC1) Á (00C1) ֱ (05B1) Á (00C1) Á C100 C100
194 (0xC2) Â (00C2) ֲ (05B2) Â (00C2) Â C200 C200
195 (0xC3) Ã (00C3) ֳ (05B3) Ã (00C3) Ã C300 C300
196 (0xC4) Ä (00C4) ִ (05B4) Ä (00C4) Ä C400 C400
197 (0xC5) Å (00C5) ֵ (05B5) Å (00C5) Å C500 C500
198 (0xC6) Æ (00C6) ֶ (05B6) Æ (00C6) Æ C600 C600
199 (0xC7) Ç (00C7) ַ (05B7) Ç (00C7) Ç C700 C700
200 (0xC8) È (00C8) ָ (05B8) È (00C8) È C800 C800
201 (0xC9) É (00C9) ֹ (05B9) É (00C9) É C900 C900
202 (0xCA) Ê (00CA) ֺ (05BA) Ê (00CA) Ê CA00 CA00
203 (0xCB) Ë (00CB) ֻ (05BB) Ë (00CB) Ë CB00 CB00
204 (0xCC) Ì (00CC) ּ (05BC) Ì (00CC) Ì CC00 CC00
205 (0xCD) Í (00CD) ֽ (05BD) Í (00CD) Í CD00 CD00
206 (0xCE) Î (00CE) ־ (05BE) Î (00CE) Î CE00 CE00
207 (0xCF) Ï (00CF) ֿ (05BF) Ï (00CF) Ï CF00 CF00
208 (0xD0) Ð (00D0) ׀ (05C0) Ð (00D0) Ð D000 D000
209 (0xD1) Ñ (00D1) ׁ (05C1) Ñ (00D1) Ñ D100 D100
210 (0xD2) Ò (00D2) ׂ (05C2) Ò (00D2) Ò D200 D200
211 (0xD3) Ó (00D3) ׃ (05C3) Ó (00D3) Ó D300 D300
212 (0xD4) Ô (00D4) װ (05F0) Ô (00D4) Ô D400 D400
213 (0xD5) Õ (00D5) ױ (05F1) Õ (00D5) Õ D500 D500
214 (0xD6) Ö (00D6) ײ (05F2) Ö (00D6) Ö D600 D600
215 (0xD7) × (00D7) ׳ (05F3) × (00D7) × D700 D700
216 (0xD8) Ø (00D8) ״ (05F4) Ø (00D8) Ø D800 D800
217 (0xD9) Ù (00D9) (F88D) Ù (00D9) Ù D900 D900
218 (0xDA) Ú (00DA) (F88E) Ú (00DA) Ú DA00 DA00
219 (0xDB) Û (00DB) (F88F) Û (00DB) Û DB00 DB00
220 (0xDC) Ü (00DC) (F890) Ü (00DC) Ü DC00 DC00
221 (0xDD) Ý (00DD) (F891) Ý (00DD) Ý DD00 DD00
222 (0xDE) Þ (00DE) (F892) Þ (00DE) Þ DE00 DE00
223 (0xDF) ß (00DF) (F893) ß (00DF) ß DF00 DF00
224 (0xE0) à (00E0) א (05D0) à (00E0) à E000 E000
225 (0xE1) á (00E1) ב (05D1) á (00E1) á E100 E100
226 (0xE2) â (00E2) ג (05D2) â (00E2) â E200 E200
227 (0xE3) ã (00E3) ד (05D3) ã (00E3) ã E300 E300
228 (0xE4) ä (00E4) ה (05D4) ä (00E4) ä E400 E400
229 (0xE5) å (00E5) ו (05D5) å (00E5) å E500 E500
230 (0xE6) æ (00E6) ז (05D6) æ (00E6) æ E600 E600
231 (0xE7) ç (00E7) ח (05D7) ç (00E7) ç E700 E700
232 (0xE8) è (00E8) ט (05D8) è (00E8) è E800 E800
233 (0xE9) é (00E9) י (05D9) é (00E9) é E900 E900
234 (0xEA) ê (00EA) ך (05DA) ê (00EA) ê EA00 EA00
235 (0xEB) ë (00EB) כ (05DB) ë (00EB) ë EB00 EB00
236 (0xEC) ì (00EC) ל (05DC) ì (00EC) ì EC00 EC00
237 (0xED) í (00ED) ם (05DD) í (00ED) í ED00 ED00
238 (0xEE) î (00EE) מ (05DE) î (00EE) î EE00 EE00
239 (0xEF) ï (00EF) ן (05DF) ï (00EF) ï EF00 EF00
240 (0xF0) ð (00F0) נ (05E0) ð (00F0) ð F000 F000
241 (0xF1) ñ (00F1) ס (05E1) ñ (00F1) ñ F100 F100
242 (0xF2) ò (00F2) ע (05E2) ò (00F2) ò F200 F200
243 (0xF3) ó (00F3) ף (05E3) ó (00F3) ó F300 F300
244 (0xF4) ô (00F4) פ (05E4) ô (00F4) ô F400 F400
245 (0xF5) õ (00F5) ץ (05E5) õ (00F5) õ F500 F500
246 (0xF6) ö (00F6) צ (05E6) ö (00F6) ö F600 F600
247 (0xF7) ÷ (00F7) ק (05E7) ÷ (00F7) ÷ F700 F700
248 (0xF8) ø (00F8) ר (05E8) ø (00F8) ø F800 F800
249 (0xF9) ù (00F9) ש (05E9) ù (00F9) ù F900 F900
250 (0xFA) ú (00FA) ת (05EA) ú (00FA) ú FA00 FA00
251 (0xFB) û (00FB) (F894) û (00FB) û FB00 FB00
252 (0xFC) ü (00FC) (F895) ü (00FC) ü FC00 FC00
253 (0xFD) ý (00FD) (200E) ý (00FD) ý FD00 FD00
254 (0xFE) þ (00FE) (200F) þ (00FE) þ FE00 FE00
255 (0xFF) ÿ (00FF) (F896) ÿ (00FF) ÿ FF00 FF00


Do Large Values / LOBs Work?

We now know what this function does to the bytes that are passed in. But how many bytes are allowed to be passed in? Some functions do not work with the MAX types. Could this be one of them? That should be easy to find out:

DECLARE @LargeValue VARBINARY(8000);

SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4000));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4000          8000

SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4001));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4001          NULL

Nope. While the input type is not a MAX type, the output type would need to be NVARCHAR(MAX) in order to hold the 8002 bytes taken up by those 4001 characters.



Microsoft Access 2000 “Unicode Compression”

Mitch Schroeter suggested to me that perhaps the UNCOMPRESS function was intended to work on data coming directly from Access 2000 (or newer) and compressed via the WITH COMPRESSION option of the CREATE TABLE statement. The documentation for CREATE TABLE (for Microsoft Access, not SQL Server) states the following towards the end of the Remarks section:

The WITH COMPRESSION attribute was added for CHARACTER columns because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft Jet databases that contain predominately character data, this could mean that the database file would nearly double in size when converted to the Microsoft Access database engine format. However, Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte. If you define a CHARACTER column with this attribute, data will automatically be compressed as it is stored and uncompressed when retrieved from the column.

While this does sound similar, it is not the exact same compression that the UNCOMPRESS function expects. There is some overlap in the behavior, but the UNCOMPRESS function is more simplistic than Access’s “Unicode Compression” (that term is in quotes because it is not true Unicode Compression). If Access was doing nothing more than removing the “0x00” bytes, then there would be no way to determine when to add them back in upon uncompressing; very few of the 65,536 two-byte code points have 0x00 bytes, so any algorithm will need to deal with non-compressible code points. There needs to be an indicator of some sort to tell the parser when a byte should be prefixed with a 0x00, appended with 0x00, or left alone. For example, if it encounters two bytes — 0xD5E2 — should the next two bytes of output be: 0x00D5, 0xD500, or 0xD5E2? We need more info to figure this out.

I found a technical description of the algorithm used to do the Microsoft Access style “Unicode Compression” and the resulting format. It’s described in this ticket for a Java-based MS Access library:

https://sourceforge.net/p/jackcess/bugs/111/

But that code at the top is from another library, libmdb, whereas the ticket is for Jackcess. The Jackcess implementation can be found here:

https://sourceforge.net/p/jackcess/code/HEAD/tree/jackcess/trunk/src/main/java/com/healthmarketscience/jackcess/impl/ColumnImpl.java#l1483

Between those two algorithms, it is clear that the 0x00 byte is used to toggle compression on and off. Here are the relevant lines of code in Jackcess (the toggling is handled by the check for the 0x00 byte, the call to decodeTextSegment, and the flipping of inCompressedMode):

      boolean inCompressedMode = true;
      while(dataEnd < data.length) {
        if(data[dataEnd] == (byte)0x00) {

          // handle current segment
          decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
                            textBuf);
          inCompressedMode = !inCompressedMode;
          ++dataEnd;
          dataStart = dataEnd;

        } else {
          ++dataEnd;
        }

That is the indicator I was referring to. Knowing how the algorithm works, we can create a test to see if UNCOMPRESS can handle that format:

SELECT UNCOMPRESS(0x31323334616263313233);
-- 1234abc123

SELECT UNCOMPRESS(0x313233340061626300313233);
-- 1234

SELECT UNCOMPRESS(0xFFFE313233340061626300313233);
-- ÿþ1234

As you can see, it cannot. With or without the "is the string compressed or not" indicator set (the 0xFFFE as the first two bytes in the third test, which is how Access knows whether the data is compressed, and which just happens to be the Byte Order Mark for UTF-16 LE), UNCOMPRESS interprets the 0x00 byte as the "null" string terminator (\0 in many languages) instead of turning off the compression and interpreting the next 2 bytes as a single UTF-16 code unit.

Side notes about MS Access “Unicode Compression”, not related to UNCOMPRESS

  1. This Access approach to compression uses 0x00 bytes to toggle compression on and off, and the 0x00 bytes have been removed from the original data so this should be a reliable process, right? But what about Unicode code points in which the second byte is 0x00, such as code point U+5000 (i.e. “倀”)? If that 0x00 byte remains, then that would greatly complicate the parsing. But Access can store Unicode characters, so this scenario is being handled in some way. To figure out how, I created a test Access DB, created a table with a text field, and entered the following value (containing U+5000):
     
    Bob***ŴĤŶ倀健伷倀Œ***

    When I viewed the contents of the Access .accdb file, I saw the following bytes stored for that value:

    4200 6F00 6200 2A00 2A00 2A00 7401 2401 7601 0050 6550 374F 0050 5001 2A00 2A00 2A00
B o b * * * Ŵ Ĥ Ŷ 倀 健 伷 倀 Œ * * *

    What's interesting about that sequence of bytes is that it is not compressed (this field did not start with 0xFFFE). Yet entering the same string without the two U+5000 characters will show up as compressed in the data file (that field did start with 0xFFFE). This means that the MS Access way of handling strings containing code points in which the second byte is 0x00 is to simply not compress them at all.

  2. The documentation quoted at the beginning of this section mentioned: “Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte“. This is incorrect. Yes, Unicode does represent many character sets. However, there are not multiple character sets that are being compressed into a single byte by this method of compression. As we saw earlier, in the Which Bytes Are Converted to UTF-16? section, only the first 256 code points can be compressed, and those are the same as the ISO-8859-1 Single Byte Character Set (which again, is roughly the same as the Windows-1252 character set). Yet the Hebrew characters, also from a Single Byte Character Set (Windows-1255), all mapped to code points above U+00FF and cannot be compressed (in this method). So, only a single character set, Latin 1, can be compressed (using this algorithm).
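
A quick check confirms that second side note: even though the Hebrew characters come from a Single Byte Character Set, their Unicode Code Points are well above U+00FF, so there is no single byte that this scheme could compress them into:

SELECT UNICODE(N'א') AS [CodePoint],               -- 1488 (i.e. U+05D0)
       CONVERT(BINARY(2), N'א') AS [UTF16LEbytes]; -- 0xD005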


Official Unicode Compression

Just to be clear, the compressed format that UNCOMPRESS is working with is, again, a proprietary compression technique that is not in any way related to actual Unicode Compression. The official Unicode compression is described in "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode". I did test with the Example 9.2 data from that Unicode.org page:

DECLARE @Russian VARBINARY(50) = 0x129CBEC1BAB2B0;
SELECT DATALENGTH(@Russian) AS [SourceByteCount],
       DATALENGTH(UNCOMPRESS(@Russian)) AS [UncompressedByteCount],
       CONVERT(VARBINARY(50), UNCOMPRESS(@Russian)) AS [UncompressedHex],
       UNCOMPRESS(@Russian) AS [UncompressedCharacters];
/*
SourceByteCount:            7
UncompressedByteCount:     14
UncompressedHex:           0x12009C00BE00C100BA00B200B000
UncompressedCharacters:    <DC2>œ¾Áº²°
*/

The uncompressed value is supposed to be: Москва


Conclusion

Observations

  1. For some bytes, the “CP 1255” value is the same as the “UNCOMPRESS” value.
  2. For most bytes, the "CP 1252" value is the same as the "UNCOMPRESS" value. It's only the characters in the 0x80 – 0x9F range that are different. That range is the C1 Control characters (more info: "C1 Controls and Latin-1 Supplement" and "Latin-1 Supplement (0080 – 00FF)").
  3. For all bytes, the “UTF-16” value is the same as the “UNCOMPRESS” value.
  4. This only works on the first 256 Unicode Code Points, so it is a very limited, simple form of compression.
  5. This function is limited to returning at most 4000 characters / 8000 bytes.
  6. There is no built-in function to produce the “compressed” binary from valid UTF-16 data.
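
To make observation #6 concrete, the following is a minimal sketch of what such a "compressor" would have to do: keep only the low byte of each UTF-16 LE code unit. It assumes (and does not validate) that every character is in the U+0000 – U+00FF range, the only range that UNCOMPRESS can round-trip:

-- hypothetical "compressor" sketch (there is no built-in equivalent):
DECLARE @Input      NVARCHAR(4000) = NCHAR(0x91) + N'Hello' + NCHAR(0xE9),
        @Compressed VARBINARY(4000) = 0x,
        @Index      INT = 1;

WHILE (@Index <= LEN(@Input))
BEGIN
    -- keep the low byte of the current code unit (assumes Code Point <= U+00FF)
    SET @Compressed += CONVERT(BINARY(1), UNICODE(SUBSTRING(@Input, @Index, 1)));
    SET @Index += 1;
END;

SELECT @Compressed AS [Compressed],
       UNCOMPRESS(@Compressed) AS [RoundTripped];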

Lessons Learned

The UNCOMPRESS function does nothing more than add a 0x00 byte to each byte passed in, the result of which is valid UTF-16 Little Endian (i.e. NVARCHAR ) data. Given the various limitations of this function, and the fact that you would have to write your own function to "compress" NVARCHAR data into this format (a simple CONVERT won't work unless you can guarantee that none of the characters found in the 0x80 – 0x9F range exist in the input data):

DO NOT USE THIS FUNCTION!!

If you need compression because your data is mostly Code Points in the range of 0 – 255 / U+0000 – U+00FF / 0x00 – 0xFF, then look into Data Compression and Clustered Columnstore Indexes.
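
For example, Row compression (which, starting in SQL Server 2008 R2, also applies an SCSU-based Unicode compression to NCHAR / NVARCHAR columns) can be enabled with a single statement (the table name here is hypothetical):

ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = ROW);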

Does this function do essentially the same thing as UTF-8? Not really. UTF-8 only has the first 128 characters / Code Points (i.e. 0x00 – 0x7F) as single-byte characters. And if that is all your data is, then you can simply convert to VARCHAR. And while UTF-8 is supported natively starting in SQL Server 2019, there are limited scenarios where you should use UTF-8 (within SQL Server, that is). And, even when you do save space, you will most likely sacrifice performance (to varying degrees). For more info on UTF-8 support in SQL Server, please see:

Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
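
And to illustrate the "simply convert to VARCHAR" point from a moment ago: if every character is in the U+0000 – U+007F range, the conversion is lossless and cuts the storage in half:

DECLARE @Data NVARCHAR(100) = N'Hello, World!';
SELECT DATALENGTH(@Data) AS [NVarcharBytes],                       -- 26
       DATALENGTH(CONVERT(VARCHAR(100), @Data)) AS [VarcharBytes]; -- 13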

The one potentially valid use for UNCOMPRESS is if you have ISO-8859-1-encoded data that is already in binary format and is no more than 4000 bytes / characters. This would work due to the first 256 Unicode Code Points being the ISO-8859-1 character set. However, even if you have data that fits this description, it would be better to convert it to Unicode / UTF-16 Little Endian prior to importing it into SQL Server. Or, if the data does not contain any bytes in the range of 0x80 – 0x9F, then just import it into a VARCHAR column that is using any of the various collations associated with code page 1252.
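
For example (and only if the data truly is ISO-8859-1), the bytes come back as the correct characters, since the first 256 Unicode Code Points are the ISO-8859-1 character set:

SELECT UNCOMPRESS(0xE96174) AS [Iso88591Bytes];
-- éat  (0xE9 = "é", 0x61 = "a", 0x74 = "t" in ISO-8859-1)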

Mentions of UNCOMPRESS()

  1. "The Guru's Guide to SQL Server Stored Procedures, XML, and HTML" by Ken Henderson. Copyright © 2002 by Ken Henderson.
  2. "SQL Server 2000 Stored Procedures Handbook" by Robin Dewson, Louis Davidson, Tony Bain, and Chuck Hawkins. Copyright © 2003 by Apress.
  3. "Encrypt all objects" (SQLTeam forum), post by sterobhun on 2002-07-24 at 06:29:09.
  4. "Undocumented uncompress() function behaves strangely" (Azure Feedback).
  5. "Undoing compression" (SQL Server Central forum).

  1. Technically, some of the undocumented built-in functions, such as CLOUD_DATABASEPROPERTYEX, return errors similar to:
    Msg 195, Level 15, State 10, Line XXXXX
    'CLOUD_DATABASEPROPERTYEX' is not a recognized built-in function name.
    These functions can be executed, but only from within a stored procedure marked as being a "system stored procedure". However, system stored procedures are a topic for another day 😉.
  2. Just to be clear: the pattern of Unicode Code Point matching the UTF-16 Big Endian encoding only holds true for the first 65,536 Code Points (U+0000 – U+FFFF). 
