What Does the Undocumented UNCOMPRESS Function Do?

(last updated: 2019-03-18 @ 14:55 EST / 2019-03-18 @ 18:55 UTC )

IntelliSense in SQL Server Management Studio (SSMS) can be quite helpful. It can save time typing by presenting possible object or function names. And, it can even show you the signatures (i.e. input and output parameters / return types) of stored procedures and functions.

Another thing that IntelliSense does, even though it was certainly not intentional, is show you the names of (some) undocumented functions. Many of these undocumented functions won't let you execute them, so there's not much to investigate there 1. But, several of these undocumented built-in functions can be executed. One of them is UNCOMPRESS. This function is not to be confused with DECOMPRESS, the companion to COMPRESS, which are actually GUnzip and GZip, respectively (and were introduced in SQL Server 2016, if you haven't seen them before).
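
For reference, the documented pair is easy to verify (COMPRESS produces GZipped VARBINARY(MAX) data, and DECOMPRESS reverses it):

SELECT CONVERT(NVARCHAR(100), DECOMPRESS(COMPRESS(N'Hello'))) AS [RoundTrip];
-- Hello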

No, UNCOMPRESS does something entirely different, but the only documentation is the function signature that appears in SSMS:

built-in function UNCOMPRESS(Param1 varbinary(1)) RETURNS nvarchar(1)

I did a bit of searching around, and all I could find were a couple of references to using it on the ctext column in sys.syscomments, but only if the status column (a bit-masked value) had the “2” bit set (i.e. status & 2 = 2 ). The first two entries in the “Mentions of UNCOMPRESS()” section at the end of this post are books that contain this same info. With only that one clue to go on, I used the following query to find some data that was meant to be passed into this function:

SELECT *
FROM   sys.syscomments
WHERE  [compressed] = 1
OR     [status] & 2 = 2;

I tried in [master], [msdb], and even [MSSQLSystemResource] (in single-user mode) but no rows were ever returned (I tested in SQL Server 2017). Between being used for a deprecated (as of SQL Server 2005) system compatibility view, and that view not even returning rows that would make use of this function, it seems safe to conclude that this function is obsolete in addition to being undocumented.

Why Document?

Why document an undocumented, unsupported, and possibly obsolete function? Well:

  1. because it was a puzzle to solve, and
  2. because it's undocumented, we don't have much (or any) info on it; AND because it shows up in SSMS IntelliSense, people can find it; AND because it can be executed, people might attempt to use it in their code. Therefore, it's important to understand how it works, and why we shouldn't use it (aside from it being "undocumented", and hence unsupported, which is enough to convince some folks, but not everyone).

What Now?

With no clear indications of what the UNCOMPRESS function does, we can at least pass in some simple values to see what comes back, and see if we can make sense of the output. For the following tests, please keep in mind that "8-bit" refers to the VARCHAR, CHAR, and TEXT (deprecated) datatypes. And, "16-bit" refers to the NVARCHAR, NCHAR, NTEXT (deprecated), and XML datatypes.
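
As a trivial illustration of that terminology (each 8-bit character takes 1 byte, and each 16-bit character takes 2 bytes, ignoring Supplementary Characters):

SELECT DATALENGTH('A') AS [8bitBytes],   -- 1
       DATALENGTH(N'A') AS [16bitBytes]; -- 2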

Single Character Tests

We will start with the easiest test by passing in one character.

The first query passes a VARCHAR upper-case “A” (having a value of 0x41) into UNCOMPRESS , and gets back the same character, but with an extra byte of 0x00 added on. This should make sense since this function returns NVARCHAR, which is UTF-16 (characters are either 2 bytes or 4 bytes). The Unicode Code Point is actually U+0041, but SQL Server / Windows / .NET use Little Endian, so the bytes are in reverse order, hence 4100 2 . At this point, the UNCOMPRESS function is doing just what the CONVERT function does, so it seems a little redundant.

The second query passes an NVARCHAR upper-case “B” (having a value of 0x4200) into UNCOMPRESS , and gets back the same character, but this time with an extra two bytes of 0x00 added on. This is not what the CONVERT function does!

-- 8-bit test 1:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'A') AS [Hex"A"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'A')) AS [Uncompressed"A"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'A')))
           AS [HexUncompressed"A"];
/*
Encoding    Hex"A"    Uncompressed"A"    HexUncompressed"A"
8-bit       0x41      A                  0x4100
*/
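
For comparison, a plain CONVERT of the VARCHAR "A" to NVARCHAR produces those same two bytes:

-- 8-bit CONVERT comparison:
SELECT CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), 'A')) AS [HexConverted"A"];
-- 0x4100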

-- 16-bit test 1:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'B') AS [Hex"B"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'B')) AS [Uncompressed"B"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'B')))
           AS [HexUncompressed"B"],
       CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), CONVERT(VARBINARY(10),
           N'B'))) AS [HexConverted"B"];
/*
Encoding   Hex"B"   Uncompressed"B"   HexUncompressed"B"   HexConverted"B"
16-bit     0x4200   B                 0x42000000           0x4200
*/

Two Character Tests

The next set of tests will pass in two characters.

The first query again shows that each VARCHAR byte comes back as the same character, but in the expected UTF-16 LE (Little Endian) encoding.

The second query, passing in two NVARCHAR characters, helps to clarify two things:

  1. Every single byte going into UNCOMPRESS comes back as UTF-16 LE (with the extra 0x00 byte added on). Hence, passing in a character that is already in UTF-16 LE encoding (e.g. “D” being the two bytes 0x44 and 0x00), will have each of its two bytes converted into UTF-16 LE, leaving us with 0x4400 and 0x0000, or 0x44000000 (as you can see in the “HexUncompressed”DD”” field).
  2. Even though strings can store character "0" (i.e. U+0000, (null), \0, 0x00, CHAR(0), NCHAR(0), \x00, and so on), it cannot be displayed, nor can any character after it be displayed (as you can see in the "Uncompressed"DD"" field; a quick demonstration follows the two test queries below). This is most likely due to character "0" being the string-terminator in many languages.
-- 8-bit test 2:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'CC') AS [Hex"CC"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')) AS [Uncompressed"CC"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')))
           AS [HexUncompressed"CC"];
/*
Encoding    Hex"CC"    Uncompressed"CC"    HexUncompressed"CC"
8-bit       0x4343     CC                  0x43004300
*/

-- 16-bit test 2:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'DD') AS [Hex"DD"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')) AS [Uncompressed"DD"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')))
           AS [HexUncompressed"DD"];
/*
Encoding    Hex"DD"       Uncompressed"DD"    HexUncompressed"DD"
16-bit      0x44004400    D                   0x4400000044000000
*/
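
And here is the quick demonstration promised above: character "0" is stored just fine, but nothing from that point on is displayed:

-- string-terminator demonstration:
SELECT N'A' + NCHAR(0) + N'B' AS [Value],
       DATALENGTH(N'A' + NCHAR(0) + N'B') AS [Bytes];
-- [Value] displays as just "A", yet [Bytes] = 6 (all three characters are stored)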

Confirmation

Just to make sure that UNCOMPRESS only works with single-byte characters, let’s pass in something besides two “C”s or two “D”s: Unicode Code Point U+4234 (i.e. 䈴).

SELECT NCHAR(0x4234),      -- 䈴
       UNCOMPRESS(0x4234), -- B4
       NCHAR(0x42),        -- B
       NCHAR(0x34);        -- 4


Which Bytes Are Converted to UTF-16?

Now that we know that we are dealing with single-byte characters, which single-byte characters specifically are they? Are they VARCHAR characters of various code pages? Are they VARCHAR characters from one particular code page? Are they NVARCHAR / UTF-16 characters in the U+0000 through U+00FF range that all have a trailing byte of 0x00? Something else perhaps?

Given that characters with values in the range of 0 – 127 (decimal) / 0x00 – 0x7F (hex) are the same across all code pages / encodings that can be represented in SQL Server, only testing with those (i.e. US English, digits 0 – 9, and some punctuation) often hides / obscures important functional differences. So, we need to test values 128 – 255 / 0x80 – 0xFF across several different code pages / encodings.

Create and Populate Table

The following queries will set up the test data that we need to see (or at least confirm) what is actually happening. Code page 1252 is Latin1 (we are looking at this because it's used in several collations: anything with "Latin1_General" in the name, French, etc.), and code page 1255 is Hebrew (which is distinctly different from 1252, so it will be easy to see differences). Finally, UTF-16 is the encoding used by NVARCHAR data. For each row, we are inserting a single byte in the range of 0x00 – 0xFF into each column. We can then easily compare the resulting character of each byte with the output of UNCOMPRESS.

-- DROP TABLE #Uncompress;
CREATE TABLE #Uncompress
(
    [ValueHex] BINARY(1) NOT NULL PRIMARY KEY,
    [Value] AS (CONVERT(TINYINT, [ValueHex])),
    [CP1252] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC,
    [CP1255] VARCHAR(10) COLLATE Hebrew_100_CI_AS_SC,
    [UTF16] NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC
);

SET NOCOUNT ON;
;WITH cte AS
(
    SELECT TOP (256) CONVERT(BINARY(1), ROW_NUMBER() OVER
                              (ORDER BY (SELECT 0)) - 1) AS [num]
    FROM   master.sys.all_columns
)
INSERT INTO #Uncompress ([ValueHex], [CP1252], [CP1255], [UTF16])
    SELECT num, num, num, num
    FROM   cte;

View the Evidence

The query below will show us the character that each byte represents in each of the three encodings. It also feeds that same byte to the UNCOMPRESS function, and shows the underlying byte representation of each character after that byte is stored in the NVARCHAR column and passed into the UNCOMPRESS function. And, because the characters for each byte in the range of 0x00 – 0x7F are the same across the encodings, the query only returns the 0x80 – 0xFF range (you can easily comment out the WHERE clause to see the boring 0x00 – 0x7F range).

There are additional notes below the query.

SELECT chr.[Value],
       chr.[ValueHex],
       chr.[CP1252],
       CONVERT(BINARY(2), UNICODE(chr.[CP1252])) AS [CP1252],
       chr.[CP1255],
       CONVERT(BINARY(2), UNICODE(chr.[CP1255])) AS [CP1255],
       chr.[UTF16],
       CONVERT(BINARY(2), UNICODE(chr.[UTF16])) AS [UTF16],
       UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS],
       CONVERT(BINARY(2), chr.[UTF16]) AS [UTF16bytes],
       CONVERT(BINARY(2), UNCOMPRESS(chr.[ValueHex])) AS [UNCOMPRESSbytes]
FROM   #Uncompress chr
WHERE  chr.[Value] > 127
ORDER BY chr.[Value];

In the table shown below, please note:

  • The “CP 1252”, “CP 1255”, and “UTF-16” columns are in the form of “Character (Unicode Code Point)”.
  • The right-most two columns — “UTF-16 bytes” and “UNCOMPRESS bytes” — show the actual bytes as they exist in the system, in Little Endian. So, Code Point U+0081 would be represented by the bytes 0x8100.
  • The Unicode Code Point is shown for each character so it will be easier to see which characters are exactly the same as the “UNCOMPRESS” column, especially for the columns that have non-printable characters.
  • Code Points are shown instead of the underlying byte value since the underlying byte value is already shown in the two left-most columns. Meaning, the byte 0xE0 is the character "à" on code page 1252, while that same byte is the character "א" on code page 1255.
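
Before looking at the full table, here is a quick spot-check of a single byte (0xE0), using the #Uncompress table created above:

SELECT chr.[CP1252], chr.[CP1255], chr.[UTF16],
       UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS]
FROM   #Uncompress chr
WHERE  chr.[Value] = 224; -- 0xE0
-- à    א    à    à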


Value (Hex)   CP 1252   CP 1255   UTF-16   UNCOMPRESS   UTF-16 bytes   UNCOMPRESS bytes
128 (0x80) € (20AC) € (20AC) (0080) 8000 8000
129 (0x81) (0081) (0081) (0081) 8100 8100
130 (0x82) ‚ (201A) ‚ (201A) (0082) 8200 8200
131 (0x83) ƒ (0192) ƒ (0192) (0083) 8300 8300
132 (0x84) „ (201E) „ (201E) (0084) 8400 8400
133 (0x85) … (2026) … (2026) (0085) 8500 8500
134 (0x86) † (2020) † (2020) (0086) 8600 8600
135 (0x87) ‡ (2021) ‡ (2021) (0087) 8700 8700
136 (0x88) ˆ (02C6) ˆ (02C6) (0088) 8800 8800
137 (0x89) ‰ (2030) ‰ (2030) (0089) 8900 8900
138 (0x8A) Š (0160) (008A) (008A) 8A00 8A00
139 (0x8B) ‹ (2039) ‹ (2039) (008B) 8B00 8B00
140 (0x8C) Œ (0152) (008C) (008C) 8C00 8C00
141 (0x8D) (008D) (008D) (008D) 8D00 8D00
142 (0x8E) Ž (017D) (008E) (008E) 8E00 8E00
143 (0x8F) (008F) (008F) (008F) 8F00 8F00
144 (0x90) (0090) (0090) (0090) 9000 9000
145 (0x91) ‘ (2018) ‘ (2018) (0091) 9100 9100
146 (0x92) ’ (2019) ’ (2019) (0092) 9200 9200
147 (0x93) “ (201C) “ (201C) (0093) 9300 9300
148 (0x94) ” (201D) ” (201D) (0094) 9400 9400
149 (0x95) • (2022) • (2022) (0095) 9500 9500
150 (0x96) – (2013) – (2013) (0096) 9600 9600
151 (0x97) — (2014) — (2014) (0097) 9700 9700
152 (0x98) ˜ (02DC) ˜ (02DC) (0098) 9800 9800
153 (0x99) ™ (2122) ™ (2122) (0099) 9900 9900
154 (0x9A) š (0161) (009A) (009A) 9A00 9A00
155 (0x9B) › (203A) › (203A) (009B) 9B00 9B00
156 (0x9C) œ (0153) (009C) (009C) 9C00 9C00
157 (0x9D) (009D) (009D) (009D) 9D00 9D00
158 (0x9E) ž (017E) (009E) (009E) 9E00 9E00
159 (0x9F) Ÿ (0178) (009F) (009F) 9F00 9F00
160 (0xA0)   (00A0)   (00A0)   (00A0)   A000 A000
161 (0xA1) ¡ (00A1) ¡ (00A1) ¡ (00A1) ¡ A100 A100
162 (0xA2) ¢ (00A2) ¢ (00A2) ¢ (00A2) ¢ A200 A200
163 (0xA3) £ (00A3) £ (00A3) £ (00A3) £ A300 A300
164 (0xA4) ¤ (00A4) ₪ (20AA) ¤ (00A4) ¤ A400 A400
165 (0xA5) ¥ (00A5) ¥ (00A5) ¥ (00A5) ¥ A500 A500
166 (0xA6) ¦ (00A6) ¦ (00A6) ¦ (00A6) ¦ A600 A600
167 (0xA7) § (00A7) § (00A7) § (00A7) § A700 A700
168 (0xA8) ¨ (00A8) ¨ (00A8) ¨ (00A8) ¨ A800 A800
169 (0xA9) © (00A9) © (00A9) © (00A9) © A900 A900
170 (0xAA) ª (00AA) × (00D7) ª (00AA) ª AA00 AA00
171 (0xAB) « (00AB) « (00AB) « (00AB) « AB00 AB00
172 (0xAC) ¬ (00AC) ¬ (00AC) ¬ (00AC) ¬ AC00 AC00
173 (0xAD) ­ (00AD) ­ (00AD) ­ (00AD) ­ AD00 AD00
174 (0xAE) ® (00AE) ® (00AE) ® (00AE) ® AE00 AE00
175 (0xAF) ¯ (00AF) ¯ (00AF) ¯ (00AF) ¯ AF00 AF00
176 (0xB0) ° (00B0) ° (00B0) ° (00B0) ° B000 B000
177 (0xB1) ± (00B1) ± (00B1) ± (00B1) ± B100 B100
178 (0xB2) ² (00B2) ² (00B2) ² (00B2) ² B200 B200
179 (0xB3) ³ (00B3) ³ (00B3) ³ (00B3) ³ B300 B300
180 (0xB4) ´ (00B4) ´ (00B4) ´ (00B4) ´ B400 B400
181 (0xB5) µ (00B5) µ (00B5) µ (00B5) µ B500 B500
182 (0xB6) ¶ (00B6) ¶ (00B6) ¶ (00B6) ¶ B600 B600
183 (0xB7) · (00B7) · (00B7) · (00B7) · B700 B700
184 (0xB8) ¸ (00B8) ¸ (00B8) ¸ (00B8) ¸ B800 B800
185 (0xB9) ¹ (00B9) ¹ (00B9) ¹ (00B9) ¹ B900 B900
186 (0xBA) º (00BA) ÷ (00F7) º (00BA) º BA00 BA00
187 (0xBB) » (00BB) » (00BB) » (00BB) » BB00 BB00
188 (0xBC) ¼ (00BC) ¼ (00BC) ¼ (00BC) ¼ BC00 BC00
189 (0xBD) ½ (00BD) ½ (00BD) ½ (00BD) ½ BD00 BD00
190 (0xBE) ¾ (00BE) ¾ (00BE) ¾ (00BE) ¾ BE00 BE00
191 (0xBF) ¿ (00BF) ¿ (00BF) ¿ (00BF) ¿ BF00 BF00
192 (0xC0) À (00C0) ְ (05B0) À (00C0) À C000 C000
193 (0xC1) Á (00C1) ֱ (05B1) Á (00C1) Á C100 C100
194 (0xC2) Â (00C2) ֲ (05B2) Â (00C2) Â C200 C200
195 (0xC3) Ã (00C3) ֳ (05B3) Ã (00C3) Ã C300 C300
196 (0xC4) Ä (00C4) ִ (05B4) Ä (00C4) Ä C400 C400
197 (0xC5) Å (00C5) ֵ (05B5) Å (00C5) Å C500 C500
198 (0xC6) Æ (00C6) ֶ (05B6) Æ (00C6) Æ C600 C600
199 (0xC7) Ç (00C7) ַ (05B7) Ç (00C7) Ç C700 C700
200 (0xC8) È (00C8) ָ (05B8) È (00C8) È C800 C800
201 (0xC9) É (00C9) ֹ (05B9) É (00C9) É C900 C900
202 (0xCA) Ê (00CA) ֺ (05BA) Ê (00CA) Ê CA00 CA00
203 (0xCB) Ë (00CB) ֻ (05BB) Ë (00CB) Ë CB00 CB00
204 (0xCC) Ì (00CC) ּ (05BC) Ì (00CC) Ì CC00 CC00
205 (0xCD) Í (00CD) ֽ (05BD) Í (00CD) Í CD00 CD00
206 (0xCE) Î (00CE) ־ (05BE) Î (00CE) Î CE00 CE00
207 (0xCF) Ï (00CF) ֿ (05BF) Ï (00CF) Ï CF00 CF00
208 (0xD0) Ð (00D0) ׀ (05C0) Ð (00D0) Ð D000 D000
209 (0xD1) Ñ (00D1) ׁ (05C1) Ñ (00D1) Ñ D100 D100
210 (0xD2) Ò (00D2) ׂ (05C2) Ò (00D2) Ò D200 D200
211 (0xD3) Ó (00D3) ׃ (05C3) Ó (00D3) Ó D300 D300
212 (0xD4) Ô (00D4) װ (05F0) Ô (00D4) Ô D400 D400
213 (0xD5) Õ (00D5) ױ (05F1) Õ (00D5) Õ D500 D500
214 (0xD6) Ö (00D6) ײ (05F2) Ö (00D6) Ö D600 D600
215 (0xD7) × (00D7) ׳ (05F3) × (00D7) × D700 D700
216 (0xD8) Ø (00D8) ״ (05F4) Ø (00D8) Ø D800 D800
217 (0xD9) Ù (00D9) (F88D) Ù (00D9) Ù D900 D900
218 (0xDA) Ú (00DA) (F88E) Ú (00DA) Ú DA00 DA00
219 (0xDB) Û (00DB) (F88F) Û (00DB) Û DB00 DB00
220 (0xDC) Ü (00DC) (F890) Ü (00DC) Ü DC00 DC00
221 (0xDD) Ý (00DD) (F891) Ý (00DD) Ý DD00 DD00
222 (0xDE) Þ (00DE) (F892) Þ (00DE) Þ DE00 DE00
223 (0xDF) ß (00DF) (F893) ß (00DF) ß DF00 DF00
224 (0xE0) à (00E0) א (05D0) à (00E0) à E000 E000
225 (0xE1) á (00E1) ב (05D1) á (00E1) á E100 E100
226 (0xE2) â (00E2) ג (05D2) â (00E2) â E200 E200
227 (0xE3) ã (00E3) ד (05D3) ã (00E3) ã E300 E300
228 (0xE4) ä (00E4) ה (05D4) ä (00E4) ä E400 E400
229 (0xE5) å (00E5) ו (05D5) å (00E5) å E500 E500
230 (0xE6) æ (00E6) ז (05D6) æ (00E6) æ E600 E600
231 (0xE7) ç (00E7) ח (05D7) ç (00E7) ç E700 E700
232 (0xE8) è (00E8) ט (05D8) è (00E8) è E800 E800
233 (0xE9) é (00E9) י (05D9) é (00E9) é E900 E900
234 (0xEA) ê (00EA) ך (05DA) ê (00EA) ê EA00 EA00
235 (0xEB) ë (00EB) כ (05DB) ë (00EB) ë EB00 EB00
236 (0xEC) ì (00EC) ל (05DC) ì (00EC) ì EC00 EC00
237 (0xED) í (00ED) ם (05DD) í (00ED) í ED00 ED00
238 (0xEE) î (00EE) מ (05DE) î (00EE) î EE00 EE00
239 (0xEF) ï (00EF) ן (05DF) ï (00EF) ï EF00 EF00
240 (0xF0) ð (00F0) נ (05E0) ð (00F0) ð F000 F000
241 (0xF1) ñ (00F1) ס (05E1) ñ (00F1) ñ F100 F100
242 (0xF2) ò (00F2) ע (05E2) ò (00F2) ò F200 F200
243 (0xF3) ó (00F3) ף (05E3) ó (00F3) ó F300 F300
244 (0xF4) ô (00F4) פ (05E4) ô (00F4) ô F400 F400
245 (0xF5) õ (00F5) ץ (05E5) õ (00F5) õ F500 F500
246 (0xF6) ö (00F6) צ (05E6) ö (00F6) ö F600 F600
247 (0xF7) ÷ (00F7) ק (05E7) ÷ (00F7) ÷ F700 F700
248 (0xF8) ø (00F8) ר (05E8) ø (00F8) ø F800 F800
249 (0xF9) ù (00F9) ש (05E9) ù (00F9) ù F900 F900
250 (0xFA) ú (00FA) ת (05EA) ú (00FA) ú FA00 FA00
251 (0xFB) û (00FB) (F894) û (00FB) û FB00 FB00
252 (0xFC) ü (00FC) (F895) ü (00FC) ü FC00 FC00
253 (0xFD) ý (00FD) (200E) ý (00FD) ý FD00 FD00
254 (0xFE) þ (00FE) (200F) þ (00FE) þ FE00 FE00
255 (0xFF) ÿ (00FF) (F896) ÿ (00FF) ÿ FF00 FF00


Do Large Values / LOBs Work?

We now know what this function does to the bytes that are passed in. But how many bytes are allowed to be passed in? Some functions do not work with the MAX types. Could this be one of them? That should be easy to find out:

DECLARE @LargeValue VARBINARY(8000);

SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4000));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4000          8000

SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4001));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4001          NULL

Nope. While the input type is not a MAX type, the output type would need to be NVARCHAR(MAX) in order to hold the 8002 bytes taken up by those 4001 characters.



Microsoft Access 2000 “Unicode Compression”

Mitch Schroeter suggested to me that perhaps the UNCOMPRESS function was intended to work on data coming directly from Access 2000 (or newer) and compressed via the WITH COMPRESSION option of the CREATE TABLE statement. The documentation for CREATE TABLE (for Microsoft Access, not SQL Server) states the following towards the end of the Remarks section:

The WITH COMPRESSION attribute was added for CHARACTER columns because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft Jet databases that contain predominately character data, this could mean that the database file would nearly double in size when converted to the Microsoft Access database engine format. However, Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte. If you define a CHARACTER column with this attribute, data will automatically be compressed as it is stored and uncompressed when retrieved from the column.

While this does sound similar, it is not the exact same compression that the UNCOMPRESS function expects. There is some overlap in the behavior, but the UNCOMPRESS function is more simplistic than Access’s “Unicode Compression” (that term is in quotes because it is not true Unicode Compression). If Access was doing nothing more than removing the “0x00” bytes, then there would be no way to determine when to add them back in upon uncompressing; very few of the 65,536 two-byte code points have 0x00 bytes, so any algorithm will need to deal with non-compressible code points. There needs to be an indicator of some sort to tell the parser when a byte should be prefixed with a 0x00, appended with 0x00, or left alone. For example, if it encounters two bytes — 0xD5E2 — should the next two bytes of output be: 0x00D5, 0xD500, or 0xD5E2? We need more info to figure this out.

I found a technical description of the algorithm used to do the Microsoft Access style “Unicode Compression” and the resulting format. It’s described in this ticket for a Java-based MS Access library:

https://sourceforge.net/p/jackcess/bugs/111/

But that code at the top is from another library, libmdb, whereas the ticket is for Jackcess. The Jackcess implementation can be found here:

https://sourceforge.net/p/jackcess/code/HEAD/tree/jackcess/trunk/src/main/java/com/healthmarketscience/jackcess/impl/ColumnImpl.java#l1483

Between those two algorithms, it is clear that the 0x00 byte is used to toggle compression on and off. Here are the relevant lines of code in Jackcess (the toggling is handled by the check for the 0x00 byte, the call to decodeTextSegment, and the flipping of inCompressedMode):

      boolean inCompressedMode = true;
      while(dataEnd < data.length) {
        if(data[dataEnd] == (byte)0x00) {

          // handle current segment
          decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
                            textBuf);
          inCompressedMode = !inCompressedMode;
          ++dataEnd;
          dataStart = dataEnd;

        } else {
          ++dataEnd;
        }

That is the indicator I was referring to. Knowing how the algorithm works, we can create a test to see if UNCOMPRESS can handle that format:

SELECT UNCOMPRESS(0x31323334616263313233);
-- 1234abc123

SELECT UNCOMPRESS(0x313233340061626300313233);
-- 1234

SELECT UNCOMPRESS(0xFFFE313233340061626300313233);
-- ÿþ1234

As you can see, it cannot. With or without the "is the string compressed or not" indicator set (the 0xFFFE as the first two bytes in the third test, which is how Access knows whether the data is compressed, and which just happens to be the Byte Order Mark for UTF-16 LE), UNCOMPRESS interprets the 0x00 byte as the "null" string terminator (\0 in many languages) instead of turning off the compression and interpreting the next 2 bytes as a single UTF-16 code unit.

Side notes about MS Access “Unicode Compression”, not related to UNCOMPRESS

  1. This Access approach to compression uses 0x00 bytes to toggle compression on and off, and the 0x00 bytes have been removed from the original data so this should be a reliable process, right? But what about Unicode code points in which the second byte is 0x00, such as code point U+5000 (i.e. “倀”)? If that 0x00 byte remains, then that would greatly complicate the parsing. But Access can store Unicode characters, so this scenario is being handled in some way. To figure out how, I created a test Access DB, created a table with a text field, and entered the following value (containing U+5000):
     
    Bob***ŴĤŶ倀健伷倀Œ***

    When I viewed the contents of the Access .accdb file, I saw the following bytes stored for that value:

    4200 6F00 6200 2A00 2A00 2A00 7401 2401 7601 0050 6550 374F 0050 5001 2A00 2A00 2A00
B o b * * * Ŵ Ĥ Ŷ 倀 健 伷 倀 Œ * * *

    What's interesting about that sequence of bytes is that it is not compressed (this field did not start with 0xFFFE). Yet entering the same string without the two U+5000 characters will show up as compressed in the data file (that field did start with 0xFFFE). This means that the MS Access way of handling strings containing code points in which the second byte is 0x00 is to simply not compress them at all.

  2. The documentation quoted at the beginning of this section mentioned: “Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte“. This is incorrect. Yes, Unicode does represent many character sets. However, there are not multiple character sets that are being compressed into a single byte by this method of compression. As we saw earlier, in the Which Bytes Are Converted to UTF-16? section, only the first 256 code points can be compressed, and those are the same as the ISO-8859-1 Single Byte Character Set (which again, is roughly the same as the Windows-1252 character set). Yet the Hebrew characters, also from a Single Byte Character Set (Windows-1255), all mapped to code points above U+00FF and cannot be compressed (in this method). So, only a single character set, Latin 1, can be compressed (using this algorithm).
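
A quick check confirms that second side note: even though the Hebrew characters come from a Single Byte Character Set, their Unicode Code Points are well above U+00FF, so there is no single byte that this scheme could compress them into:

SELECT UNICODE(N'א') AS [CodePoint],               -- 1488 (i.e. U+05D0)
       CONVERT(BINARY(2), N'א') AS [UTF16LEbytes]; -- 0xD005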


Official Unicode Compression

Just to be clear, the compressed format that UNCOMPRESS is working with is, again, a proprietary compression technique that is not in any way related to actual Unicode Compression. The official Unicode compression is described in "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode". I did test with the Example 9.2 data from that Unicode.org page:

DECLARE @Russian VARBINARY(50) = 0x129CBEC1BAB2B0;
SELECT DATALENGTH(@Russian) AS [SourceByteCount],
       DATALENGTH(UNCOMPRESS(@Russian)) AS [UncompressedByteCount],
       CONVERT(VARBINARY(50), UNCOMPRESS(@Russian)) AS [UncompressedHex],
       UNCOMPRESS(@Russian) AS [UncompressedCharacters];
/*
SourceByteCount:            7
UncompressedByteCount:     14
UncompressedHex:           0x12009C00BE00C100BA00B200B000
UncompressedCharacters:    <DC2>œ¾Áº²°
*/

The uncompressed value is supposed to be: Москва


Conclusion

Observations

  1. For some bytes, the “CP 1255” value is the same as the “UNCOMPRESS” value.
  2. For most bytes, the "CP 1252" value is the same as the "UNCOMPRESS" value. It's only the characters in the 0x80 – 0x9F range that are different. That range is the C1 Control characters (more info: "C1 Controls and Latin-1 Supplement" and "Latin-1 Supplement (0080 – 00FF)").
  3. For all bytes, the “UTF-16” value is the same as the “UNCOMPRESS” value.
  4. This only works on the first 256 Unicode Code Points, so it is a very limited, simple form of compression.
  5. This function is limited to returning at most 4000 characters / 8000 bytes.
  6. There is no built-in function to produce the “compressed” binary from valid UTF-16 data.
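
To make observation #6 concrete, the following is a minimal sketch of what such a "compressor" would have to do: keep only the low byte of each UTF-16 LE code unit. It assumes (and does not validate) that every character is in the U+0000 – U+00FF range, the only range that UNCOMPRESS can round-trip:

-- hypothetical "compressor" sketch (there is no built-in equivalent):
DECLARE @Input      NVARCHAR(4000) = NCHAR(0x91) + N'Hello' + NCHAR(0xE9),
        @Compressed VARBINARY(4000) = 0x,
        @Index      INT = 1;

WHILE (@Index <= LEN(@Input))
BEGIN
    -- keep the low byte of the current code unit (assumes Code Point <= U+00FF)
    SET @Compressed += CONVERT(BINARY(1), UNICODE(SUBSTRING(@Input, @Index, 1)));
    SET @Index += 1;
END;

SELECT @Compressed AS [Compressed],
       UNCOMPRESS(@Compressed) AS [RoundTripped];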

Lessons Learned

The UNCOMPRESS function does nothing more than add a 0x00 byte to each byte passed in, the result of which is valid UTF-16 Little Endian (i.e. NVARCHAR ) data. Given the various limitations of this function, and the fact that you would have to write your own function to "compress" NVARCHAR data into this format (a simple CONVERT won't work unless you can guarantee that none of the characters found in the 0x80 – 0x9F range exist in the input data):

DO NOT USE THIS FUNCTION!!

If you need compression because your data is mostly Code Points in the range of 0 – 255 / U+0000 – U+00FF / 0x00 – 0xFF, then look into Data Compression and Clustered Columnstore Indexes.
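
For example, Row compression (which, starting in SQL Server 2008 R2, also applies an SCSU-based Unicode compression to NCHAR / NVARCHAR columns) can be enabled with a single statement (the table name here is hypothetical):

ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = ROW);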

Does this function do essentially the same thing as UTF-8? Not really. UTF-8 only has the first 128 characters / Code Points (i.e. 0x00 – 0x7F) as single-byte characters. And if that is all your data is, then you can simply convert to VARCHAR. And while UTF-8 is supported natively starting in SQL Server 2019, there are limited scenarios where you should use UTF-8 (within SQL Server, that is). And, even when you do save space, you will most likely sacrifice performance (to varying degrees). For more info on UTF-8 support in SQL Server, please see:

Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
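
And to illustrate the "simply convert to VARCHAR" point from a moment ago: if every character is in the U+0000 – U+007F range, the conversion is lossless and cuts the storage in half:

DECLARE @Data NVARCHAR(100) = N'Hello, World!';
SELECT DATALENGTH(@Data) AS [NVarcharBytes],                       -- 26
       DATALENGTH(CONVERT(VARCHAR(100), @Data)) AS [VarcharBytes]; -- 13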

The one potentially valid use for UNCOMPRESS is if you have ISO-8859-1-encoded data that is already in binary format and is no more than 4000 bytes / characters. This would work due to the first 256 Unicode Code Points being the ISO-8859-1 character set. However, even if you have data that fits this description, it would be better to convert it to Unicode / UTF-16 Little Endian prior to importing it into SQL Server. Or, if the data does not contain any bytes in the range of 0x80 – 0x9F, then just import it into a VARCHAR column that is using any of the various collations associated with code page 1252.
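
For example (and only if the data truly is ISO-8859-1), the bytes come back as the correct characters, since the first 256 Unicode Code Points are the ISO-8859-1 character set:

SELECT UNCOMPRESS(0xE96174) AS [Iso88591Bytes];
-- éat  (0xE9 = "é", 0x61 = "a", 0x74 = "t" in ISO-8859-1)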

Mentions of UNCOMPRESS()

  1. "The Guru's Guide to SQL Server Stored Procedures, XML, and HTML" by Ken Henderson. Copyright © 2002 by Ken Henderson.
  2. "SQL Server 2000 Stored Procedures Handbook" by Robin Dewson, Louis Davidson, Tony Bain, and Chuck Hawkins. Copyright © 2003 by Apress.
  3. "Encrypt all objects" (SQLTeam forum), post by sterobhun on 2002-07-24 at 06:29:09.
  4. "Undocumented uncompress() function behaves strangely" (Azure Feedback).
  5. "Undoing compression" (SQL Server Central forum).

  1. Technically, some of the undocumented built-in functions, such as CLOUD_DATABASEPROPERTYEX, return errors similar to:
    Msg 195, Level 15, State 10, Line XXXXX
    'CLOUD_DATABASEPROPERTYEX' is not a recognized built-in function name.
    These functions can be executed, but only from within a stored procedure marked as being a "system stored procedure". However, system stored procedures are a topic for another day 😉.
  2. Just to be clear: the pattern of Unicode Code Point matching the UTF-16 Big Endian encoding only holds true for the first 65,536 Code Points (U+0000 – U+FFFF). 
