Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?


(last updated: 2018-10-17 @ 12:20 EDT / 2018-10-17 @ 16:20 UTC )

For some time now, many of us have struggled with data that is mostly standard US-English / ASCII characters but also needs to allow for the occasional non-ASCII character. VARCHAR is typically one byte per character but can’t represent a wide range of accented characters all at the same time (more than would be found on a single 8-bit Code Page). NVARCHAR, being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes. When we have data that is most often standard ASCII, but has the potential (whether it ever happens or not) to have non-ASCII characters (names, URLs, etc.), we have no choice but to use NVARCHAR. However, then we are wasting space for all of the data that could fit into VARCHAR and take up half as much space. While some claim that “disk is cheap”, wasting space negatively impacts query performance, backup size, backup and restore times, etc.

One approach is to have two columns for the data, one for each datatype. The idea is to store values that are 100% ASCII in the VARCHAR column, and anything else in the NVARCHAR. Then you can have a computed column return the non-NULL column for that data. While this approach does work, it is not something you are going to implement in 50 or more columns.
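
For example, a minimal sketch of that approach might look like the following (the table and column names here are hypothetical):

CREATE TABLE dbo.Customers
(
    CustomerID   INT NOT NULL,
    Name_Ascii   VARCHAR(100) NULL,  -- populated only when the value is 100% ASCII
    Name_Unicode NVARCHAR(100) NULL, -- populated only when non-ASCII characters exist
    [Name] AS (COALESCE(Name_Unicode, CONVERT(NVARCHAR(100), Name_Ascii)))
);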

So, thankfully, Microsoft provided a solution to this problem.

And it is…

(drum-roll please)…

SQLCLR (introduced in SQL Server 2005). This allows one to use .NET to access GZip/Deflate functionality such that you can compress data into VARBINARY(MAX) on the way in, and uncompress back to NVARCHAR(MAX) on the way out. And you can even get pre-done versions of these — Util_GZip and Util_GUnzip — in the Free version of SQL# :-). Of course, some folks don’t like SQLCLR (for some reason), and you need to decompress in order to view / search / filter on the value, and this might be a bit much to do on many columns (though not as bad as having both VARCHAR and NVARCHAR versions of the column).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Data Compression (introduced in SQL Server 2008). But this was a bit simplistic, only available in Enterprise Edition, and didn’t work for off-row values (i.e. row overflow and MAX values that can’t fit in-row).
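
For reference, enabling it looks something like the following (the table and index names here are hypothetical):

ALTER TABLE dbo.SomeTable
    REBUILD WITH (DATA_COMPRESSION = PAGE); -- or ROW

-- or, per index:
ALTER INDEX [IX_SomeTable_SomeColumn]
    ON dbo.SomeTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);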

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Unicode Compression (introduced in SQL Server 2008 R2, and implemented as part of Data Compression). But this was still only available in Enterprise Edition, and still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Compression for all editions, not just Enterprise (as of SQL Server 2016, SP1). But, this still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

COMPRESS and DECOMPRESS built-in functions (introduced in SQL Server 2016). You can use these to Gzip and Ungzip values (without resorting to SQLCLR), and store the much smaller VARBINARY value. Do this in stored procedures to be transparent to the caller. But this is not for everyone, and this might be a bit much to do on many columns.
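
For example, a rough sketch of that pattern (the table and column names here are hypothetical; COMPRESS returns a Gzipped VARBINARY(MAX), and DECOMPRESS returns a VARBINARY(MAX) that must be converted back to the original string type):

DECLARE @Comments TABLE (CommentID INT, CommentGz VARBINARY(MAX));

INSERT INTO @Comments (CommentID, CommentGz)
VALUES (1, COMPRESS(N'Mostly ASCII text that could be quite long...'));

SELECT CommentID,
       CONVERT(NVARCHAR(MAX), DECOMPRESS(CommentGz)) AS [Comment]
FROM   @Comments;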

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Support for MAX types in Clustered Columnstore Indexes (as of SQL Server 2017). But, this is not for every scenario, especially because it affects the whole table.
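
Creating one is simple enough (hypothetical table and index names below), but, again, it changes the storage of the entire table:

CREATE TABLE dbo.Documents
(
    DocumentID INT IDENTITY(1, 1) NOT NULL,
    Content    NVARCHAR(MAX) NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Documents]
    ON dbo.Documents;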

So, thankfully, Microsoft provided a solution to this problem.

And it is…

UTF-8 Encodings (introduced in the upcoming SQL Server 2019). UTF-8 appears to be the ideal solution: standard ASCII characters take up only 1 byte, all Unicode characters are supported, the value is still a string (not binary) so it can be searched / filtered on / indexed / etc, and it is the preferred encoding for HTML and XML files. No need for an extra column plus a computed column, no need for SQLCLR or COMPRESS, no binary values, and no getting frustrated that Unicode Compression doesn’t work with NVARCHAR(MAX). What’s not to love?

Let’s explore…

TL;DR: While interesting, the new UTF-8 Collations only truly solve a rather narrow problem, and are currently too buggy to use with confidence, especially as a database’s default Collation. These encodings really only make sense to use with NVARCHAR(MAX) data, if the values are mostly ASCII characters, and especially if the values are stored off-row. Otherwise you are better off using Data Compression.

Basics

Documentation

Let’s start with what we are told about this new feature. According to the documentation, the new UTF-8 Collations:

  1. can be used …
    1. as a database-level default Collation
    2. as a column-level Collation (see the sketch just after this list)
    3. by appending “_UTF8” to the end of any Supplementary Character-Aware Collation (i.e. either having “_SC” in their name, or being of level 140 or newer)
    4. with only the CHAR and VARCHAR datatypes
  2. (implied) have no effect on NCHAR and NVARCHAR data (meaning: for these types, the UTF-8 Collations behave the same as their non-UTF-8 equivalents)
  3. “This feature may provide significant storage savings, depending on the character set in use.” (emphasis mine)
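
For example, a column-level usage (hypothetical table name) would look like:

CREATE TABLE dbo.UTF8Test
(
    [Value] VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NULL
);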

Installing

When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is still “SQL_Latin1_General_CP1_CI_AS”. Why, why, WHY???? Why not use “Latin1_General_100_CI_AS_SC”, which has the same sensitivities but is newer, has sort weights for more characters, supports Supplementary Characters, and is a Windows Collation (hence you don’t prevent seeks on indexed VARCHAR columns when compared to Unicode data)? Sure, changing the default Collation after all these years will result in some headaches when needing to work with existing systems (that didn’t bother to change the default Collation), but keeping it ensures that new projects started by folks who don’t know to change it will spread it farther and farther throughout the galaxy, tightening its stranglehold on us (i.e. making it even harder to move away from). But that has nothing to do with UTF-8, so we can ignore it for now ( <grrrrr> )

On the “Customize Database Engine Collation” screen, when selecting a Windows Collation:

  1. Collations without a version number do not allow checking either the “Supplementary characters” or “Variation selector-sensitive” options.
  2. Collations with version numbers “90” or “100” allow for checking the “Supplementary characters” option, but not the “Variation selector-sensitive” option.
  3. Collations with version number “140” have the “Supplementary characters” option checked (and cannot be unchecked), and allow for checking the “Variation selector-sensitive” option.
  4. There is no means of selecting a UTF-8 Collation, even if the “Supplementary characters” option is checked.

Using

By executing the queries below, we can confirm that adding _UTF8 to the end of Supplementary Character-Aware Collations does produce valid Collation names. I executed the following in [master] (which is guaranteed to not be using a UTF-8 Collation as its default Collation since there was no way to install SQL Server with one):

-- Version 90 (SQL 2005; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Japanese_90_CS_AS_SC_UTF8;

-- Version 100 (SQL 2008; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Version 140 (SQL 2017; no _SC version since it is implicitly supported)
SELECT 'a' COLLATE Japanese_XJIS_140_CS_AS_UTF8;

All of the above queries returned successfully.

The following query will list all of the UTF-8 Collations:

SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
ORDER BY col.[name];
-- 1552 rows, all with CodePage = 65001

There are 5507 total Collations in SQL Server 2019, which is 1552 more than the 3955 that exist in SQL Server 2017. Hence the only new Collations are these “_UTF8” Collations.

If you execute the query above, do you notice anything missing? That’s right: no binary options for the “_UTF8” Collations (this is a mistake that needs to be addressed).

The following query shows that we can indeed use VARCHAR to store Unicode characters, but only when using one of the new UTF-8 Collations:

SELECT NCHAR(27581), -- 殽
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC), -- ?
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC_UTF8); -- 殽

The following query confirms that the new UTF-8 Collations cannot be used with the TEXT datatype:

SELECT CONVERT(TEXT, N'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8);

returns:

Msg 4189, Level 16, State 0, Line XXXXX
Cannot convert to text/ntext or collate to 'Latin1_General_100_CI_AS_SC_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.

The following query shows that the NVARCHAR value is the same between the UTF-8 and non-UTF-8 Collation, and the same character can be held in VARCHAR, but only when using a UTF-8 Collation, and as a different underlying byte sequence:

DECLARE @Sample NVARCHAR(1) = NCHAR(0x5010);
SELECT @Sample; -- 倐

SELECT CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- 0x1050
       CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 0x1050

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- ?

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 倐
       CONVERT(VARBINARY(2), CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)); -- 0xE580;

And, the following demonstrates that a UTF-8 Collation can be used at the database level:

CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;

USE [UTF8];

The USE statement is there because the remaining examples that follow in this post will be executed from within the “UTF8” database.

General UTF-8 Info

UTF-16 (for context)

UTF-16, which is the Unicode encoding used for NVARCHAR data, uses 16-bit (i.e. 2-byte) values, known as “code units“, to represent characters. These “code units” are used individually (a 2-byte code unit), or in pairs (2 two-byte code units) to get Supplementary Characters, to map to individual “code points“. UCS-2, which only maps the BMP characters (i.e. the first 65,536 code points), is a fixed-length encoding because it only deals with single code units. UTF-16, which uses the exact same code units as UCS-2, can use certain combinations of two of those code units (called Surrogate Pairs) to map the non-BMP characters (i.e. code points 65,537 and above, known as Supplementary Characters).
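
For example, the two code units 0xD83C and 0xDF2D form the Surrogate Pair for the Supplementary Character U+1F32D (this works in any Collation since each individual NCHAR() value here is below 0x10000):

SELECT NCHAR(0xD83C) + NCHAR(0xDF2D) AS [SurrogatePair],          -- 🌭 ( U+1F32D )
       DATALENGTH(NCHAR(0xD83C) + NCHAR(0xDF2D)) AS [TotalBytes]; -- 4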

Because these values are two bytes, they are subject to “endianness”, which refers to the order in which the individual bytes that make up the value are stored internally (in memory, on disk, etc). The hardware determines the “endianness” of how multi-byte values (including INT, SMALLINT, BIGINT) are stored, and the two possibilities are “Big Endian”, which is the given / natural ordering (i.e. value 0x1234 — where 0x12 is byte #1, and 0x34 is byte #2 — is stored as 0x1234), or “Little Endian”, which refers to storing the bytes in reverse-order (i.e. value 0x1234 is stored as 0x3412).

Unicode strings in Windows, “string” in .NET, and NVARCHAR in SQL Server are all UTF-16 Little Endian.
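
A quick way to see the Little Endian storage is to look at the underlying bytes of a simple character:

SELECT CONVERT(VARBINARY(2), N'A') AS [UTF16LE]; -- 0x4100 (Code Point U+0041, bytes reversed)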

UTF-8

UTF-8 is another Unicode encoding which uses between one and four 8-bit (i.e. 1-byte) code units to represent characters. Even though it is Unicode, due to working 8-bits at a time, this encoding is considered VARCHAR data. This is also a variable-length encoding because it requires anywhere from 1 to 4 code units to map to any given code point. Due to these code units being single bytes (even in combinations of 2, 3, or 4), there is no concept of “endianness”.

One main advantage of UTF-8 is that it’s fully compatible with any existing data that’s comprised entirely of 7-bit ASCII characters (values 0 – 127). This is by design, but not really a reason for SQL Server to use it for storing data internally. The other main advantage of UTF-8 is that it can take up half the space while still allowing for the full range of Unicode characters. This would be a reason for SQL Server to use it internally. BUT, it only takes up half the space for a small subset of characters. Of course, if the characters in this small subset (the ones that take only 1 byte) make up the vast majority of the characters being used, then this encoding does provide for significant space savings (in terms of disk space, memory, and reduced network traffic). What people tend to overlook, though, is that most Unicode characters require 3 bytes, which is 1 more byte than UTF-16 for those same characters.

You can execute the following query in SQL Server 2019 to see how many bytes each character requires for both UTF-8 and UTF-16 encodings. It returns all 65,536 BMP (Basic Multilingual Plane) characters (which is also the entire UCS-2 character set), and 3 Supplementary Characters. Since all Supplementary Characters are 4 bytes in both encodings, there is no need to return more of them, but we do need to see a few of them to see that they are a) all 4 bytes, and b) encoded slightly differently. The query will work the same regardless of whether or not the current database uses a “_UTF8” Collation. It also does not matter if the database’s default Collation is Supplementary Character-Aware or not.

;WITH nums([CodePoint]) AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1)
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
), chars AS
(
  SELECT nums.[CodePoint],
         CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   nums
  UNION ALL
  SELECT tmp.val,
         CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex)
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5),
              tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   (VALUES (65536, 0x00D800DC), -- Linear B Syllable B008 A (U+10000)
                 (67618, 0x02D822DC), -- Cypriot Syllable Pu (U+10822)
                 (129384,0x3ED868DD)  -- Pretzel (U+1F968)
         ) tmp(val, hex)
)
SELECT chr.[CodePoint],
       COALESCE(chr.[TheChar], N'TOTALS:') AS [Character],
       chr.[UTF8] AS [UTF8_Hex],
       DATALENGTH(chr.[UTF8]) AS [UTF8_Bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 1 THEN 'x' END) AS [1-byte],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 2 THEN 'x' END) AS [2-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 3 THEN 'x' END) AS [3-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 4 THEN 'x' END) AS [4-bytes],
      ---
      CONVERT(VARBINARY(4), CONVERT(NVARCHAR(3), chr.[TheChar]))
         AS [UTF16(LE)_Hex],
      DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16_Bytes],
      ---
      ((DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar]))) -
            (DATALENGTH(chr.[TheChar]))) AS [UTF8savingsOverUTF16]
FROM   chars chr
GROUP BY ROLLUP ((chr.[CodePoint], chr.[TheChar], chr.[UTF8]));

There isn’t enough room here to show all 65,538 rows, so here is a summary of the info:

Code Points   | Code Point Quantity | UTF-8 Size (bytes) | UTF-16 Size (bytes) | UTF-8 savings
0 – 127       | 128                 | 1                  | 2                   | 1
128 – 2047    | 1920                | 2                  | 2                   | 0
2048 – 65535  | 63488               | 3                  | 2                   | -1
65536 – lots  |                     | 4                  | 4                   | 0

As you can see, only the first 128 Code Points offer any size savings over UTF-16. The characters in that range are (minus the first 32 which are control characters, and the last character):

[space] ! " # $ % & ' (
) * + , - . / 0 – 9 :
; < = > ? @ A – Z [ \
] ^ _ ` a – z { | } ~

Those characters, and their values (32 – 126), are exactly the same as the standard ASCII character set, and are the same across most code pages (certainly all code pages supported by SQL Server). Any characters not in the chart above do not offer any space savings, with most of the BMP characters actually taking up more space than they would if using UTF-16 (i.e. NVARCHAR ).
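
A quick illustration, assuming the current database uses a “_UTF8” default Collation (string literals without the "N" prefix pick up the database’s default Collation):

SELECT DATALENGTH('Hello!')  AS [UTF8_Bytes],   -- 6  (all standard ASCII = 1 byte each)
       DATALENGTH(N'Hello!') AS [UTF16_Bytes],  -- 12 (2 bytes each)
       DATALENGTH('Hellö!')  AS [UTF8_Mixed];   -- 7  ("ö" / U+00F6 requires 2 bytes in UTF-8)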

Setting and Storing UTF-8 Values

An encoding is how the characters are represented internally. Whether the source is a non-Unicode string, a Unicode string, or a binary string (i.e. series of bytes / hex bytes), the end result will be the chosen encoding’s value for that character.

First, let’s try setting a variable, in the “[UTF8]” database, from a VARBINARY literal:

DECLARE @HotDog VARCHAR(4) = 0xF09F8CAD;
PRINT @HotDog; -- ( U+1F32D )

Returns:   🌭 (in the “Messages” tab)

As you can see, it produces the expected character from the given 4 bytes. However, those same 4 bytes do not produce the correct character when stored as NVARCHAR, which expects UTF-16 Little Endian byte sequences:

DECLARE @NotHotDog NVARCHAR(4) = 0xF09F8CAD;
PRINT @NotHotDog; -- ( U+1F32D )
-- 鿰권

Now let’s look at how the size of the (N)VARCHAR types impact what can be stored in them:

SELECT CONVERT( VARCHAR(3), 0xF09F8CAD) AS [TooSmallForUTF8-HotDog],
     DATALENGTH(CONVERT(VARCHAR(3), 0xF09F8CAD)) AS [ByteCountForTooSmall],
       CONVERT( VARCHAR(4), 0xF09F8CAD) AS [UTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(4), 0xF09F8CAD)) AS [ByteCountForHotDog],
       ------
       CONVERT(NVARCHAR(1), 0x3CD82DDF) AS [TooSmallForUTF16-HotDog],
       CONVERT(NVARCHAR(2), 0x3CD82DDF) AS [UTF16-HotDog];

Returns:

TooSmallForUTF8-HotDog | ByteCountForTooSmall | UTF8-HotDog | ByteCountForHotDog | TooSmallForUTF16-HotDog | UTF16-HotDog
                       | 0                    | 🌭          | 4                  | �                       | 🌭

The first column appears to return nothing, not even a partial, incorrect / unknown character. The second column confirms that there is absolutely nothing resulting from attempting to store a 4-byte UTF-8 value in a VARCHAR type that is only 3 bytes. It couldn’t even take just the first 3 bytes of the 4-byte value. The third and fourth columns show that giving the type the required 4 bytes allows everything to work as expected.

On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR type that is too small, the result is the default Replacement Character ( U+FFFD ). This is because it was able to store the first half of the Surrogate Pair (the 0x3CD8), which is meaningless on its own, but it is still a valid byte sequence.

Size and Performance

{ This section is taking a while as I figure out how to best present the variety of tests and their results, but the general info is all there. }

Data Profiles

  1. Some ASCII (values 1 – 254), but mostly non-ASCII
    • [N]VARCHAR(500)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
  2. Some ASCII MAX
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
       
  3. All ASCII
    • [N]VARCHAR(500)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
  4. All ASCII MAX (all in row)
    • [N]VARCHAR(MAX)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
       
  5. All ASCII MAX (10k characters per row; all off-row)
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 10,007; AVG = 10,025; MAX = 10,061
  6. All ASCII MAX (5k characters per row; mixed in / off-row; UTF-8 = in-row, NVARCHAR = off-row)
    • [N]VARCHAR(MAX)
    • 21,078 rows
    • Characters per row: MIN = 5,007; AVG = 5,025; MAX = 5,061
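
The actual DDL for the test tables is not shown here, but a minimal sketch of one such pair (assumed names), for the "All ASCII" non-MAX profile, would be something like:

CREATE TABLE dbo.AllAscii_UTF8
(
    [Value] VARCHAR(500) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

CREATE TABLE dbo.AllAscii_UTF16
(
    [Value] NVARCHAR(500) COLLATE Latin1_General_100_CI_AS_SC NOT NULL
);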

Tests:

These were executed in a database which has a “_UTF8” default Collation.

SELECT REVERSE([Value])
FROM dbo.[{table}];


SELECT SUBSTRING([Value], 5, 13)
FROM dbo.[{table}];


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] = 'abcdef12345'; -- UTF16 test uses N'abcdef12345'


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] LIKE '%123%'; -- UTF16 test uses N'%123%'


SELECT MIN(LEN([Value])), AVG(LEN([Value])), MAX(LEN([Value]))
FROM   dbo.[{table}];


SELECT [Value]
FROM   dbo.[{table}];

Each test was wrapped in:

SET STATISTICS TIME ON;
...{test}...
SET STATISTICS TIME OFF;

Results

  1. Some ASCII in non-MAX types
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated better as well, but not tested.
    • Did not test performance given no space savings
  2. Some ASCII in MAX types (in-row)
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated the same, but not tested.
    • Did not test performance given no space savings
       
  3. All ASCII in non-MAX types
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR)
    • UTF-8 is only slightly smaller here when Compression = ROW or PAGE (approx. 1% smaller)
    • NVARCHAR is slightly smaller here when table is COLUMNSTORE (approx. 1% smaller)
    • Performance of NVARCHAR is same or better than UTF-8
  4. All ASCII in MAX types (in-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR; 40% for ROW and PAGE)
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 9% smaller)
    • Performance is sometimes the same, but NVARCHAR is usually better than UTF-8
  5. All ASCII in MAX types (10k; off-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 46% smaller)
    • ROW and PAGE compression do not apply to off-row data
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 1.3% smaller). The NVARCHAR table is still 98% smaller than it was originally.
    • With no compression, NVARCHAR is faster for all tests.
    • When table is CLUSTERED COLUMNSTORE, NVARCHAR is faster for all tests except “SUBSTRING” and “=”.
  6. All ASCII in MAX types (5k; mixed on / off-row: VARCHAR = on, NVARCHAR = off)
    • UTF-8 is smaller here when Compression = NONE, ROW, or PAGE (approx. 25% smaller)
    • UTF-8 is much smaller here when table is CLUSTERED COLUMNSTORE (approx. 92% smaller). BUT, this is typically a non-issue since the NVARCHAR table is still 96% smaller than it was originally. This becomes an issue as the uncompressed data gets closer to 100 GB.
    • With no compression, NVARCHAR is faster for all tests except “LIKE '%x%'”.
    • When table is CLUSTERED COLUMNSTORE, UTF-8 is faster for all tests except REVERSE (elapsed time was the same, but CPU was better)

Typically there is at least a slight performance hit when using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know if it is a complete character or if the next byte is a part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2-byte code units (even Supplementary Characters are comprised of two 2-byte code units, i.e. a Surrogate Pair), so everything can be read in 2-byte chunks.

The area of biggest improvement of UTF-8 over NVARCHAR is when there is between 4001 and 8000 bytes of mostly standard ASCII characters. In this scenario, you would need to use NVARCHAR(MAX) to get over 4000 bytes, but then you lose the ability to use Unicode Compression. And, being over 4000 bytes, the NVARCHAR data must be stored off-row (a slight performance hit), while that same data can be stored in-row when using UTF-8. This is the only scenario where UTF-8 is better for both space and performance.
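
A rough illustration of that sweet spot (assuming a database with a “_UTF8” default Collation):

DECLARE @Utf8  VARCHAR(8000) = REPLICATE('x', 6000),
        @Utf16 NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'x'), 6000);

SELECT DATALENGTH(@Utf8)  AS [UTF8_Bytes],  -- 6000  (fits in-row in VARCHAR(8000))
       DATALENGTH(@Utf16) AS [UTF16_Bytes]; -- 12000 (requires NVARCHAR(MAX); would be pushed off-row in a table)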

Limitations

Known (i.e. Stated in Documentation)

The UTF8 Collations do not work with:

  1. Replication
  2. Linked Servers
  3. In-memory OLTP
  4. External Table for Polybase
  5. UTF8 Collations cannot be used as the Instance-level Collation (** please see next section for the “EXCEPT, …”):
    • Currently wording is that they can only be used at the database and column level, but yes, they can also be used inline via the COLLATE keyword.
    • Not an option via the installer (as mentioned at the top of this post)
    • Cannot be specified via:
      SETUP.EXE /QUIET /ACTION=REBUILDDATABASE ^
      /SQLCOLLATION=Latin1_General_100_CI_AS_SC_UTF8 ^
      ...

      The result is:
      The following error occurred:
      The collation Latin1_General_100_CI_AS_SC_UTF8 was not found.
       
      Error result: -2061893630
      Result facility code: 1306
      Result error code: 2

Unknown (i.e. Oops)

  1. UTF8 Collations can be used as the Instance-level Collation. This cannot be set via the installer or running SETUP.EXE with the /SQLCOLLATION switch, BUT they can be used via the undocumented method (of changing all Collations for an Instance):
    sqlservr -c -m -T4022 -T3659 -s"{instance_name}" ^
    -q"Latin1_General_100_CI_AS_SC_UTF8"

    Yes, this does work. I have tested it. For more details on this method, please see:
    Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
    PLEASE NOTE: Just because you can do this doesn’t mean that you should do this. Given that the only way of setting an Instance to use a UTF8 Collation is via an undocumented method, it is quite possible that this has never been tested internally at Microsoft and is not guaranteed to work. It might work perfectly, or there could be unexpected behaviors. Use at your own risk.
     
  2. There are no binary Collations that support the UTF8 encoding!!! Why not? But why do we need them? Here are some reasons:
    1. Microsoft’s implementation of Unicode has incomplete, and some incorrect, sort weights. One example:
      Choosing a binary collation that can differentiate between ‘ss’ and ‘ß’ for nvarchar column in Sql Server
    2. When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from this post provides a working example: No, Binary Collations are not Case-Sensitive (a sketch of the pre-composed vs. combining diacritic case follows this list)
    3. When storing alphanumeric codes (e.g. airline confirmation codes, phone numbers, postal codes, etc), binary sorting and comparison is much faster since it ignores all linguistic rules, which is perfect for scenarios such as these where there is no expectation of applying such rules.
    4. Finding / removing “(null)” character U+0000 / CHAR(0)
       
  3. Bugs:
    1. NVARCHAR to VARCHAR(MAX) can get Msg 8152 “String or binary data would be truncated”
    2. Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value
    3. Inconsistent handling of invalid UTF-8 byte sequences
    4. CHAR() function broken in databases with “_UTF8” default Collation (not a bug after all; just needs better documentation)
    5. “An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2” (disconnect) — caused by NULL UTF-8 values
    6. Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR
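
Regarding the binary Collation issue (item 2 above), here is a minimal sketch of the pre-composed vs. combining diacritic case: the linguistic Collation sees the two values as equal, while a binary Collation (which, again, does not exist for the UTF-8 encoding) does not:

-- U+00E9 ("é", pre-composed) vs. U+0065 + U+0301 ("e" + combining acute accent)
DECLARE @PreComposed NVARCHAR(5) = NCHAR(0x00E9),
        @Decomposed  NVARCHAR(5) = NCHAR(0x0065) + NCHAR(0x0301);

SELECT CASE WHEN @PreComposed = @Decomposed COLLATE Latin1_General_100_CI_AS_SC
            THEN 'equal' ELSE 'not equal' END AS [LinguisticComparison], -- equal
       CASE WHEN @PreComposed = @Decomposed COLLATE Latin1_General_100_BIN2
            THEN 'equal' ELSE 'not equal' END AS [BinaryComparison],     -- not equal
       DATALENGTH(@PreComposed) AS [PreComposedBytes],                   -- 2
       DATALENGTH(@Decomposed)  AS [DecomposedBytes];                    -- 4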

Recommended Uses / Guidance

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a “_UTF8” encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

  1. Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page)
  2. Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won’t fit into NVARCHAR(4000))
  3. There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR)
  4. Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes
  5. There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of “=”) or manipulated.
  6. The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.

Conclusion

Adding support for UTF-8 is certainly interesting. However, being the preferred encoding for the web, and even the default encoding for Linux / Unix (at least some flavors), doesn’t imply that UTF-8 is truly useful within SQL Server. UTF-8 is a great choice for scenarios where there is:

  1. a need to be fully compatible with standard ASCII (to work with legacy systems / code)
    and / or
    a desire to at least save space when most data is standard ASCII anyway,
  2. a need / desire to support all characters (especially without needing to change Code Pages), and
  3. no ability to use Unicode Compression.

So, within the context of SQL Server, UTF-8 has the potential for helping a very specific, but not unimportant, use case: NVARCHAR(MAX) columns, where Unicode Compression doesn’t work. Still, I think the development / testing time spent on this would have been much better applied to something more useful to more users, such as the following (the first two of which would do a better job of accomplishing the space-saving goal of UTF-8, and without sacrificing performance):

  • getting Unicode Compression (part of Data Compression) to work on at least in-row values of NVARCHAR(MAX), if not also off-row values (please vote for: “unicode compression nvarchar(max)” — this would save more space than UTF-8)
  • getting Data Compression to work for off-row LOB / Row-Overflow values (please vote for: “Page Compression for Out-of-Row LOB Storage” — this would save space for more than just Unicode strings: it would also apply to VARCHAR, VARBINARY, and XML)
  • getting mixed alphanumeric values to sort as expected (i.e. “natural sorting” — “21a” sorts after “3a” instead of before — please vote for: “Support "natural sorting" / DIGITSASNUMBERS as a Collation option“)

Given that it is not fun to change Collations for databases, I would strongly recommend against using the new UTF-8 Collations, at least not at the database level, and definitely not in Production. With the obviously missing binary versions, and 5 bugs found in only a couple of days of testing for this post (not testing a full, complex application), this feature is incomplete, and there might yet be more bugs / odd behavior. I would suggest, if it looks like it will take quite a bit of development time to fix these six issues, then perhaps just remove this feature entirely (no big loss as it provides very little, if any, benefit) and spend that time on one of the three items noted directly above (think of how many customers / projects would benefit from Data Compression working for off-row LOB / overflow data).

If you are going to use a UTF-8 Collation on one or more columns, make no assumptions about size reduction or better performance! Be sure to capture current size and performance stats while the column is still NVARCHAR, and then compare both after changing to UTF-8.

Resources

4 thoughts on “Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?”

  1. As someone who has dedicated hours of personal research into understanding the differences between Unicode and UTF-8 in general, and with Sql Server in particular, your opening paragraphs are fantastically clear and well explained. Much appreciated. Working with federal data that is English only, we too have been plagued with the issue of 99% of strings are varchar but occasionally some esoteric Hawaiian characters not covered by the Latin collation show up. Here’s hoping the Sql Server team addresses the issues you found and gets this right before final release.


  2. It’s worth mentioning that ISO has some conventions where their encodings, such as metric units, have to use the basic Latin alphabet and a set of punctuation marks, which are roughly UTF-8. Unicode is required to support this little subset in all of its languages, so that standard encodings can be used anywhere on earth.


    1. Hi Joe. I think you are misstating / misunderstanding what is going on here. I thought I had pointed out how the 1-byte range of UTF-8 (which is really only half of what 1 byte can represent) is exactly the same as 7-bit US-ASCII. It seems that I did not include that info before, but I have updated in a few places to be clearer about that association. That range, especially values 32 – 126 (see the chart just above the “Setting and Storing UTF-8 Values” section), are common across most code pages / character sets. That commonality across code pages is what made those particular characters become the only allowable characters across many standards and RFCs, many of which were published prior to the existence of UTF-8, which came out circa 1996. For example, the RFC for SMTP, published in 1982, was updated in 2008 and still refers to US-ASCII: RFC for SMTP (search for “ascii”).

      One of the design requirements of UTF-8 was that it work seamlessly on existing systems (UNIX in particular), without needing to convert any existing text files. This is why characters encoded using just the first 7 bits of UTF-8 match exactly to the 7-bit US-ASCII encoding. So it’s not that the subset is “roughly UTF-8”, since UTF-8 can encode the entire 1.1+ million character range of Unicode; it’s that a subset of UTF-8 is exactly standard ASCII.

      Also, Unicode doesn’t have languages; it has encodings. And there are only three (mainly): UTF-8, UTF-16, and UTF-32. But, due to endianness, both UTF-16 and UTF-32 have two different physical representations, making it 5 encodings in practice. BUT, it is only UTF-8 that has the ability to encode that subset of characters to be physically the same as 7-bit standard ASCII. For example, Latin Upper-case “A” is 0x41 in both US-ASCII and UTF-8. Yet it is either 0x0041 or 0x4100 in UTF-16, depending on endianness. While the non-zero byte of UTF-16 is the same as the only byte in UTF-8 and US-ASCII, it’s still 2 bytes vs 1 byte, and that makes a big difference when parsing a series of bytes.

      Finally, it is not the encodings that are standard (since there are 5 of them). It is the character set of Unicode that is truly standard, as it is the same across all 5 encodings. For example, “Upside-Down Face” is Code Point U+1F643 across all five encodings (see here – scroll down to the “Encodings” section).

