Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?


(last updated: 2019-04-08 @ 02:00 EST / 2019-04-08 @ 06:00 UTC )

(NOTE: For recent update, please see the “Update for CTP 2.2” section)
(NOTE: For recent update, please see the “Real Purpose” section)
(NOTE: For recent update, please see the “Update for CTP 2.3” section)
(NOTE: For recent update, please see the “Update for CTP 2.4” section)
(NOTE: For a list of known issues and their current status, please see the “Bugs / Restrictions” section)

 

For some time now, many of us have struggled with data that is mostly standard US-English / ASCII characters but also needs to allow for the occasional non-ASCII character. VARCHAR is typically one byte per character but can’t represent a wide range of accented characters all at the same time (more than would be found on a single 8-bit Code Page). NVARCHAR, being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes. When we have data that is most often standard ASCII, but has the potential (whether it ever happens or not) to have non-ASCII characters (names, URLs, etc), we have no choice but to use NVARCHAR. However, then we are wasting space for all of the data that could fit into VARCHAR and take up half as much space. While some claim that “disk is cheap”, wasting space negatively impacts query performance, backup size, backup and restore times, etc.

One approach is to have two columns for the data, one for each datatype. The idea is to store values that are 100% ASCII in the VARCHAR column, and anything else in the NVARCHAR. Then you can have a computed column return the non-NULL column for that data. While this approach does work, it is not something you are going to implement in 50 or more columns.
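
As a rough sketch (the table and column names here are hypothetical), that pattern might look like this:

CREATE TABLE dbo.Customers
(
    CustomerID   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Name_Ascii   VARCHAR(100) NULL,   -- populated only when the value is 100% ASCII
    Name_Unicode NVARCHAR(100) NULL,  -- populated only when the value has any non-ASCII characters
    -- callers read from this computed column and never touch the two base columns:
    [Name]       AS (COALESCE(Name_Unicode, CONVERT(NVARCHAR(100), Name_Ascii)))
);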

So, thankfully, Microsoft provided a solution to this problem.

And it is…

(drum-roll please)…

SQLCLR (introduced in SQL Server 2005). This allows one to use .NET to access GZip/Deflate functionality such that you can compress data into VARBINARY(MAX) on the way in, and uncompress back to NVARCHAR(MAX) on the way out. And you can even get pre-done versions of these — Util_GZip and Util_GUnzip —in the Free version of SQL# :-). Of course, some folks don’t like SQLCLR (for some reason), and you need to decompress in order to view / search / filter on the value, and this might be a bit much to do on many columns (though not as bad as having both VARCHAR and NVARCHAR versions of the column).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Data Compression (introduced in SQL Server 2008). But this was a bit simplistic, only available in Enterprise Edition, and didn’t work for off-row values (i.e. row overflow and MAX values that can’t fit in-row).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Unicode Compression (introduced in SQL Server 2008 R2, and implemented as part of Data Compression). But this was still only available in Enterprise Edition, and still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.
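
For reference, Unicode Compression is not enabled on its own; it kicks in automatically for eligible NCHAR / NVARCHAR (non-MAX) columns once ROW or PAGE compression is enabled (the table and index names below are hypothetical):

-- Enable ROW compression (which includes Unicode Compression) for a table:
ALTER TABLE dbo.SomeTable REBUILD WITH (DATA_COMPRESSION = ROW);

-- Or for a specific index, using PAGE compression (which includes ROW compression):
ALTER INDEX [IX_SomeTable_SomeColumn] ON dbo.SomeTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);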

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Data Compression for all editions, not just Enterprise (as of SQL Server 2016 SP1). But, this still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

COMPRESS and DECOMPRESS built-in functions (introduced in SQL Server 2016). You can use these to Gzip and Ungzip values (without resorting to SQLCLR), and store the much smaller VARBINARY value. Do this in stored procedures to be transparent to the caller. But this is not for everyone, and this might be a bit much to do on many columns.
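
For example, a quick round-trip (the sample value is arbitrary):

DECLARE @Original NVARCHAR(MAX) =
            REPLICATE(CONVERT(NVARCHAR(MAX), N'mostly ASCII data '), 1000);
DECLARE @Compressed VARBINARY(MAX) = COMPRESS(@Original);

SELECT DATALENGTH(@Original)   AS [OriginalBytes],   -- 36,000
       DATALENGTH(@Compressed) AS [CompressedBytes], -- far fewer, since the value is repetitive
       -- DECOMPRESS returns VARBINARY(MAX), so CAST back to the original type:
       CAST(DECOMPRESS(@Compressed) AS NVARCHAR(MAX)) AS [RoundTripped];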

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Support for MAX types in Clustered Columnstore Indexes (as of SQL Server 2017). But, this is not for every scenario, especially because it affects the whole table.
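
For example (hypothetical table name; LOB columns in Clustered Columnstore Indexes require SQL Server 2017 or newer):

CREATE TABLE dbo.Notes
(
    NoteID   INT IDENTITY(1, 1) NOT NULL,
    NoteText NVARCHAR(MAX) NULL
);

-- The NVARCHAR(MAX) column is allowed here as of SQL Server 2017:
CREATE CLUSTERED COLUMNSTORE INDEX [CCI_Notes] ON dbo.Notes;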

So, thankfully, Microsoft provided a solution to this problem.

And it is…

UTF-8 Encodings (introduced in the upcoming SQL Server 2019). UTF-8 appears to be the ideal solution: standard ASCII characters take up only 1 byte, all Unicode characters are supported, the value is still a string (not binary) so it can be searched / filtered on / indexed / etc, and it is the preferred encoding for HTML and XML files. No need for an extra column plus a computed column, no need for SQLCLR or COMPRESS, no binary values, and no getting frustrated that Unicode Compression doesn’t work with NVARCHAR(MAX). What’s not to love?

BUT WAIT! What if the UTF-8 collations weren’t added to solve this problem? What if they were added to solve a different problem, and saving space was just an added benefit? Would that change anything?

Let’s explore…

TL;DR: If you are not absolutely certain that you should be using the UTF-8 collations, then do not use them!

While interesting, the new UTF-8 Collations only truly solve a rather narrow problem (or possibly two narrow problems), and are currently too buggy to use with confidence, especially as a database’s, or instance’s, default Collation. For the purpose of compression, these encodings really only make sense to use with NVARCHAR(MAX) data, if the values are mostly ASCII characters, and especially if the values are stored off-row. Otherwise, you are better off using Data Compression, or possibly Clustered Columnstore Indexes.
 

Then why does this feature even exist? Why devote development resources to it? What problem(s) does it solve? It wasn’t needed for importing or exporting data since that was resolved two versions ago (in SQL Server 2016, or possibly even SQL Server 2014 SP2). And I certainly hope it wasn’t to save space (compared to NVARCHAR / UTF-16) since that isn’t what UTF-8 was designed for (or does, generally speaking). The reason why UTF-8 (as an encoding, irrespective of SQL Server) was created was to address compatibility (with existing ASCII-based systems) 1, not efficiency (of space or speed) 2. In fact, compatibility was achieved at the expense of efficiency, given that most characters take up more space in UTF-8 than they do in UTF-16. But, that was an acceptable trade-off for its intended use: allowing existing OSes (primarily Unix but soon also Windows) to be Unicode-capable while not breaking any existing functionality. UTF-8 sometimes saving space over UTF-16 (but only for 128 characters, mind you) is a side-effect, not a design goal. Which means that it’s good to know of the option, but it shouldn’t be used as a selling point for this feature (yet so far it’s the main thing mentioned regarding this feature).
 

HOWEVER, info provided three months after CTP 2.0 was released indicates that the real purpose of the UTF-8 collations actually is compatibility after all. UTF-8 collations will allow apps working with VARCHAR data and datatypes to become Unicode capable with relatively few changes to the app code. This would also benefit projects that work with other database systems that already support UTF-8 in VARCHAR columns, and without requiring the upper-case “N” prefix on literals. This is an understandable goal, and is at least in keeping with the purpose of UTF-8 (i.e. compatibility). But, they can be a bit clunky to work with due to a few outstanding bugs, and the fact that so much of the system assumes VARCHAR data to be one byte per character, while most characters are 3 bytes in UTF-8.
 

The major issues with the UTF-8 collations are:

  1. the primary problem they are trying to solve (i.e. enable Unicode in existing systems currently using VARCHAR without much ability to change to NVARCHAR) most likely affects only a small portion of customers,
  2. the secondary benefit (i.e. saving space) only happens under specific conditions; otherwise, it can cause the data to expand in size,
  3. there is typically a slight performance hit when using UTF-8 collations,
  4. extensive testing is required because only the first 128 characters are 1 byte each, so it will be easy to miscalculate max size needed for variables and columns, and not having enough space leads to silent truncation (i.e. data loss),
  5. none of the three potential down-sides (performance hit, data bloat, or data loss) are mentioned in the documentation,
  6. most people still don’t have a good understanding of how encodings and collations work (even Microsoft talks about it incorrectly, in different ways, in each of the three documents where UTF-8 support is mentioned!)

All of that together — very few instances where UTF-8 collations should be used, several potential consequences that are not documented anywhere outside of this blog post, misleading information from Microsoft, and a general lack of understanding of how these things actually work — indicates (to me, at least) that it is far more likely that the UTF-8 collations will be used when they shouldn’t be, rather than when they should be.
 

Considering that the probability of failure is higher than the probability of success, I believe the better approach would be:

  1. for compatibility: have something like a database-scoped configuration option to treat all strings as NVARCHAR. This would make it completely transparent to app code, and not require internal UTF-8 to UTF-16 conversions. It’s not a fool-proof plan, but clearly neither are the UTF-8 collations. At the very least, have a startup-only trace flag that determines if the UTF-8 collations are available or not, and hidden by default.
  2. for compression: get Unicode Compression working for NVARCHAR(MAX). This wouldn’t be limited to only helping ASCII characters. I don’t know what technical issues are getting in the way of implementing Unicode Compression for NVARCHAR(MAX) , but given how much will break (or will at least be clunky) in making UTF-8 work, I have a hard time believing that this far better alternative would have been any worse.

This assessment is nothing against UTF-8. It’s certainly the best option in many cases, especially at the OS level where there’s no opportunity for Unicode Compression. Nor is this to be taken as a criticism of the SQL Server team responsible for producing this feature. I think they do great work, and I love working with SQL Server. But, I think this particular feature could have used more research and planning, and as is the case for many features, would benefit from a lot more technical guidance, since it is far from self-explanatory.

BOTTOM LINE: do not use the UTF-8 collations unless your project fits into one of these two scenarios:

  1. You have a lot of data that is mostly standard ASCII (values 0 – 127), is stored in NVARCHAR(MAX) columns, and the table is not a good candidate for having a Clustered Columnstore Index.
  2. You either have legacy code that needs to support Unicode, or code that works with other database systems that support UTF-8, and you need to continue using VARCHAR datatypes and string literals that do not have an upper-case “N” prefix.

 

Basics

Documentation

Let’s start with what we are told about this new feature. According to the documentation, the new UTF-8 Collations:

  1. can be used …
    1. as a database-level default Collation
    2. as a column-level Collation
    3. by appending “_UTF8” to the end of any Supplementary Character-Aware Collation (i.e. either having “_SC” in their name, OR being of level 140 or newer and not being binary)
    4. with only the CHAR and VARCHAR datatypes
  2. (implied) have no effect on NCHAR and NVARCHAR data (meaning: for these types, the UTF-8 Collations behave the same as their non-UTF-8 equivalents)
  3. “This feature may provide significant storage savings, depending on the character set in use.” (emphasis mine)

Installing

When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is still “SQL_Latin1_General_CP1_CI_AS”. Why, why, WHY???? Why not use “Latin1_General_100_CI_AS_SC”, which has the same sensitivities but is newer, has sort weights for more characters, supports Supplementary Characters, and is a Windows Collation (hence it doesn’t prevent seeks on indexed VARCHAR columns when compared to Unicode data)? Sure, changing the default Collation after all these years will result in some headaches when needing to work with existing systems (that didn’t bother to change the default Collation), but keeping it ensures that new projects started by folks who don’t know to change it will spread it farther and farther throughout the galaxy, tightening its stranglehold on us (i.e. making it even harder to move away from). But that has nothing to do with UTF-8, so we can ignore it for now (<grrrrr>, but please vote for Make Latin1_General_(100)_CI_AS the default collation for US English).

On the “Customize Database Engine Collation” screen, when selecting a Windows Collation:

  1. Collations without a version number do not allow checking either the “Supplementary characters” or “Variation selector-sensitive” options.
  2. Collations with version numbers “90” or “100” allow for checking the “Supplementary characters” option, but not the “Variation selector-sensitive” option.
  3. Collations with version number “140” have the “Supplementary characters” option checked (and cannot be unchecked), and allow for checking the “Variation selector-sensitive” option.
  4. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) There is no means of selecting a UTF-8 Collation, even if the “Supplementary characters” option is checked.

Using

By executing the queries below, we can confirm that adding _UTF8 to the end of Supplementary Character-Aware Collations does produce valid Collation names. I executed the following in [master] (which is guaranteed to not be using a UTF-8 Collation as its default Collation since there was no way to install SQL Server with one):

-- Version 90 (SQL 2005; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Japanese_90_CS_AS_SC_UTF8;

-- Version 100 (SQL 2008; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Version 140 (SQL 2017; no _SC version since it is implicitly supported)
SELECT 'a' COLLATE Japanese_XJIS_140_CS_AS_UTF8;

All of the above queries returned successfully.

The following query will list all of the UTF-8 Collations:

SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
ORDER BY col.[name];
-- 1552 rows, all with CodePage = 65001

There are 5507 total Collations in SQL Server 2019, which is 1552 more than the 3955 that exist in SQL Server 2017. Hence the only new Collations are these “_UTF8” Collations.

If you execute the query above, do you notice anything missing? That’s right: no binary options for the “_UTF8” Collations (this is a mistake that needs to be addressed).
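
If you want to check for yourself, a variation of the query above (as of the CTP being tested here) comes back empty:

-- Binary collations have "_BIN" or "_BIN2" in their names;
-- none of the "_UTF8" Collations do (as of this CTP):
SELECT col.[name]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
AND    col.[name] LIKE N'%[_]BIN%';
-- 0 rows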

The following query shows that we can indeed use VARCHAR to store Unicode characters, but only when using one of the new UTF-8 Collations:

SELECT NCHAR(27581), -- 殽
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC), -- ?
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC_UTF8); -- 殽

The following query confirms that the new UTF-8 Collations cannot be used with the TEXT datatype:

SELECT CONVERT(TEXT, N'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8);

returns:

Msg 4189, Level 16, State 0, Line XXXXX
Cannot convert to text/ntext or collate to 'Latin1_General_100_CI_AS_SC_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.

The following query shows that the NVARCHAR value is the same between the UTF-8 and non-UTF-8 Collation, and the same character can be held in VARCHAR, but only when using a UTF-8 Collation, and as a different underlying byte sequence:

DECLARE @Sample NVARCHAR(1) = NCHAR(0x5010);
SELECT @Sample; -- 倐

SELECT CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- 0x1050
       CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 0x1050

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- ?

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 倐
       CONVERT(VARBINARY(3), CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)); -- 0xE58090

And, the following demonstrates that a UTF-8 Collation can be used at the database level:

CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;

USE [UTF8];

The USE statement is there because the remaining examples that follow in this post will be executed from within the “UTF8” database.

General UTF-8 Info

UTF-16 (for context)

UTF-16, which is the Unicode encoding used for NVARCHAR data, uses 16-bit (i.e. 2-byte) values, known as “code units”, to represent characters. These code units are used either individually (one 2-byte code unit) or in pairs (two 2-byte code units, for Supplementary Characters) to map to individual “code points”. UCS-2, which only maps the BMP characters (i.e. the first 65,536 code points), is a fixed-length encoding because it only deals with single code units. UTF-16, which uses the exact same code units as UCS-2, can use certain combinations of two of those code units (called Surrogate Pairs) to map the non-BMP characters (i.e. code point values 65,536 and above, known as Supplementary Characters).

Because these values are two bytes, they are subject to “endianness”, which refers to the order in which the individual bytes that make up the value are stored internally (in memory, on disk, etc). The hardware determines the “endianness” of how multi-byte values (including INT, SMALLINT, BIGINT) are stored, and the two possibilities are “Big Endian”, which is the given / natural ordering (i.e. value 0x1234 — where 0x12 is byte #1, and 0x34 is byte #2 — is stored as 0x1234), or “Little Endian”, which refers to storing the bytes in reverse-order (i.e. value 0x1234 is stored as 0x3412).

Unicode strings in Windows, “string” in .NET, and NVARCHAR in SQL Server are all UTF-16 Little Endian.
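
A quick way to see the byte order is to look at the underlying bytes of NVARCHAR values (the characters used here are arbitrary):

-- N'a' is code point U+0061, but stored little endian the two bytes come back reversed:
SELECT CONVERT(VARBINARY(2), N'a');  -- 0x6100

-- A Supplementary Character (here U+1D11E, MUSICAL SYMBOL G CLEF) is a Surrogate Pair:
-- two 2-byte code units (0xD834 + 0xDD1E), each stored little endian:
SELECT CONVERT(VARBINARY(4), N'𝄞');  -- 0x34D81EDD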

UTF-8

UTF-8 is another Unicode encoding which uses between one and four 8-bit (i.e. 1-byte) code units to represent characters. Even though it is Unicode, due to working 8-bits at a time, this encoding is considered VARCHAR data. This is also a variable-length encoding because it requires anywhere from 1 to 4 code units to map to any given code point. Due to these code units being single bytes (even in combinations of 2, 3, or 4), there is no concept of “endianness”.

One main advantage of UTF-8 is that it’s fully compatible with any existing data that’s comprised entirely of 7-bit ASCII characters (values 0 – 127). This is by design, but not really a reason for SQL Server to use it for storing data internally. The other main advantage of UTF-8 is that it can take up half the space while still allowing for the full range of Unicode characters. This would be a reason for SQL Server to use it internally. BUT, it only takes up half the space for a small subset of characters. Of course, if the characters in this small 1-byte subset make up the vast majority of the characters being used, then this encoding does provide significant space savings (in terms of disk space, memory, and reduced network traffic). What people tend to overlook, though, is that most Unicode characters require 3 bytes, which is 1 more byte than UTF-16 for those same characters.

You can execute the following query in SQL Server 2019 to see how many bytes each character requires for both UTF-8 and UTF-16 encodings. It returns all 65,536 BMP (Basic Multilingual Plane) characters (which is also the entire UCS-2 character set), and 3 Supplementary Characters. Since all Supplementary Characters are 4 bytes in both encodings, there is no need to return more of them, but we do need to see a few of them to see that they are a) all 4 bytes, and b) encoded slightly differently. The query will work the same regardless of whether or not the current database uses a “_UTF8” Collation. It also does not matter if the database’s default Collation is Supplementary Character-Aware or not.

;WITH nums([CodePoint]) AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1)
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
), chars AS
(
  SELECT nums.[CodePoint],
         CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   nums
  UNION ALL
  SELECT tmp.val,
         CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex)
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5),
              tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   (VALUES (65536, 0x00D800DC), -- Linear B Syllable B008 A (U+10000)
                 (67618, 0x02D822DC), -- Cypriot Syllable Pu (U+10822)
                 (129384,0x3ED868DD)  -- Pretzel (U+1F968)
         ) tmp(val, hex)
)
SELECT chr.[CodePoint],
       COALESCE(chr.[TheChar], N'TOTALS:') AS [Character],
       chr.[UTF8] AS [UTF8_Hex],
       DATALENGTH(chr.[UTF8]) AS [UTF8_Bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 1 THEN 'x' END) AS [1-byte],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 2 THEN 'x' END) AS [2-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 3 THEN 'x' END) AS [3-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 4 THEN 'x' END) AS [4-bytes],
      ---
      CONVERT(VARBINARY(4), CONVERT(NVARCHAR(3), chr.[TheChar]))
         AS [UTF16(LE)_Hex],
      DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16_Bytes],
      ---
      ((DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar]))) -
            (DATALENGTH(chr.[TheChar]))) AS [UTF8savingsOverUTF16]
FROM   chars chr
GROUP BY ROLLUP ((chr.[CodePoint], chr.[TheChar], chr.[UTF8]));

There isn’t enough room here to show all 65,538 rows, so here is a summary of the info:

Code Points      Code Point Quantity   UTF-8 Size (bytes)   UTF-16 Size (bytes)   UTF-8 Savings
0 – 127          128                   1                    2                      1
128 – 2047       1920                  2                    2                      0
2048 – 65535     63488                 3                    2                     -1
65536 and up     lots                  4                    4                      0

As you can see, only the first 128 Code Points offer any size savings over UTF-16. The characters in that range are (minus the first 32 which are control characters, and the last character):

[space] ! " # $ % & ' (
) * + , - . / 0 – 9 :
; < = > ? @ A – Z [ \
] ^ _ ` a – z { | } ~

Those characters, and their values (32 – 126), are exactly the same as the standard ASCII character set, and are the same across most code pages (certainly all code pages supported by SQL Server). Any characters not in the chart above do not offer any space savings, with most of the BMP characters actually taking up more space than they would if using UTF-16 (i.e. NVARCHAR ).
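
A quick DATALENGTH comparison makes both the savings and the expansion visible (this assumes SQL Server 2019 for the “_UTF8” Collation; the sample strings are arbitrary):

SELECT DATALENGTH(CONVERT(VARCHAR(40), 'Hello!'
           COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8_AllAscii],    -- 6 bytes
       DATALENGTH(CONVERT(NVARCHAR(40), N'Hello!'))   AS [UTF16_AllAscii],   -- 12 bytes
       DATALENGTH(CONVERT(VARCHAR(40), N'こんにちは'
           COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8_Japanese],    -- 15 bytes (3 per character)
       DATALENGTH(CONVERT(NVARCHAR(40), N'こんにちは')) AS [UTF16_Japanese]; -- 10 bytes (2 per character)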

Setting and Storing UTF-8 Values

An encoding is how the characters are represented internally. Whether the source is a non-Unicode string, a Unicode string, or a binary string (i.e. series of bytes / hex bytes), the end result will be the chosen encoding’s value for that character.

First, let’s try setting a variable, in the “[UTF8]” database, from a VARBINARY literal:

DECLARE @HotDog VARCHAR(4) = 0xF09F8CAD;
PRINT @HotDog; -- ( U+1F32D )

Returns:   🌭 (in the “Messages” tab)

As you can see, it produces the expected character from the given 4 bytes. However, those same 4 bytes do not produce the correct character when stored as NVARCHAR , which expects UTF-16 Little Endian bytes sequences:

DECLARE @NotHotDog NVARCHAR(4) = 0xF09F8CAD;
PRINT @NotHotDog; -- ( U+1F32D )
-- 鿰권

Now let’s look at how the size of the (N)VARCHAR types impact what can be stored in them:

SELECT CONVERT( VARCHAR(3), 0xF09F8CAD) AS [TooSmallForUTF8-HotDog],
     DATALENGTH(CONVERT(VARCHAR(3), 0xF09F8CAD)) AS [ByteCountForTooSmall],
       CONVERT( VARCHAR(4), 0xF09F8CAD) AS [UTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(4), 0xF09F8CAD)) AS [ByteCountForHotDog],
       ------
       CONVERT(NVARCHAR(1), 0x3CD82DDF) AS [TooSmallForUTF16-HotDog],
       CONVERT(NVARCHAR(2), 0x3CD82DDF) AS [UTF16-HotDog];

Returns:

TooSmallForUTF8-HotDog | ByteCountForTooSmall | UTF8-HotDog | ByteCountForHotDog | TooSmallForUTF16-HotDog | UTF16-HotDog
                       | 0                    | 🌭          | 4                  | �                       | 🌭

The first column appears to return nothing, not even a partial, incorrect / unknown character. The second column confirms that there is absolutely nothing resulting from attempting to store a 4-byte UTF-8 value in a VARCHAR type that is only 3 bytes. It couldn’t even take just the first 3 bytes of the 4-byte value. The third and fourth columns show that giving the type the required 4 bytes allows everything to work as expected.

On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR type that is too small, the result is the default Replacement Character ( U+FFFD ). This is because it was able to store the first half of the Surrogate Pair (the 0x3CD8), which is meaningless on its own, but it is still a valid byte sequence.

Size and Performance

{ This section is taking a while as I figure out how to best present the variety of tests and their results, but the general info is all there. }

Data Profiles

  1. Some ASCII (values 1 – 254), but mostly non-ASCII
    • [N]VARCHAR(500)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
  2. Some ASCII MAX
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
       
  3. All ASCII
    • [N]VARCHAR(500)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
  4. All ASCII MAX (all in row)
    • [N]VARCHAR(MAX)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
       
  5. All ASCII MAX (10k characters per row; all off-row)
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 10,007; AVG = 10,025; MAX = 10,061
  6. All ASCII MAX (5k characters per row; mixed in / off-row; UTF-8 = in-row, NVARCHAR = off-row)
    • [N]VARCHAR(MAX)
    • 21,078 rows
    • Characters per row: MIN = 5,007; AVG = 5,025; MAX = 5,061

Tests:

These were executed in a database which has a “_UTF8” default Collation.

SELECT REVERSE([Value])
FROM dbo.[{table}];


SELECT SUBSTRING([Value], 5, 13)
FROM dbo.[{table}];


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] = 'abcdef12345'; -- UTF16 test uses N'abcdef12345'


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] LIKE '%123%'; -- UTF16 test uses N'%123%'


SELECT MIN(LEN([Value])), AVG(LEN([Value])), MAX(LEN([Value]))
FROM   dbo.[{table}];


SELECT [Value]
FROM   dbo.[{table}];

Each test was wrapped in:

SET STATISTICS TIME ON;
...{test}...
SET STATISTICS TIME OFF;

Results

  1. Some ASCII in non-MAX types
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated better as well, but not tested.
    • Did not test performance given no space savings
  2. Some ASCII in MAX types (in-row)
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated the same, but not tested.
    • Did not test performance given no space savings
       
  3. All ASCII in non-MAX types
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR)
    • UTF-8 is only slightly smaller here when Compression = ROW or PAGE (approx. 1% smaller)
    • NVARCHAR is slightly smaller here when table is COLUMNSTORE (approx. 1% smaller)
    • Performance of NVARCHAR is same or better than UTF-8
  4. All ASCII in MAX types (in-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR; 40% for ROW and PAGE)
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 9% smaller)
    • Performance is sometimes the same, but NVARCHAR is usually better than UTF-8
  5. All ASCII in MAX types (10k; off-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 46% smaller)
    • ROW and PAGE compression do not apply to off-row data
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 1.3% smaller). The NVARCHAR table is still 98% smaller than it was originally.
    • With no compression, NVARCHAR is faster for all tests.
    • When table is CLUSTERED COLUMNSTORE, NVARCHAR is faster for all tests except “SUBSTRING” and “=“.
  6. All ASCII in MAX types (5k; mixed on / off-row: VARCHAR = on, NVARCHAR = off)
    • UTF-8 is smaller here when Compression = NONE, ROW, or PAGE (approx. 25% smaller)
    • UTF-8 is much smaller here when table is CLUSTERED COLUMNSTORE (approx. 92% smaller). BUT, this is typically a non-issue since the NVARCHAR table is still 96% smaller than it was originally. This becomes an issue as the uncompressed data gets closer to 100 GB.
    • With no compression, NVARCHAR is faster for all tests except “LIKE ‘%x%'”.
    • When table is CLUSTERED COLUMNSTORE, UTF-8 is faster for all tests except “REVERSE” (elapsed time was the same, but CPU was better)

Typically there is at least a slight performance hit when using UTF-8. I believe the issue is due to SQL Server still using UTF-16 to process most (but not all) operations internally. This means that there is a lot of converting back and forth between UTF-8 and UTF-16.

The area of biggest improvement of UTF-8 over NVARCHAR is when there is between 4001 and 8000 bytes of mostly standard ASCII characters. In this scenario, you would need to use NVARCHAR(MAX) to get over 4000 bytes, but then you lose the ability to use Unicode Compression. And, being over 4000 bytes, the NVARCHAR data must be stored off-row (a slight performance hit), while that same data can be stored in-row when using UTF-8. This is the only scenario where UTF-8 is better for both space and performance.
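
Below is a rough sketch (hypothetical table names) of how one might verify where the data lands for this 4001 – 8000 byte scenario; it assumes the default setting of “large value types out of row” being OFF:

CREATE TABLE dbo.Test_UTF16
    ([Value] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC NULL);
CREATE TABLE dbo.Test_UTF8
    ([Value] VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NULL);

-- 5000 ASCII characters: 10,000 bytes as UTF-16 (too large to stay in-row),
-- but only 5,000 bytes as UTF-8 (fits in-row by default):
INSERT INTO dbo.Test_UTF16 ([Value])
    VALUES (REPLICATE(CONVERT(NVARCHAR(MAX), N'1234567890'), 500));
INSERT INTO dbo.Test_UTF8 ([Value])
    VALUES (REPLICATE(CONVERT(VARCHAR(MAX), '1234567890'), 500));

-- LOB_DATA pages should only show up for the NVARCHAR table
-- (this scans the whole current database, which is fine for a quick test):
SELECT OBJECT_NAME(ips.[object_id]) AS [TableName],
       ips.[alloc_unit_type_desc],
       ips.[page_count]
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') ips
WHERE  ips.[object_id] IN (OBJECT_ID(N'dbo.Test_UTF16'), OBJECT_ID(N'dbo.Test_UTF8'));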

Limitations

Known (i.e. Stated in Documentation)

The UTF8 Collations do not work with:

  1. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.2” section) Replication
  2. Linked Servers
  3. In-memory OLTP   (NOTE: For clarification based on recent testing, please see “Update for CTP 2.1” section)
  4. External Table for Polybase ( Two collation problems with external tables in SQL 2019 )
  5. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations cannot be used as the Instance-level Collation (** please see next section for the “EXCEPT, …”):
    • Currently wording is that they can only be used at the database and column level, but yes, they can also be used inline via the COLLATE keyword.
    • Not an option via the installer (as mentioned at the top of this post)
    • Cannot be specified via:
      SETUP.EXE /QUIET /ACTION=REBUILDDATABASE ^
      /SQLCOLLATION=Latin1_General_100_CI_AS_SC_UTF8 ^
      ...

      The result is:
      The following error occurred:
      The collation Latin1_General_100_CI_AS_SC_UTF8 was not found.
       
      Error result: -2061893630
      Result facility code: 1306
      Result error code: 2

Unknown (i.e. Oops)

  1. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations can be used as the Instance-level Collation. This cannot be set via the installer or running SETUP.EXE with the /SQLCOLLATION switch, BUT they can be used via the undocumented method (of changing all Collations for an Instance):
    sqlservr -c -m -T4022 -T3659 -s"{instance_name}" ^
    -q"Latin1_General_100_CI_AS_SC_UTF8"`

    Yes, this does work. I have tested it. For more details on this method, please see:
    Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
    PLEASE NOTE: Just because you can do this doesn’t mean that you should do this. Given that the only way of setting an Instance to use a UTF8 Collation is via an undocumented method, it is quite possible that this has never been tested internally at Microsoft and is not guaranteed to work. It might work perfectly, or there could be unexpected behaviors. Use at your own risk.
     
  2. There are no binary Collations that support the UTF8 encoding!!! Why not? But why do we need them? Here are some reasons:
    1. Microsoft’s implementation of Unicode has incomplete, and some incorrect, sort weights. One example:
      Choosing a binary collation that can differentiate between ‘ss’ and ‘ß’ for nvarchar column in Sql Server
    2. When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from this post provides a working example: No, Binary Collations are not Case-Sensitive
    3. When storing alphanumeric codes (e.g. airline confirmation codes, phone numbers, postal codes, etc), binary sorting and comparison is much faster since it ignores all linguistic rules, which is perfect for scenarios such as these where there is no expectation of applying such rules.
    4. Finding / removing “(null)” character U+0000 / CHAR(0)
       
  3. Bugs:
    1. NVARCHAR to VARCHAR(MAX) can get Msg 8152 “String or binary data would be truncated” — FIXED IN CTP 2.1
    2. Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value — FIXED IN CTP 2.1
    3. Inconsistent handling of invalid UTF-8 byte sequences
    4. CHAR() function broken in databases with “_UTF8” default Collation (not a bug after all; just needs better documentation)
    5. An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2″ (disconnect) — caused by NULL UTF-8 values — FIXED IN CTP 2.1
    6. Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR
    7. Please see consolidated issue list towards the bottom.

Recommended Uses / Guidance

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a “_UTF8” encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

  1. Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page)
  2. Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won’t fit into NVARCHAR(4000))
  3. There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR)
  4. Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes
  5. There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of “=”) or manipulated.
  6. The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.

Conclusion

False Prophet! Adding support for UTF-8 is certainly interesting. However, being the preferred encoding for the web, and even the default encoding for Linux / Unix (at least some flavors), doesn’t imply that UTF-8 is truly useful within SQL Server. UTF-8 is a great choice for scenarios where there is:

  1. a need to be fully compatible with standard ASCII (to work with legacy systems / code)
    and / or
    a desire to at least save space when most data is standard ASCII anyway,
  2. a need / desire to support all characters (especially without needing to change Code Pages), and
  3. no ability to use Unicode Compression.

So, within the context of SQL Server, UTF-8 has the potential for helping a very specific, but not unimportant, use case: NVARCHAR(MAX) columns, where Unicode Compression doesn’t work. Still, I think the development / testing time spent on this would have been much better applied to something more useful to more users, such as the following (the first two of which would do a better job of accomplishing the space-saving goal of UTF-8, and without sacrificing performance):

  • getting Unicode Compression (part of Data Compression) to work on at least in-row values of NVARCHAR(MAX), if not also off-row values (please vote for: “unicode compression nvarchar(max)” — this would save more space than UTF-8)
  • getting Data Compression to work for off-row LOB / Row-Overflow values (please vote for: “Page Compression for Out-of-Row LOB Storage” — this would save space for more than just Unicode strings: it would also apply to VARCHAR, VARBINARY, and XML)
  • getting mixed alphanumeric values to sort as expected (i.e. “natural sorting” — “21a” sorts after “3a” instead of before — please vote for: “Support "natural sorting" / DIGITSASNUMBERS as a Collation option“)

Given that it is not fun to change Collations for databases, I would strongly recommend against using the new UTF-8 Collations, at least not at the database level, and definitely not in Production. With the obviously missing binary versions, and 5 bugs found in only a couple of days of testing for this post (not testing a full, complex application), this feature is incomplete, and there might yet be more bugs / odd behavior. I would suggest, if it looks like it will take quite a bit of development time to fix these six issues, then perhaps just remove this feature entirely (no big loss as it provides very little, if any, benefit) and spend that time on one of the three items noted directly above (think of how many customers / projects would benefit from Data Compression working for off-row LOB / overflow data).

If you are going to use a UTF-8 Collation on one or more columns, make no assumptions about size reduction or better performance! Be sure to capture current size and performance stats while the column is still NVARCHAR, and then compare both after changing to UTF-8.

Resources


 

UPDATE 2018-11-27 / CTP 2.1

CTP 2.1 was released in early November (updated “What’s new” info).

Quirks

Here are some issues that you might run into when using a UTF-8 Collation at either the Instance level or Database level. These are most likely not bugs, since the affected features have been deprecated for a while now. Still, it is possible that you have legacy systems and/or 3rd party applications (some that you cannot change) using these features.

  1. TEXT / NTEXT datatypes in temporary tables (when Instance-level Collation is “_UTF8”) and table variables (when Database-level Collation is “_UTF8”):
      CREATE TABLE [#D'oh] (col1 NTEXT);
      /*
      Msg 4188, Level 16, State 1, Line XXXXX
      Column or parameter 'col1' has type 'ntext' and collation
         'Japanese_XJIS_140_CI_AI_VSS_UTF8'. The legacy LOB types do not
         support UTF-8 or UTF-16 encodings. Use types varchar(max),
         nvarchar(max) or a collation which does not have the _SC or
         _UTF8 flags.
      */
      

    Of course, there is an easy work-around to these errors: for columns using either of these datatypes, use the COLLATE keyword to specify a non-UTF8 Collation (e.g. “COLLATE Latin1_General_100_CI_AS“).

  2. sys.fn_trace_gettable: The “TextData” column of the result set uses the NTEXT datatype. This should only be an issue when using a UTF-8 Collation at the Instance level.
  3. sys.fn_get_sql: The “text” column of the result set uses the TEXT datatype (which was wrong even before this function was deprecated since the datatype should have been NTEXT 😲 ):

      SELECT * FROM sys.fn_get_sql(0x03);
      /*
      Msg 4189, Level 16, State 0, Procedure fn_get_sql,
         Line XXXXX [Batch Start Line YYYYY]
      Cannot convert to text/ntext or collate to
         'Japanese_XJIS_140_CI_AI_VSS_UTF8' because these legacy LOB
         types do not support UTF-8 or UTF-16 encodings. Use types
         varchar(max), nvarchar(max) or a collation which does not
         have the _SC or _UTF8 flags.
      Msg 4413, Level 16, State 1, Line XXXXX
      Could not use view or function 'fn_get_sql' because of binding errors.
      */
      

    This should only be an issue when using a UTF-8 Collation at the Instance level.

You can use the following query to check your system to see if there are any columns using either the TEXT or NTEXT datatypes. It does not check stored procedure / trigger / function / SQL Agent job step code to see if these datatypes are being used in temporary tables, table variables, or permanent tables being created or altered, nor does it check T-SQL code submitted by anything outside of SQL Server (such as your app code). So be sure to test, test, test (which you were going to do anyway, right?).

Now, to be fair, these same quirks exist when using any Supplementary Character-Aware Collation (i.e. one with either “_SC” or “_140_” in its name) at the Instance or Database level.

Please note that the following query only reports objects for the database in which it is being executed (also, the two deprecated functions mentioned above will show up in all databases as they are global).

SELECT OBJECT_SCHEMA_NAME(ac.[object_id]) AS [SchemaName],
       ao.[name] AS [ObjectName],
       ao.[type_desc] AS [ObjectType],
       ac.[name] AS [ColumnName],
       ac.[system_type_id],
       ac.[collation_name]
FROM   sys.all_columns ac
INNER JOIN sys.all_objects ao
        ON ao.[object_id] = ac.[object_id]
WHERE  ac.[system_type_id] IN (35, 99); -- 35 = TEXT ; 99 = NTEXT

In-memory OLTP

The documentation states that the “_UTF8” Collations do not work with In-memory OLTP. We can easily confirm that via the following tests.

This:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);

and this:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC -- NOT UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
-- Success

ALTER TABLE dbo.[UTF8]
    ALTER COLUMN [Name]
    VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL;

both get the following error:

Msg 12356, Level 16, State 157, Line XXXXX
Comparison, sorting, and manipulation of character strings that use a UTF8 collation is not supported with memory optimized tables.

And this statement (same as the first CREATE statement directly above, but including an INDEX on the [Name] column):

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL INDEX [IX_UTF8_Name] NONCLUSTERED
)
WITH (MEMORY_OPTIMIZED = ON);

gets the following error:

Msg 12357, Level 16, State 158, Line XXXXX
Indexes on character columns that use a UTF8 collation are not supported with indexes on memory optimized tables.

And, attempting to change a Database’s default Collation to be a “_UTF8” Collation will error:

ALTER DATABASE [Hekaton] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

receives:

Msg 41335, Level 16, State 3, Line XXXXX
Modifying the collation of a database is not allowed when the database contains memory optimized tables or natively compiled modules.

(that error message is not entirely accurate since you can change the Collation to a non-UTF-8 Collation, though you still might need to drop natively compiled modules first.)

HOWEVER, using the “sqlservr -q” method of changing all Collations for an Instance allows us to force a UTF-8 Collation on a Database containing memory-optimized tables, and even on columns in those memory-optimized tables:

sqlservr.exe -c -m -T4022 -T3659 -s"{instance_name}" ^
-q"Latin1_General_100_CI_AS_SC_UTF8"

After this (unsupported) change, the Database generally works, but there are a few definite issues:

  1. Data corruption in VARCHAR columns for characters with values 128 and above (i.e. outside of the standard ASCII set that ends at value 127). This affects “durable” memory-optimized tables (i.e. DURABILITY = SCHEMA_AND_DATA). For example, if the original Collation is Korean_100_CI_AS, then the byte sequence of “0xB15A” will produce “켣”. If you then use this method to switch to any “_UTF8” Collation, you will end up seeing “�Z” because “0xB1” is an invalid byte sequence in UTF-8, and “0x5A” is valid and is the “Z” character.
  2. Indexes on memory-optimized tables will not be dropped and recreated, and so in many cases they will be out of order (i.e. in the wrong order). For example, switching from Latin1_General_100_CI_AS_SC to Latin1_General_100_CI_AS_SC_UTF8 (only difference is adding “_UTF8” to the end of the original Collation) would cause indexes to become out of order, but only if they have key columns that are VARCHAR with data containing characters of values 128 or higher. (INCLUDE columns that are VARCHAR will have corrupted data for characters of values 128 or higher, but this does not affect the order.)
  3. Related to out-of-order indexes: a simple SELECT works, even when adding in ORDER BY or GROUP BY. Adding a WHERE clause will also work, as long as no new rows have been added. Once a new row has been added, using a WHERE clause referencing an indexed string column will result in the following error:
    Msg 9100, Level 21, State 2, Line XXXX
    Possible index corruption detected. Run DBCC CHECKDB.

    And you will be disconnected. OR, after adding a row, you might be able to successfully use a WHERE clause, but the first time you use the ORDER BY clause you will get the following error:
    Msg 701, Level 17, State 153, Line XXXXX
    There is insufficient system memory in resource pool 'default' to run this query.

    After that, you will probably get the “Possible index corruption detected” error, even when using just the WHERE clause.

Keep in mind that indexes on memory-optimized tables cannot be rebuilt or even dropped. Fixing these issues requires dropping, recreating, and reloading any affected tables. Again, this (using UTF-8 Collations with memory-optimized tables) is not a supported configuration.


 

UPDATE 2018-12-26 / CTP 2.2

CTP 2.2 was released in early December (updated “What’s new” info).

Current Assessment

As of CTP 2.2, my overall assessment of this new feature has not changed: native UTF-8 support is a waste of time and effort / wasted opportunity considering:

  1. this really only benefits a small subset of use cases / customers
  2. this is a storage format; operations requiring linguistic rules and/or normalization are still handled in UTF-16 which requires converting in memory each time
  3. operations requiring in-memory conversion to UTF-16 are slower (generally) than using NVARCHAR
  4. there is a very large potential here for customers to hurt their systems by misunderstanding the appropriate uses and drawbacks of UTF-8, and applying it to data that will end up taking more space and/or taking an unnecessary performance hit.
  5. expanding data compression to work with NVARCHAR(MAX) (ideally including off-row data, but even if only working with in-row data) would:
    1. apply to many more use-cases
    2. would do better data compression than the best-case for UTF-8
    3. would not have the performance hit
    4. would not require any re-coding of apps
    5. would not introduce so many bugs
    6. would not have the potential for customers to accidentally bloat their systems or degrade performance without even getting any space savings

Yet, even if they remove the new UTF-8 feature (which they should), they now don’t have any time left to implement the appropriate feature (especially considering that removing this feature will take development and testing time). That is why this is such an unfortunate wasted opportunity. 😿


 

UPDATE 2019-01-28 / Real Purpose

On December 18th, 2018, new information was provided about the UTF-8 collations, and included a reason for this new feature that had not been mentioned before:

scenarios where legacy applications require internationalization and use inline queries: the amount of changes and testing involved to convert an application and underlying database to UTF-16 can be costly

What that means is, existing applications using VARCHAR data will have that data stored in VARCHAR columns, use VARCHAR variables and parameters, and will have string literals that do not have an upper-case “N” prefix. Supporting Unicode would previously require changing the datatypes of columns, variables, and parameters to NVARCHAR, and string literals would need to be prefixed with an upper-case “N”. It could take a significant amount of time to make and test those changes. By adding the UTF-8 collations, it will be possible to skip most of that work. Ok, sounds reasonable.

Of course, what is not being said is that there is still a significant amount of work entailed in changing the collations of ALL string columns, especially where indexes, foreign keys, defaults, check constraints, computed columns, etc are involved. In those cases you need to drop the dependent object(s) so that the column can be updated with the new collation. While there is an undocumented short-cut to all of that work, there is still A LOT of testing to do. I’m sure some people (maybe many?) will test with only their existing data and test cases, and that will be a huge mistake. Only the first 128 characters take up a single byte. But, the whole point of switching to UTF-8 is to support Unicode, so we need to assume that there will be characters being stored that were not available in the one code page that was previously supported. Not only will all of those new characters take up more than 1 byte, but even the second 128 characters from the original code page (which previously only required a single byte) will now require either 2 or 3 bytes (more on that in a moment). This means that a VARCHAR(20) column for a first or last name is no longer guaranteed to always store up to 20 characters. Now it will store up to 6 – 20 characters, depending on how many bytes each character requires.
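
As a hypothetical example (the name is arbitrary): ten Cyrillic letters require 20 bytes in UTF-8, so an explicit conversion into a 10-byte target keeps only the first five characters, and it does so silently:

-- Each Cyrillic letter is 2 bytes in UTF-8, so VARCHAR(10) only holds 5 of them:
SELECT CONVERT(VARCHAR(10),
           N'Владислава' COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [Truncated],    -- Влади
       LEN(CONVERT(VARCHAR(10),
           N'Владислава' COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [Characters],  -- 5
       DATALENGTH(CONVERT(VARCHAR(10),
           N'Владислава' COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [Bytes];       -- 10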

So how many customers / potential customers does this scenario actually apply to? Definitely some, but I wouldn’t think that many. To me, this does not justify the level of modification of the code base that’s required to handle sticking Unicode in what has always been non-Unicode datatypes. Sure, at first glance it seems like an easy fit since it’s still 8-bit data, but I think the complexity of this project was underestimated. And, I think that the long-term ramifications of adding “_UTF8” to the collation names hasn’t been fully considered. Adding the UTF-8 collations was a 39% increase in the number of collations. Now, if another option is to be added later, there could be up to 3982 collation names to duplicate.

Current Assessment

Knowing that the main reason for introducing the UTF-8 collations is for compatibility certainly helps, but I still have a bad feeling about this. This is a lot of complexity to add, and with the benefits applying to a (relatively) small number of users, it’s far more likely that this feature will be misused rather than used such that it provides an actual benefit to the project. I say this because a lot of people have heard about UTF-8, but not that many fully understand it (the topic of collations, in general, is confusing to most people). The Microsoft documentation is a perfect example: the two official documentation pages don’t even mention compatibility (it’s only mentioned in a blog post), and what is said about the potential for saving space (in two places) is misleading. For example, in the blog post, it states:

if your dataset uses primarily Latin characters, significant storage savings may also be achieved as compared to UTF-16 data types

That statement is inaccurate: it’s only the standard ASCII characters that provide any space savings. Yes, those characters are Latin characters, but they’re not the only Latin characters. There are 128 other characters on most code pages (a few have more), and they are all 2 or 3 bytes each in UTF-8 (even the ones on code page 1252, which is Latin1_General). You can see for yourself using the example code below:

-- DROP TABLE #Space;
CREATE TABLE #Space
(
  [CodePoint] TINYINT NOT NULL,
  [CP1252] VARCHAR(5) COLLATE Latin1_General_100_CI_AS_SC NOT NULL,
  [UTF8] VARCHAR(5) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
  [UTF8bytes] AS (DATALENGTH([UTF8]))
);

;WITH cte AS
(
  SELECT TOP(128) ROW_NUMBER() OVER (ORDER BY @@SPID) + 127 AS [num]
  FROM   sys.all_columns
)
INSERT INTO #Space ([CodePoint], [CP1252])
  SELECT cte.[num], CONVERT(BINARY(1), cte.[num])
  FROM   cte;


-- SELECT * FROM #Space;

UPDATE #Space
SET [UTF8] = [CP1252];

SELECT *
FROM #Space
ORDER BY [UTF8bytes], [CodePoint];
-- 111 are 2 bytes
--  17 are 3 bytes

Of course, if your data is all standard ASCII then you will get the full extent of the space savings (often at the cost of performance), but that statement implies that it is all of code page 1252 that is eligible for the space savings, and that just isn’t true.

Then, the “What’s New in SQL Server 2019” page states:

This feature may provide significant storage savings, depending on the character set in use.

“Character set” is synonymous with “code page”. However, which code page is being used has nothing to do with the potential for saving space (when moving to UTF-8) since the only characters that are a single byte are the same 128 characters that are the same across all code pages (at least all code pages available in Windows / SQL Server).

THE POINT IS: if Microsoft cannot communicate why the UTF-8 collations exist, what they do, when they should and should not be used, etc, then what chance do most users have of figuring out if they should use them or not?

Given the great potential for misusing this feature, I would have preferred to have either some other approach to the compatibility, or at least more time spent in planning and documenting before releasing these new collations. Maybe a startup-only trace flag to determine if you can access the UTF-8 collations. And, either don’t mention the space-saving potential of UTF-8, or at least mention the restrictions and consequences of using them. And of course, handle the saving-space issue as an entirely separate project.


 

UPDATE 2019-03-18 / CTP 2.3

CTP 2.3 was released on March 1st, 2019 (updated “What’s new” info), and includes:

  • 0 new UTF-8-related capabilities
  • 1 reported bug fixed
  • 2 reported bugs partially fixed
  • 0 new bugs found
  • 5 definite bugs, and 1 potential bug, still to fix
  • Please see new consolidated issue list below.

This CTP introduces the long-awaited BIN2 collation: UTF8_BIN2. This is a definite improvement over not having one, but there are a few issues that absolutely must be addressed.
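Before getting into the details, here is a minimal usage sketch (mine, not from the documentation), assuming the CTP 2.3 naming where the full collation name is simply UTF8_BIN2, and using a scratch #Bin2Test table:

-- Minimal usage sketch, assuming CTP 2.3 where the new collation is named simply UTF8_BIN2:
CREATE TABLE #Bin2Test
(
  [Val] VARCHAR(10) COLLATE UTF8_BIN2 NOT NULL
);

INSERT INTO #Bin2Test ([Val]) VALUES (N'a'), (N'A'), (N'á');

-- BIN2 = ordering and comparison by code point, not by linguistic rules:
SELECT [Val], CONVERT(VARBINARY(10), [Val]) AS [Utf8Bytes]
FROM   #Bin2Test
ORDER BY [Val];
-- returns: A (0x41), a (0x61), á (0xC3A1)

DROP TABLE #Bin2Test;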

Good:

  1. It exists (Yay!)
  2. No BIN collation to confuse people with.
  3. Only one BIN2 collation; not one BIN2 per culture, which would have added no benefit and likely would have confused people wondering what the difference was between the various UTF8_BIN2 collations.

Not So Good:

  1. Version 80 sort rules are used instead of version 140. This is a problem because version 80 collations are missing 649 uppercase / lowercase mappings that the version 140 collations have, which affects the accuracy of the UPPER() and LOWER() functions (see the sketch just after this list).
  2. No culture name, so inconsistent with all other collations. Should be something like “Invariant” or “General”, or even “Latin1_General_version#_” if need be (since the UTF8_BIN2 collation does have an LCID of 1033, which is “Latin1_General”).
  3. No version number in the name, so inconsistent with all other Windows collations.
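To get a sense of how much the version difference matters, here is a rough sketch (mine, not an official test) that counts how many BMP code points get a different UPPER() mapping under a version 80 collation (the same version as UTF8_BIN2) versus a version 140 collation. It assumes Japanese_XJIS_140_CI_AS is available (the 140-series collations ship with SQL Server 2017 and newer), and the exact count will depend on the build:

;WITH nums AS
(
  -- generate candidate BMP code points 1 - 65535 (same trick as the earlier example)
  SELECT TOP (65535) ROW_NUMBER() OVER (ORDER BY @@SPID) AS [cp]
  FROM   sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
)
SELECT COUNT(*) AS [DifferingUpperCaseMappings]
FROM   nums
WHERE  UPPER(NCHAR(nums.[cp]) COLLATE Latin1_General_BIN2)            -- version  80 rules
         <> UPPER(NCHAR(nums.[cp]) COLLATE Japanese_XJIS_140_CI_AS)   -- version 140 rules
            COLLATE Latin1_General_BIN2;                              -- binary comparison

Since UTF8_BIN2 is tied to the version 80 rules, the code points counted here are the ones where its UPPER() / LOWER() results lag behind what a version 140 collation returns.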

Also, this CTP partially fixes the installation failure when selecting a Unicode-only collation for the instance-level / server-level collation. When the UTF8 collations were first allowed to be selected during installation, the setup process failed with a nondescript error and the instance was left in a partial, unusable state; you would have to uninstall and then try again. Now, the installation still gets an error, but the process gets much farther, so the instance is usable. The setup failure does leave the instance in single-user mode, but you can stop it and start it up again. There might also be some installation steps that were never processed due to the failure, but I’m not sure which ones, or even how many.


 

UPDATE 2019-04-02 / CTP 2.4

CTP 2.4 was released on March 26th, 2019.

  1. No changes in any open issue, and no new issues
  2. This post will be updated in the next few days (hopefully by Tuesday, April 9th) to remove obsolete sections from the first few CTPs, consolidate recommendations / assessments, and add more context regarding the compatibility goal.


 

BUG / RESTRICTION TRACKING

The following is a consolidated list of the various issues surrounding the UTF-8 collations:

Issue | Status | Release
Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR | Open 😿 |
Collation Precedence allows for silent truncation (data loss) implicitly converting Extended ASCII characters into UTF8 | Open 😿 |
SSMS Always Encrypted wizard makes incorrect collation change for UTF8 columns | Open 😿 |
SqlClient incorrectly decrypted and encrypted Always Encrypted data for UTF-8 collations | Open 😿 |
Two collation problems with external tables in SQL 2019 | Open 😿 |
Linked Servers | Open 😿 |
In-memory OLTP | Open 😿 |
Inconsistent handling of invalid UTF-8 byte sequences | Open 😿 |
There are no binary Collations that support the UTF8 encoding!!! Why not? | Partially fixed | CTP 2.3
Install fails with “Could not find the Database Engine startup handle.” if a Unicode-only Collation is selected | Partially fixed | CTP 2.3
Can’t specify UTF-8 collation at the instance (server) level | Fixed | CTP 2.1
NVARCHAR to VARCHAR(MAX) can get Msg 8152 “String or binary data would be truncated” | Fixed | CTP 2.1
Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value | Fixed | CTP 2.1
“An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2” (disconnect) | Fixed | CTP 2.1
UTF-8 Collations don’t work with Replication | Fixed | CTP 2.2
CHARACTER_SET_NAME of INFORMATION_SCHEMA.COLUMNS is NULL when UTF-8 is used | Fixed | CTP 2.3

  1. According to Unicode Technical Note #14: A SURVEY OF UNICODE COMPRESSION (the linked PDF in this link):
    Because of the 1-byte representation of ASCII, many English-speaking users have come to regard UTF-8 as a compression format. This notion is puzzling to speakers of French and other Latin-script languages that use many non-ASCII characters, and ridiculous to users of almost every other script on Earth, for which UTF-8 uses at least one byte per character more than legacy encodings.
     
    Although UTF-8 is indeed “not extravagant” in size, there is a big difference between avoiding wastefulness and achieving compression. The primary reason to use UTF-8 should be to maintain ASCII transparency, not to achieve compression.
     
  2. According to Unicode Technical Note #12: UTF-16 FOR PROCESSING, Section 3: “UTF-8: Good for Compatibility”:
    UTF-8 was mainly designed to store Unicode filenames in an ASCII-friendly way. It is suitable for processing, but it is significantly more complex to process than UTF-16…
     
    …UTF-8 stores Latin text in a compact form compared to UTF-16 but does not provide any advantages for other scripts or even uses more space. There are other Unicode charsets (for example, SCSU and BOCU-1) that are more efficient for storage and data exchange.

    Please note that “SCSU” is the “Standard Compression Scheme for Unicode”, and “BOCU-1” is the “MIME-compatible application of the Binary Ordered Compression for Unicode (BOCU) algorithm”.
      

8 thoughts on “Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?”

  1. As someone who has dedicated hours of personal research into understanding the differences between Unicode and UTF-8 in general, and with SQL Server in particular, your opening paragraphs are fantastically clear and well explained. Much appreciated. Working with federal data that is English only, we too have been plagued with the issue where 99% of strings are VARCHAR but occasionally some esoteric Hawaiian characters not covered by the Latin collation show up. Here’s hoping the SQL Server team addresses the issues you found and gets this right before the final release.


  2. It’s worth mentioning that ISO has some conventions where their encodings, such as metric units, have to use the basic Latin alphabet and a set of punctuation marks, which are roughly UTF-8. Unicode is required to support this little subset in all of its languages, so that standard encodings can be used anywhere on earth.


    1. Hi Joe. I think you are misstating / misunderstanding what is going on here. I thought I had pointed out how the 1-byte range of UTF-8 (which is really only half of what 1 byte can represent) is exactly the same as 7-bit US-ASCII. It seems that I did not include that info before, but I have updated in a few places to be clearer about that association. That range, especially values 32 – 126 (see the chart just above the “Setting and Storing UTF-8 Values” section), are common across most code pages / character sets. That commonality across code pages is what made those particular characters become the only allowable characters across many standards and RFCs, many of which were published prior to the existence of UTF-8, which came out circa 1996. For example, the RFC for SMTP, published in 1982, was updated in 2008 and still refers to US-ASCII: RFC for SMTP (search for “ascii”).

      One of the design requirements of UTF-8 was that it work seamlessly on existing systems (UNIX in particular), without needing to convert any existing text files. This is why characters encoded using just the first 7 bits of UTF-8 match exactly to the 7-bit US-ASCII encoding. So it’s not that the subset is “roughly UTF-8”, since UTF-8 can encode the entire 1.1+ million character range of Unicode; it’s that a subset of UTF-8 is exactly standard ASCII.

      Also, Unicode doesn’t have languages; it has encodings. And there are only three (mainly): UTF-8, UTF-16, and UTF-32. But, due to endianness, both UTF-16 and UTF-32 have two different physical representations, making it 5 encodings in practice. BUT, it is only UTF-8 that has the ability to encode that subset of characters to be physically the same as 7-bit standard ASCII. For example, Latin upper-case “A” is 0x41 in both US-ASCII and UTF-8. Yet it is either 0x0041 or 0x4100 in UTF-16, depending on endianness. While the non-zero byte of UTF-16 is the same as the only byte in UTF-8 and US-ASCII, it’s still 2 bytes vs 1 byte, and that makes a big difference when parsing a series of bytes.
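      You can see those byte sequences directly; a quick sketch (assuming a SQL Server 2019 CTP with the UTF-8 collations, and reusing the Latin1_General_100_CI_AS_SC_UTF8 collation from earlier in the post):

      -- "A" (U+0041) as stored bytes: 0x41 in UTF-8 (identical to US-ASCII),
      -- but 0x4100 in NVARCHAR, which is UTF-16 Little Endian:
      SELECT CONVERT(VARBINARY(4),
                     CONVERT(VARCHAR(4), N'A' COLLATE Latin1_General_100_CI_AS_SC_UTF8)
                    ) AS [UTF8bytes],                     -- 0x41
             CONVERT(VARBINARY(4), N'A') AS [UTF16LEbytes]; -- 0x4100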

      Finally, it is not the encodings that are standard (since there are 5 of them). It is the character set of Unicode that is truly standard, as it is the same across all 5 encodings. For example, “Upside-Down Face” is Code Point U+1F643 across all five encodings (see here – scroll down to the “Encodings” section).


