Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

Simpsons_AngelAndDevilHomer

(last updated: 2018-12-13 @ 09:35 EST / 2018-12-13 @ 14:35 UTC )

(NOTE: For recent update, please see “Update for CTP 2.1” section; will update soon for CTP 2.2)

For some time now, many of us have struggled with data that is mostly standard US-English / ASCII characters but also needs to allow for the occasional non-ASCII character. VARCHAR is typically one byte per character but can’t represent a wide range of accented characters all at the same time (more than would be found on a single 8-bit Code Page). NVARCHAR , being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes. When we have data that is most often standard ASCII, but has the potential (whether it ever happens or not) to have non-ASCII characters (names, URLs, etc), we have no choice but to use NVARCHAR. However, then we are wasting space for all of the data that could fit into VARCHAR and take up half as much space. While some claim that “disk is cheap“, wasting space negatively impacts query performance, backup size, backup and restore times, etc.

One approach is to have two columns for the data, one for each datatype. The idea is to store values that are 100% ASCII in the VARCHAR column, and anything else in the NVARCHAR. Then you can have a computed column return the non-NULL column for that data. While this approach does work, it is not something you are going to implement in 50 or more columns.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

(drum-roll please)…

SQLCLR (introduced in SQL Server 2005). This allows one to use .NET to access GZip/Deflate functionality such that you can compress data into VARBINARY(MAX) on the way in, and uncompress back to NVARCHAR(MAX) on the way out. And you can even get pre-done versions of these — Util_GZip and Util_GUnzip —in the Free version of SQL# :-). Of course, some folks don’t like SQLCLR (for some reason), and you need to decompress in order to view / search / filter on the value, and this might be a bit much to do on many columns (though not as bad as having both VARCHAR and NVARCHAR versions of the column).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Data Compression (introduced in SQL Server 2008). But this was a bit simplistic, only available in Enterprise Edition, and didn’t work for off-row values (i.e. row overflow and MAX values that can’t fit in-row).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Unicode Compression (introduced in SQL Server 2008 R2, and implemented as part of Data Compression). But this was still only available in Enterprise Edition, and still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Compression for the all editions, not just Enterprise (as of SQL Server 2016, SP1). But, this still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

COMPRESS and DECOMPRESS built-in functions (introduced in SQL Server 2016). You can use these to Gzip and Ungzip values (without resorting to SQLCLR), and store the much smaller VARBINARY value. Do this in stored procedures to be transparent to the caller. But this is not for everyone, and this might be a bit much to do on many columns.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Support for MAX types in Clustered Columnstore Indexes (as of SQL Server 2017). But, this is not for every scenario, especially because it affects the whole table.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

UTF-8 Encodings (introduced in the upcoming SQL Server 2019). UTF-8 appears to be the ideal solution: standard ASCII characters take up only 1 byte, all Unicode characters are supported, the value is still a string (not binary) so it can be searched / filtered on / indexed / etc, and it is the preferred encoding for HTML and XML files. No need for an extra column plus a computed column, no need for SQLCLR or COMPRESS, no binary values, and no getting frustrated that Unicode Compression doesn’t work with NVARCHAR(MAX). What’s not to love?

Let’s explore…

TL; DR: While interesting, the new UTF-8 Collations only truly solve a rather narrow problem, and are currently too buggy to use with confidence, especially as a database’s default Collation. These encodings really only make sense to use with NVARCHAR(MAX) data, if the values are mostly ASCII characters, and especially if the values are stored off-row. Otherwise you are better off using Data Compression. Unfortunately, this feature provides much more benefit to marketing than it does to users. What would have been a huge benefit to users (better compression and not limited to only helping ASCII charcters) is to get Unicode Compression working for NVARCHAR(MAX).

Basics

Documentation

Let’s start with what we are told about this new feature. According to the documentation, the new UTF-8 Collations:

  1. can be used …
    1. as a database-level default Collation
    2. as a column-level Collation
    3. by appending “_UTF8” to the end of any Supplementary Character-Aware Collation (i.e. either having “_SC” in their name, OR being of level 140 or newer and not being binary)
    4. with only the CHAR and VARCHAR
  2. (implied) have no effect on NCHAR and NVARCHAR data (meaning: for these types, the UTF-8 Collations behave the same as their non-UTF-8 equivalents
  3. “This feature may provide significant storage savings, depending on the character set in use.” (emphasis mine)

Installing

When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is stillSQL_Latin1_General_CP1_CI_AS“. Why, why, WHY???? Why not use “Latin1_General_100_CI_AS_SC“, which has the same sensitivities but is newer, has sort weights for more characters, supports Supplementary Characters, and is a Windows Collation (hence you don’t prevent seeks on indexed VARCHAR columns when compared to Unicode data). Sure, changing the default Collation after all these years will result in some headaches when needing to work with existing systems (that didn’t bother to change the default Collation), but it ensures that new projects started by folks who don’t know to change it will spread farther and farther throughout the galaxy, tightening its stranglehold on us (i.e. making it even harder to move away from). But that has nothing to do with UTF-8, so we can ignore it for now ( <grrrrr> , but please vote for Make Latin1_General_(100)_CI_AS the default collation for US English).

On the “Customize Database Engine Collation” screen, when selecting a Windows Collation:

  1. Collations without a version number do not allow checking either the “Supplementary characters” or “Variation selector-sensitive” options.
  2. Collations with version numbers “90” or “100” allow for checking the “Supplementary characters” option, but not the “Variation selector-sensitive” option.
  3. Collations with version number “140” have the “Supplementary characters” option checked (and cannot be unchecked), and allow for checking the “Variation selector-sensitive” option.
  4. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) There is no means of selecting a UTF-8 Collation, even if the “Supplementary characters” option is checked.

Using

By executing the queries below, we can confirm that adding _UTF8 to the end of Supplementary Character-Aware Collations does produce valid Collation names. I executed the following in [master] (which is guaranteed to not be using a UTF-8 Collation as its default Collation since there was no way to install SQL Server with one):

-- Version 90 (SQL 2005; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Japanese_90_CS_AS_SC_UTF8;

-- Version 100 (SQL 2008; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Version 140 (SQL 2017; no _SC version since it is implicitly supported)
SELECT 'a' COLLATE Japanese_XJIS_140_CS_AS_UTF8;

All of the above queries returned successfully.

The following query will list all of the UTF-8 Collations:

SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
ORDER BY col.[name];
-- 1552 rows, all with CodePage = 65001

There are 5507 total Collations in SQL Server 2019, which is 1552 more than the 3955 that exist in SQL Server 2017. Hence the only new Collations are these “_UTF8” Collations.

If you execute the query above, do you notice anything missing? That’s right: no binary options for the “_UTF8” Collations (this is a mistake that needs to be addressed).

The following query shows that we can indeed use use VARCHAR to store Unicode characters, but only when using one of the new UTF-8 Collations:

SELECT NCHAR(27581), -- 殽
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC), -- ?
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC_UTF8); -- 殽

The following query confirms that the new UTF-8 Collations cannot be used with the TEXT datatype:

SELECT CONVERT(TEXT, N'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8);

returns:

Msg 4189, Level 16, State 0, Line XXXXX
Cannot convert to text/ntext or collate to 'Latin1_General_100_CI_AS_SC_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.

The following query shows that the NVARCHAR value is the same between the UTF-8 and non-UTF-8 Collation, and the same character can be held in VARCHAR, but only when using a UTF-8 Collation, and as a different underlying byte sequence:

DECLARE @Sample NVARCHAR(1) = NCHAR(0x5010);
SELECT @Sample; -- 倐

SELECT CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- 0x1050
       CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 0x1050

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- ?

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 倐
       CONVERT(VARBINARY(2), CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)); -- 0xE580;

And, the following demonstrates that a UTF-8 Collation can be used at the database level:

CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;

USE [UTF8];

The USE statement is there because the remaining examples that follow in this post will be executed from within the “UTF8” database.

General UTF-8 Info

UTF-16 (for context)

UTF-16, which is the Unicode encoding used for NVARCHAR data, uses 16-bit (i.e. 2-byte) values, known as “code units“, to represent characters. These “code units” are used individually (a 2-byte code unit), or in pairs (2 two-byte code units) to get Supplementary Characters, to map to individual “code points“. UCS-2, which only maps the BMP characters (i.e. the first 65,536 code points), is a fixed-length encoding because it only deals with single code units. UTF-16, which uses the exact same code units as UCS-2, can use certain combinations of two of those code units (called Surrogate Pairs) to map the non-BMP characters (i.e. code points 65,537 and above, known as Supplementary Characters).

Because these values are two bytes, they are subject to “endianness”, which refers to the order in which the individual bytes that make up the value are stored internally (in memory, on disk, etc). The hardware determines the “endianness” of how multi-byte values (including INT, SMALLINT, BIGINT) are stored, and the two possibilities are “Big Endian”, which is the given / natural ordering (i.e. value 0x1234 — where 0x12 is byte #1, and 0x34 is byte #2 — is stored as 0x1234), or “Little Endian”, which refers to storing the bytes in reverse-order (i.e. value 0x1234 is stored as 0x3412).

Unicode strings in Windows, “string” in .NET, and NVARCHAR in SQL Server are all UTF-16 Little Endian.

UTF-8

UTF-8 is another Unicode encoding which uses between one and four 8-bit (i.e. 1-byte) code units to represent characters. Even though it is Unicode, due to working 8-bits at a time, this encoding is considered VARCHAR data. This is also a variable-length encoding because it requires anywhere from 1 to 4 code units to map to any given code point. Due to these code units being single bytes (even in combinations of 2, 3, or 4), there is no concept of “endianness”.

One main advantage of UTF-8 is that it’s fully compatible with any existing data that’s comprised entirely of 7-bit ASCII characters (values 0 – 127). This is by design, but not really a reason for SQL Server to use it for storing data internally. The other main advantage of UTF-8 is that it can take up half the space while still allowing for the full range of Unicode characters. This would be a reason for SQL Server to use it internally. BUT, it only takes up half the space for a small subset of characters. Of course, if the characters in this small subset that are only 1 byte are the vast majority of characteres are being used, then this encoding does provide for significant space savings (in terms of disk space, memory, and reduced network traffic). What people tend to overlook, though, is that most Unicode characters require 3 bytes, which is 1 more byte than UTF-16 for those same characters.

You can execute the following query in SQL Server 2019 to see how many bytes each character requires for both UTF-8 and UTF-16 encodings. It returns all 65,536 BMP (Base Multilingual Plan) characters (which is also the entire UCS-2 character set), and 3 Supplementary Characters. Since all Supplementary Characters are 4 bytes in both encodings, there is no need to return more of them, but we do need to see a few of them to see that they are a) all 4 bytes, and b) encoded slightly differently. The query will work the same regardless of whether or not the current database uses a “_UTF8” Collation. It also does not matter if the database’s default Collation is Supplementary Character-Aware or not.

;WITH nums([CodePoint]) AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1)
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
), chars AS
(
  SELECT nums.[CodePoint],
         CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   nums
  UNION ALL
  SELECT tmp.val,
         CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex)
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5),
              tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   (VALUES (65536, 0x00D800DC), -- Linear B Syllable B008 A (U+10000)
                 (67618, 0x02D822DC), -- Cypriot Syllable Pu (U+10822)
                 (129384,0x3ED868DD)  -- Pretzel (U+1F968)
         ) tmp(val, hex)
)
SELECT chr.[CodePoint],
       COALESCE(chr.[TheChar], N'TOTALS:') AS [Character],
       chr.[UTF8] AS [UTF8_Hex],
       DATALENGTH(chr.[UTF8]) AS [UTF8_Bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 1 THEN 'x' END) AS [1-byte],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 2 THEN 'x' END) AS [2-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 3 THEN 'x' END) AS [3-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 4 THEN 'x' END) AS [4-bytes],
      ---
      CONVERT(VARBINARY(4), CONVERT(NVARCHAR(3), chr.[TheChar]))
         AS [UTF16(LE)_Hex],
      DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16_Bytes],
      ---
      ((DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar]))) -
            (DATALENGTH(chr.[TheChar]))) AS [UTF8savingsOverUTF16]
FROM   chars chr
GROUP BY ROLLUP ((chr.[CodePoint], chr.[TheChar], chr.[UTF8]));

There isn’t enough room here to show all 65,538 rows, so here is a summary of the info:

Code Points Code Point Quantity UTF-8 Size (bytes) UTF-16 Size (bytes) UTF-8 savings
0 – 127 128 1 2 1
127 – 2047 1920 2 2 0
2048 – 65535 63488 3 2 -1
65536 – lots 4 4 0

As you can see, only the first 128 Code Points offer any size savings over UTF-16. The characters in that range are (minus the first 32 which are control characters, and the last character):

[space] ! " # $ % & ' (
) * + , . / 0 – 9 :
; < = > ? @ A – Z [ \
] ^ _ ` a – z { | } ~

Those characters, and their values (32 – 126), are exactly the same as the standard ASCII character set, and are the same across most code pages (certainly all code pages supported by SQL Server). Any characters not in the chart above do not offer any space savings, with most of the BMP characters actually taking up more space than they would if using UTF-16 (i.e. NVARCHAR ).

Setting and Storing UTF-8 Values

An encoding is how the characters are represented internally. Whether the source is a non-Unicode string, a Unicode string, or a binary string (i.e. series of bytes / hex bytes), the end result will be the chosen encoding’s value for that character.

First, let’s try setting a variable, in the “[UTF8]” database, from a VARBINARY literal:

DECLARE @HotDog VARCHAR(4) = 0xF09F8CAD;
PRINT @HotDog; -- ( U+1F32D )

Returns:   🌭 (in the “Messages” tab)

As you can see, it produces the expected character from the given 4 bytes. However, those same 4 bytes do not produce the correct character when stored as NVARCHAR , which expects UTF-16 Little Endian bytes sequences:

DECLARE @NotHotDog NVARCHAR(4) = 0xF09F8CAD;
PRINT @NotHotDog; -- ( U+1F32D )
-- 鿰권

Now let’s look at how the size of the (N)VARCHAR types impact what can be stored in them:

SELECT CONVERT( VARCHAR(3), 0xF09F8CAD) AS [TooSmallForUTF8-HotDog],
     DATALENGTH(CONVERT(VARCHAR(3), 0xF09F8CAD)) AS [ByteCountForTooSmall],
       CONVERT( VARCHAR(4), 0xF09F8CAD) AS [UTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(4), 0xF09F8CAD)) AS [ByteCountForHotDog],
       ------
       CONVERT(NVARCHAR(1), 0x3CD82DDF) AS [TooSmallForUTF16-HotDog],
       CONVERT(NVARCHAR(2), 0x3CD82DDF) AS [UTF16-HotDog];

Returns:

TooSmallFor
UTF8-HotDog
ByteCountFor
TooSmall
UTF8-
HotDog
ByteCountFor
HotDog
TooSmallFor
UTF16-HotDog
UTF16-
HotDog
0 🌭 4 🌭

The first column appears to return nothing, not even a partial, incorrect / unknown character. The second column confirms that there is absolutely nothing resulting from attempting to store a 4-byte UTF-8 value in a VARCHAR type that is only 3 bytes. It couldn’t even take just the first 3 bytes of the 4-byte value. The third and fourth columns show that given the type the required 4 bytes allows everything to work as expected.

On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR type that is too small, the result is the default Replacement Character ( U+FFFD ). This is because it was able to store the first half of the Surrogate Pair (the 0x3CD8), which is meaningless on its own, but it is still a valid byte sequence.

Size and Performance

{ This section is taking a while as I figure out how to best present the variety of tests and their results, but the general info is all there. }

Data Profiles

  1. Some ASCII (values 1 – 254), but mostly non-ASCII
    • [N]VARCHAR(500)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
  2. Some ASCII MAX
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
       
  3. All ASCII
    • [N]VARCHAR(500)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
  4. All ASCII MAX (all in row)
    • [N]VARCHAR(MAX)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
       
  5. All ASCII MAX (10k characters per row; all off-row)
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 10,007; AVG = 10,025; MAX = 10,061
  6. All ASCII MAX (5k characters per row; mixed in / off -row; UTF-8 = in-row, NVARCHAR = off-row)
    • [N]VARCHAR(MAX)
    • 21,078 rows
    • Characters per row: MIN = 5,007; AVG = 5,025; MAX = 5,061

Tests:

These were executed in a database which has a “_UTF8” default Collation.

SELECT REVERSE([Value])
FROM dbo.[{table}];


SELECT SUBSTRING([Value], 5, 13)
FROM dbo.[{table}];


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] = 'abcdef12345'; -- UTF16 test uses N'abcdef12345'


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] LIKE '%123%'; -- UTF16 test uses N'%123%'


SELECT MIN(LEN([Value])), AVG(LEN([Value])), MAX(LEN([Value]))
FROM   dbo.[{table}];


SELECT [Value]
FROM   dbo.[{table}];

Each test was wrapped in:

SET STATISTICS TIME ON;
...{test}...
SET STATISTICS TIME OFF;

Results

  1. Some ASCII in non-MAX types
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated better as well, but not tested.
    • Did not test performance given no space savings
  2. Some ASCII in MAX types (in-row)
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated the same, but not tested.
    • Did not test performance given no space savings
       
  3. All ASCII in non-MAX types
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR)
    • UTF-8 is only slightly smaller here when Compression = ROW or PAGE (approx. 1% smaller)
    • NVARCHAR is slightly smaller here when table is COLUMNSTORE (approx. 1% smaller)
    • Performance of NVARCHAR is same or better than UTF-8
  4. All ASCII in MAX types (in-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR; 40% for ROW and PAGE)
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 9% smaller)
    • Performance is sometimes the same, but NVARCHAR is usually better than UTF-8
  5. All ASCII in MAX types (10k; off-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 46% smaller)
    • ROW and PAGE compression do not apply to off-row data
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 1.3% smaller). The NVARCHAR table is still 98% smaller than it was originally.
    • With no compression, NVARCHAR is faster for all tests.
    • When table is CLUSTERED COLUMNSTORE, NVARCHAR is faster for all tests except “SUBSTRING” and “=”.
  6. All ASCII in MAX types (5k; mixed on / off-row: VARCHAR = on, NVARCHAR = off)
    • UTF-8 is smaller here when Compression = NONE, ROW, or PAGE (approx. 25% smaller)
    • UTF-8 is much smaller here when table is CLUSTERED COLUMNSTORE (approx. 92% smaller). BUT, this is typically a non-issue since the NVARCHAR table is still 96% smaller than it was originally. This becomes an issue as the uncompressed data gets closer to 100 GB.
    • With no compression, NVARCHAR is faster for all tests except “LIKE ‘%x%'”.
    • When table is CLUSTERED COLUMNSTORE, UTF-8 is faster for all tests except REVERSE (elapsed time was the same, but CPU was better)

Typically there is at least a slight performance hit when using UTF-8. I believe the issue is due to UTF-8 being a variable-length encoding, which means that each byte must be interpreted as it is read in order to know if it is a complete character or if the next byte is a part of it. This means that all string operations need to start at the beginning and proceed byte-by-byte. On the other hand, NVARCHAR / UTF-16 is always 2 bytes (even Supplementary Characters are comprised of two 2-byte Code Points), so everything can be read in 2-byte chunks.

The area of biggest improvement of UTF-8 over NVARCHAR is when there is between 4001 and 8000 bytes of mostly standard ASCII characters. In this scenario, you would need to use NVARCHAR(MAX) to get over 4000 bytes, but then you lose the ability to use Unicode Compression. And, being over 4000 bytes, the NVARCHAR data must be stored off-row (a slight performance hit), while that same data can be stored in-row when using UTF-8. This is the only scenario where UTF-8 is better for both space and performance.

Limitations

Known (i.e. Stated in Documentation)

The UTF8 Collations do not work with:

  1. Replication
  2. Linked Servers
  3. In-memory OLTP   (NOTE: For clarification based on recent testing, please see “Update for CTP 2.1” section)
  4. External Table for Polybase
  5. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations cannot be used as the Instance-level Collation (** please see next section for the “EXCEPT, …”):
    • Currently wording is that they can only be used at the database and column level, but yes, they can also be used inline via the COLLATE keyword.
    • Not an option via the installer (as mentioned at the top of this post)
    • Cannot be specified via:
      SETUP.EXE /QUIET /ACTION=REBUILDDATABASE ^
      /SQLCOLLATION=Latin1_General_100_CI_AS_SC_UTF8 ^
      ...

      The result is:
      The following error occurred:
      The collation Latin1_General_100_CI_AS_SC_UTF8 was not found.
       
      Error result: -2061893630
      Result facility code: 1306
      Result error code: 2

Unknown (i.e. Oops)

  1. (NOTE: For recent update that renders this point obsolete, please see “Update for CTP 2.1” section) UTF8 Collations can be used as the Instance-level Collation. This cannot be set via the installer or running SETUP.EXE with the /SQLCOLLATION switch, BUT they can be used via the undocumented method (of changing all Collations for an Instance):
    sqlservr -c -m -T4022 -T3659 -s"{instane_name}" ^
    -q"Latin1_General_100_CI_AS_SC_UTF8"`

    Yes, this does work. I have tested it. For more details on this method, please see:
    Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?
    PLEASE NOTE: Just because you can do this doesn’t mean that you should do this. Given that the only way of setting an Instance to use a UTF8 Collation is via an undocumented method, it is quite possible that this has never been tested internally at Microsoft and is not guaranteed to work. It might work perfectly, or there could be unexpected behaviors. Use at your own risk.
     
  2. There are no binary Collations that support the UTF8 encoding!!! Why not?. But why do we need them? Here are some reasons:
    1. Microsoft’s implementation of Unicode has incomplete, and some incorrect, sort weights. One example:
      Choosing a binary collation that can differentiate between ‘ss’ and ‘ß’ for nvarchar column in Sql Server
    2. When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from this post provides a working example: No, Binary Collations are not Case-Sensitive
    3. When storing alphanumeric codes (e.g. airline confirmation codes, phone numbers, postal codes, etc), binary sorting and comparison is much faster since it ignores all linguistic rules, which is perfect for scenarios such as these where there is no expectation of applying such rules.
    4. Finding / removing “(null)” character U+0000 / CHAR(0)
       
  3. Bugs:
    1. NVARCHAR to VARCHAR(MAX) can get Msg 8152 “String or binary data would be truncated” — FIXED IN CTP 2.1
    2. Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value — FIXED IN CTP 2.1
    3. Inconsistent handling of invalid UTF-8 byte sequences
    4. CHAR() function broken in databases with “_UTF8” default Collation (not a bug after all; just needs better documentation)
    5. An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2″ (disconnect) — caused by NULL UTF-8 values — FIXED IN CTP 2.1
    6. Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR

Recommended Uses / Guidance

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. Unfortunately, there is very little use for a “_UTF8” encoding given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding is one in which all of the following conditions are true:

  1. Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page)
  2. Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won’t fit into NVARCHAR(4000))
  3. There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR)
  4. Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes
  5. There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about lack of ability to index the VARBINARY value since it is MAX data anyway that cannot be indexed). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of “=”) or manipulated.
  6. The benefits of reducing the size of backups and reducing the time it takes to backup and restore, and reducing the impact on the buffer pool, outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.

Conclusion

Adding support for UTF-8 is certainly interesting. However, being the preferred encoding for the web, and even the default encoding for Linux / Unix (at least some flavors), doesn’t imply that UTF-8 is truly usefull within SQL Server. UTF-8 is a great choice for scenarios where there is:

  1. a need to be fully compatible with standard ASCII (to work with legacy systems / code)
    and / or
    a desire to at least save space when most data is standard ASCII anyway,
  2. a need / desire to support all characters (especially without needing to change Code Pages), and
  3. no ability to use Unicode Compression.

So, within the context of SQL Server, UTF-8 has the potential for helping a very specific, but not unimportant, use case: NVARCHAR(MAX) columns, where Unicode Compression doesn’t work. Still, I think the development / testing time spent on this would have been much better applied to something more useful to more users, such as the following (the first two of which would do a better job of accomplishing the space-saving goal of UTF-8, and without sacrificing performance):

  • getting Unicode Compression (part of Data Compression) to work on at least in-row values of NVARCHAR(MAX), if not also off-row values (please vote for: “unicode compression nvarchar(max)” — this would save more space than UTF-8)
  • getting Data Compression to work for off-row LOB / Row-Overflow values (please vote for: “Page Compression for Out-of-Row LOB Storage” — this would save space for more than just Unicode strings: it would also apply to VARCHAR, VARBINARY, and XML)
  • getting mixed alphanumeric values to sort as expected (i.e. “natural sorting” — “21a” sorts after “3a” instead of before — please vote for: “Support "natural sorting" / DIGITSASNUMBERS as a Collation option“)

Given that it is not fun to change Collations for databases, I would strongly recommend against using the new UTF-8 Collations, at least not at the database level, and definitely not in Production. With the obviously missing binary versions, and 5 bugs found in only a couple of days of testing for this post (not testing a full, complex application), this feature is incomplete, and there might yet be more bugs / odd behavior. I would suggest, if it looks like it will take quite a bit of development time to fix these six issues, then perhaps just remove this feature entirely (no big loss as it provides very little, if any, benefit) and spend that time on one of the three items noted directly above (think of how many customers / projects would benefit from Data Compression working for off-row LOB / overflow data).

If you are going to use a UTF-8 Collation on one or more columns, make no assumptions about size reduction or better performance! Be sure to capture current size and performance stats while column is NVARCHAR, and then compare both after changing to UTF-8.

Resources

UPDATE 2018-11-27 / CTP 2.1

CTP 2.1 was released in early November (updated “What’s new” info), and includes:

Quirks

Here are some issues that you might run into when using a UTF-8 Collation at either the Instance level or Database level. These are most likely not bugs since they have been deprecated for a while now. Still, it is possible that you have legacy systems and/or 3rd party applications (some that you cannot change) using these features.

  1. TEXT / NTEXT datatypes in temporary tables (when Instance-level Collation is “_UTF8”) and table variables (when Database-level Collation is “_UTF8”):
      CREATE TABLE [#D'oh] (col1 NTEXT); -- ' (CSS formatting bug work-around)
      /*
      Msg 4188, Level 16, State 1, Line XXXXX
      Column or parameter 'col1' has type 'ntext' and collation
         'Japanese_XJIS_140_CI_AI_VSS_UTF8'. The legacy LOB types do not
         support UTF-8 or UTF-16 encodings. Use types varchar(max),
         nvarchar(max) or a collation which does not have the _SC or
         _UTF8 flags.
      */
      

    Of course, there is an easy work-around to these errors: for columns using either of these datatypes, use the COLLATE keyword to specify a non-UTF8 Collation (e.g. “COLLATE Latin1_General_100_CI_AS“).

  2. sys.fn_trace_gettable: The “TextData” column of the result set uses the NTEXT datatype. This should only be an issue when using a UTF-8 Collation at the Instance level.
  3. sys.fn_get_sql: The “text” column of the result set uses the TEXT datatype (which was wrong even before this function was deprecated since the datatype should have been NTEXT 😲 ):
      SELECT * FROM sys.fn_get_sql(0x03);
      /*
      Msg 4189, Level 16, State 0, Procedure fn_get_sql,
         Line XXXXX [Batch Start Line YYYYY]
      Cannot convert to text/ntext or collate to
         'Japanese_XJIS_140_CI_AI_VSS_UTF8' because these legacy LOB
         types do not support UTF-8 or UTF-16 encodings. Use types
         varchar(max), nvarchar(max) or a collation which does not
         have the _SC or _UTF8 flags.
      Msg 4413, Level 16, State 1, Line XXXXX
      Could not use view or function 'fn_get_sql' because of binding errors.
      */
      

    This should only be an issue when using a UTF-8 Collation at the Instance level.

You can use the following query to check your system to see if there are any columns using either the TEXT or NTEXT datatypes. This does not check either stored procedure / trigger / function / SQL Agent T-SQL job steps content to see if these datatypes are being used in temporary tables or table variables or permanent tables being created or altered, nor does it check T-SQL code submitted by anything outside of SQL Server (such as your app code). So be sure to test, test, test (which you were going to do anyway, right?).

Now, to be fair, these same quirks exist when using any Supplementary Character-Aware Collation (i.e. one with either “_SC” or “_140_” in its name) at the Instance or Database level.

Please note that the following query only reports objects for the database in which it is being executed (also, the two deprecated functions mentioned above will show up in all databases as they are global).

SELECT OBJECT_SCHEMA_NAME(ac.[object_id]) AS [SchemaName],
       ao.[name] AS [ObjectName],
       ao.[type_desc] AS [ObjectType],
       ac.[name] AS [ColumnName],
       ac.[system_type_id],
       ac.[collation_name]
FROM   sys.all_columns ac
INNER JOIN sys.all_objects ao
        ON ao.[object_id] = ac.[object_id]
WHERE  ac.[system_type_id] IN (35, 99); -- 35 = TEXT ; 99 = NTEXT

In-memory OLTP

The documentation states that the “_UTF8” Collations do not work with In-memory OLTP. We can easily confirm that via the following tests.

This:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);

and this:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC -- NOT UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
-- Success

ALTER TABLE dbo.[UTF8]
    ALTER COLUMN [Name]
    VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL;

both get the following error:

Msg 12356, Level 16, State 157, Line XXXXX
Comparison, sorting, and manipulation of character strings that use a UTF8 collation is not supported with memory optimized tables.

And this statement (same as the first CREATE statement directly above, but including an INDEX on the [Name] column):

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL INDEX [IX_UTF8_Name] NONCLUSTERED
)
WITH (MEMORY_OPTIMIZED = ON);

gets the following error:

Msg 12357, Level 16, State 158, Line XXXXX
Indexes on character columns that use a UTF8 collation are not supported with indexes on memory optimized tables.

And, attempting to change a Database’s default Collation to be a “_UTF8” Collation will error:

ALTER DATABASE [Hekaton] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

receives:

Msg 41335, Level 16, State 3, Line XXXXX
Modifying the collation of a database is not allowed when the database contains memory optimized tables or natively compiled modules.

(that error message is not entirely accurate since you can change the Collation to a non-UTF-8 Collation, though you still might need to drop natively compiled modules first.)

HOWEVER, using the “sqlservr -q” method of changing all Collations for an Instance allows us to force a UTF-8 Collation on a Database containing memory-optimized tables, and even on columns in those memory-optimized tables:

sqlservr.exe -c -m -T4022 -T3659 -s"{instance_name}" ^
-q"Latin1_General_100_CI_AS_SC_UTF8"

After this (unsupported) change, the Database generally works, but there are a few definite issues:

  1. Data corruption in VARCHAR columns for characters with values 128 and above (i.e. outside of the standard ASCII set that ends at value 127). This affects “durable” memory-optimized tables (i.e. DURABILITY = SCHEMA_AND_DATA). For example, if the original Collation is Korean_100_CI_AS, then the byte sequence of “0xB15A” will produce “켣”. If you then use this method to switch to any “_UTF8” Collation, you will end up seeing “�Z” because “0xB1” is an invalid byte sequence in UTF-8, and “0x5A” is valid and is the “Z” character.
  2. Indexes on memory-optimized tables will not be dropped and recreated, and so in many cases they will be out of order (i.e. in the wrong order). For example, switching from Latin1_General_100_CI_AS_SC to Latin1_General_100_CI_AS_SC_UTF8 (only difference is adding “_UTF8” to the end of the original Collation) would cause indexes to become out of order, but only if they have key columns that are VARCHAR with data containing characters of values 128 or higher. (INCLUDE columns that are VARCHAR will have corrupted data for characters of values 128 or higher, but this does not affect the order.)
  3. Related to out-of-order indexes: a simple SELECT works, even when adding in ORDER BY or GROUP BY. Adding a WHERE clause will also work, as long as no new rows have been added. Once a new row has been added, using a WHERE clause referencing an indexed string column will result in the following error:
    Msg 9100, Level 21, State 2, Line XXXX
    Possible index corruption detected. Run DBCC CHECKDB.

    And you will be disconnected. OR, after adding a row, you might be able to successfully use a WHERE clause, but the first time you use the ORDER BY clause you will get the following error:
    Msg 701, Level 17, State 153, Line XXXXX
    There is insufficient system memory in resource pool 'default' to run this query.

    After that, you will probably get the “Possible index corruption detected” error, even when using just the WHERE clause.

Keep in mind that indexes on memory-optimized tables cannot be rebuilt or even dropped. Fixing these issues requires dropping, recreating, and reloading any affected tables. Again, this (using UTF-8 Collations with memory-optimized tables) is not a supported configuration.

4 thoughts on “Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?”

  1. As someone who has dedicated hours of personal research into understanding the differences between Unicode and UTF-8 in general, and with Sql Server in particular, your opening paragraphs are fantastically clear and well explained. Much appreciated. Working with federal data that is english only, we too have been plagued with the issue of 99% of strings are varchar but occassionally some esoteric hawaian characters not covered by the Latin collation show up. Here’s hoping the Sql Server team addresses the issues you found and get’s this right before final release.

    Liked by 1 person

  2. It’s worth mentioning that ISO has some conventions where their encodings, such as metric units, have to use the basic Latin alphabet and a set of punctuation marks, which are roughly UTF–8. Unicode is required to support this little subset in all of its languages, so that standard encodings can be used anywhere on earth.

    Like

    1. Hi Joe. I think you are misstating / misunderstanding what is going on here. I thought I had pointed out how the 1-byte range of UTF-8 (which is really only half of what 1 byte can represent) is exactly the same as 7-bit US-ASCII. It seems that I did not include that info before, but I have updated in a few places to be clearer about that association. That range, especially values 32 – 126 (see the chart just above the “Setting and Storing UTF-8 Values” section), are common across most code pages / character sets. That commonality across code pages is what made those particular characters become the only allowable characters across many standards and RFCs, many of which were published prior to the existence of UTF-8, which came out circa 1996. For example, the RFC for SMTP, published in 1982, was updated in 2008 and still refers to US-ASCII: RFC for SMTP (search for “ascii”).

      One of the design requirements of UTF-8 was that it work seemlessly on existing systems (UNIX in particular), without needing to convert any existing text files. This is why characters encoded using just the first 7 bits of UTF-8 match exactly to the 7-bit US-ASCII encoding. So it’s not that the subset is “roughly UTF-8”, since UTF-8 can encode the entire 1.1+ million character range of Unicode; it’s that a subset of UTF-8 is exactly standard ASCII.

      Also, Unicode doesn’t have languages; it has encodings. And there are only three (mainly): UTF-8, UTF-16, and UTF-32. But, due to endianness, both UTF-16 and UTF-32 have two different physical representations, making it 5 encodings in practice. BUT, it is only UTF-8 that has the ability to encode that subset of characters to be physically the same as 7-bit standard ASCII. For example, Latin Upper-case “A” is 0x65 in both US-ASCII and UTF-8. Yet it is either 0x0065 or 0x6500 in UTF-16, depending on endianness. While the non-zero byte of UTF-16 is the same as the only byte in UTF-8 and US-ASCII, it’s still 2 bytes vs 1 byte, and that makes a big difference when parsing a series of bytes.

      Finally, it is not the encodings that are standard (since there are 5 of them). It is the character set of Unicode that is truly standard, as it is the same across all 5 encodings. For example, “Upside-Down Face” is Code Point U+1F643 across all five encodings (see here – scroll down to the “Encodings” section).

      Like

Thoughts? Questions? Comments? Suggestions? Words of praise?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s