Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?


(last updated: 2020-01-16 @ 15:00 EST / 2020-01-16 @ 20:00 UTC )

(NOTE: This post was first published over a year ago, on 2018-09-28, and a lot has changed regarding this feature since that initial write-up. So, this post is going through a slow process of incremental changes to better reflect the current state of the feature.)

 

For some time now, many of us have struggled with data that is mostly standard US-English / ASCII characters but also needs to allow for the occasional non-ASCII character. VARCHAR is typically one byte per character but can’t represent a wide range of accented characters all at the same time (i.e. more than would be found on a single 8-bit Code Page). NVARCHAR, being a Unicode datatype, can represent all characters, but at a cost: each character is typically 2 bytes. When we have data that is most often standard ASCII, but has the potential (whether it ever happens or not) to contain non-ASCII characters (names, URLs, etc.), we have no choice but to use NVARCHAR. But then we are wasting space for all of the data that could fit into VARCHAR and take up half as much space. While some claim that “disk is cheap”, wasting space negatively impacts query performance, backup size, backup and restore times, etc.

One approach is to have two columns for the data, one for each datatype. The idea is to store values that are 100% ASCII in the VARCHAR column, and anything else in the NVARCHAR column. Then you can have a computed column return the non-NULL column for that data. While this approach does work, it is not something you are going to implement across 50 or more columns.
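For illustration, a minimal sketch of that dual-column approach might look something like the following (the table and column names here are just placeholders, not from any particular system):

CREATE TABLE dbo.Customers
(
    [CustomerID]   INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    -- populated only when the value is 100% standard ASCII
    [Name_Ascii]   VARCHAR(100) NULL,
    -- populated only when the value contains any non-ASCII character
    [Name_Unicode] NVARCHAR(100) NULL,
    -- exposes whichever of the two physical columns is populated for each row
    [Name]         AS (COALESCE([Name_Unicode], CONVERT(NVARCHAR(100), [Name_Ascii])))
);

The app (or a stored procedure / trigger) decides which of the two physical columns to populate, while readers only ever reference the computed [Name] column.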

So, thankfully, Microsoft provided a solution to this problem.

And it is…

(drum-roll please)…

SQLCLR (introduced in SQL Server 2005). This allows one to use .NET to access GZip/Deflate functionality such that you can compress data into VARBINARY(MAX) on the way in, and uncompress back to NVARCHAR(MAX) on the way out. And you can even get pre-done versions of these — Util_GZip and Util_GUnzip — in the Free version of SQL# :-). Of course, some folks don’t like SQLCLR (for some reason), you need to decompress in order to view / search / filter on the value, and this might be a bit much to do on many columns (though not as bad as having both VARCHAR and NVARCHAR versions of the column).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Data Compression (introduced in SQL Server 2008). But this was a bit simplistic, only available in Enterprise Edition, and didn’t work for off-row values (i.e. row overflow and MAX values that can’t fit in-row).

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Unicode Compression (introduced in SQL Server 2008 R2, and implemented as part of Data Compression). But this was still only available in Enterprise Edition, and still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Compression for all editions, not just Enterprise (as of SQL Server 2016 SP1). But, this still didn’t work for off-row values, and in-row NVARCHAR(MAX) values only get the simplistic compression.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

COMPRESS and DECOMPRESS built-in functions (introduced in SQL Server 2016). You can use these to Gzip and Gunzip values (without resorting to SQLCLR), and store the much smaller VARBINARY value. Do this in stored procedures to keep it transparent to the caller. But this is not for everyone, and it might be a bit much to do on many columns.
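For illustration, a minimal sketch of that pattern (the variable names are mine, just for the example):

DECLARE @Original NVARCHAR(MAX) =
            REPLICATE(CONVERT(NVARCHAR(MAX), N'mostly ASCII data '), 1000);

-- COMPRESS returns a Gzipped VARBINARY(MAX); that is what would get stored in the table
DECLARE @Compressed VARBINARY(MAX) = COMPRESS(@Original);

SELECT DATALENGTH(@Original)   AS [OriginalBytes],
       DATALENGTH(@Compressed) AS [CompressedBytes],
       -- DECOMPRESS returns VARBINARY(MAX), so convert back to the original type
       CONVERT(NVARCHAR(MAX), DECOMPRESS(@Compressed)) AS [RoundTripped];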

So, thankfully, Microsoft provided a solution to this problem.

And it is…

Support for MAX types in Clustered Columnstore Indexes (as of SQL Server 2017). But, this is not for every scenario, especially because it affects the whole table.

So, thankfully, Microsoft provided a solution to this problem.

And it is…

UTF-8 Encodings (introduced in the upcoming SQL Server 2019). UTF-8 appears to be the ideal solution: standard ASCII characters take up only 1 byte, all Unicode characters are supported, the value is still a string (not binary) so it can be searched / filtered on / indexed / etc, and it is the preferred encoding for HTML and XML files. No need for an extra column plus a computed column, no need for SQLCLR or COMPRESS, no binary values, and no getting frustrated that Unicode Compression doesn’t work with NVARCHAR(MAX). What’s not to love?

BUT WAIT! What if the UTF-8 collations weren’t added to solve this problem? What if they were added to solve a different problem, and saving space was just an added benefit? Would that change anything?

Let’s explore…

TL;DR: If you are not absolutely certain that you should be using the UTF-8 collations, then do not use them!

While interesting, the new UTF-8 Collations only truly solve a rather narrow problem (or possibly two narrow problems). For the purpose of compression, these collations really only make sense to use with NVARCHAR(MAX) data, if the values are mostly ASCII characters, and especially if the values are stored off-row. Otherwise, you are better off using Data Compression, or possibly Clustered Columnstore Indexes.
 

Then why does this feature even exist? Why devote development resources to it? What problem(s) does it solve? It wasn’t needed for importing or exporting data since that was resolved two versions ago (in SQL Server 2016, or possibly even SQL Server 2014 SP2). And I certainly hope it wasn’t to save space (compared to NVARCHAR / UTF-16) since that isn’t what UTF-8 was designed for (or does, generally speaking). The reason why UTF-8 (as an encoding, irrespective of SQL Server) was created was to address compatibility (with existing ASCII-based systems) 1, not efficiency (of space or speed) 2. In fact, compatibility was achieved at the expense of efficiency, given that most characters take up more space in UTF-8 than they do in UTF-16. But, that was an acceptable trade-off for its intended use: allowing existing OSes (primarily Unix but soon also Windows) to be Unicode-capable while not breaking any existing functionality. UTF-8 sometimes saving space over UTF-16 (but only for 128 characters, mind you) is a side-effect, not a design goal. Which means that it’s good to know of the option, but it shouldn’t be used as a selling point for this feature (yet so far it’s the main thing mentioned regarding this feature, in both documentation and presentations).
 

HOWEVER, info provided three months after CTP 2.0 was released indicates that the real purpose of the UTF-8 collations is actually compatibility after all. UTF-8 collations will allow apps working with VARCHAR columns, variables, and parameters to become Unicode-capable with relatively few changes to the app code. This would also benefit projects that work with other database systems that already support UTF-8 in VARCHAR columns, and without requiring the upper-case "N" prefix on literals. This is an understandable goal, and is at least in keeping with the purpose of UTF-8 (i.e. compatibility). But, there are still some difficulties / nuances to be aware of.
 

The major issues with the UTF-8 collations are:

  1. the primary problem they are trying to solve (i.e. enable Unicode in existing systems currently using VARCHAR without much ability to change to NVARCHAR) most likely affects only a small portion of customers,
  2. the secondary benefit (i.e. saving space) only happens under specific conditions; otherwise, it can cause the data to expand in size,
  3. there is typically a slight performance hit when using UTF-8 collations,
  4. extensive testing is required because only the first 128 characters are 1 byte each, so it will be easy to miscalculate max size needed for variables and columns, and not having enough space leads to silent truncation (i.e. data loss),
  5. none of the three potential down-sides (performance hit, data bloat, or data loss) are mentioned in the documentation,
  6. most people still don’t have a good understanding of how encodings and collations work (even Microsoft talks about it incorrectly, in different ways, in each of the three documents where UTF-8 support is mentioned!)

All of that together — very few instances where UTF-8 collations should be used, several potential consequences that are not documented anywhere outside of this blog post, misleading information from Microsoft, and a general lack of understanding of how these things actually work — indicates (to me, at least) that it is far more likely that the UTF-8 collations will be used when they shouldn’t be, rather than when they should be.
 

Considering that the probability of failure is higher than the probability of success, I believe the better approach would be:

  1. for compatibility: make the documentation explicitly clear about what the intended purpose of the UTF-8 collations is, and what the potential consequences can be. And stress that using them to reduce the space taken up by Unicode data might work, but requires a lot of testing, both for performance and to ensure that the edge cases that lead to minor truncation are not occurring.
  2. for compression: get Unicode Compression working for NVARCHAR(MAX). This wouldn’t be limited to only helping ASCII characters. I don’t know what technical issues are getting in the way of implementing Unicode Compression for NVARCHAR(MAX) , but given how much will break (or will at least be clunky) in making UTF-8 work, I have a hard time believing that this far better alternative would have been any worse.

This assessment is nothing against UTF-8. It’s certainly the best option in many cases, especially at the OS level where there’s no opportunity for Unicode Compression. Nor is this to be taken as a criticism of the SQL Server team responsible for producing this feature. I think they do great work, and I love working with SQL Server. But, I think this particular feature could have used more research and planning, and as is the case for many features, would benefit from a lot more technical guidance, since it is far from self-explanatory.

BOTTOM LINE: do not use the UTF-8 collations unless your project fits into one of these two scenarios:

  1. You have a lot of data that is mostly standard ASCII (values 0 – 127), is stored in NVARCHAR(MAX) columns, and the table is not a good candidate for having a Clustered Columnstore Index.
  2. You have either legacy code that you want to make Unicode-capable, or code that works with other database systems that support UTF-8, and you need to continue using VARCHAR datatypes and string literals that do not have an upper-case "N" prefix.

 


 

Basics

Documentation

Let’s start with what we are told about this new feature. According to the documentation, the new UTF-8 Collations:

  1. can be used …
    1. at the instance (i.e. server) -level, database-level, and column-level.
    2. in an expression via COLLATE.
    3. by appending "_UTF8" to the end of any Supplementary Character-Aware Collation (i.e. either having "_SC" in their name, OR being of level 140 or newer and not being binary)
    4. with only the CHAR and VARCHAR datatypes
  2. (implied) can be used with, yet have no effect on, NCHAR and NVARCHAR data (meaning: for these types, the UTF-8 Collations behave the same as their non-UTF-8 equivalents)
  3. “This feature may provide significant storage savings, depending on the character set in use.” (emphasis mine)

Installing

When installing SQL Server 2019, the default Collation (at least for systems with English as the OS-level language) is still "SQL_Latin1_General_CP1_CI_AS". Why, why, WHY???? Why not use "Latin1_General_100_CI_AS_SC", which has the same sensitivities but is newer, has more sort weights for many more characters, has more upper-case / lower-case mappings, supports Supplementary Characters, and is a Windows Collation (hence you don’t prevent seeks on indexed VARCHAR columns when comparing them to Unicode data)? Sure, changing the default Collation after all these years would result in some headaches when needing to work with existing systems (that didn’t bother to change the default Collation), but not changing it ensures that new projects started by folks who don’t know to change it will spread it farther and farther throughout the galaxy, tightening its stranglehold on us (i.e. making it even harder to move away from). But that has nothing to do with UTF-8, so we can ignore it for now ( <grrrrr> , but please vote for Make Latin1_General_(100)_CI_AS the default collation for US English).

On the "Customize Database Engine Collation" screen, when selecting a Windows Collation:

  1. Collations without a version number (implied version "80") do not allow checking either the "Supplementary characters" or "Variation selector-sensitive" options.
  2. Collations with version numbers "90" or "100" allow for checking the "Supplementary characters" option, but not the "Variation selector-sensitive" option.
  3. Collations with version number "140" have the "Supplementary characters" option checked (and cannot be unchecked), and allow for checking the "Variation selector-sensitive" option.
  4. The UTF-8 option:
    1. can be selected for Windows collations that have "_90" (or higher) in the name, are not binary, and "Supplementary characters" has been checked.
    2. is automatically selected along with "Supplementary characters", and neither can be unselected, for Unicode-only collations ("Windows Code Page" will have a value of "0"). This option is required for Unicode-only collations (when used at the instance / server level) as the non-UTF8, Unicode-only collations do not allow VARCHAR data.

Using

By executing the queries below, we can confirm that adding _UTF8 to the end of Supplementary Character-Aware Collations does produce valid Collation names. Notice how the version 90 and 100 collations also require the "_SC" flag, since unlike the version 140 (and someday higher) collations, they do not implicitly support Supplementary Characters (such support is a requirement for using the UTF-8 collations):

-- Version 90 (SQL 2005; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Japanese_90_CS_AS_SC_UTF8;

-- Version 100 (SQL 2008; _SC version of it started with SQL Server 2012)
SELECT 'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Version 140 (SQL 2017; no _SC version since it is implicitly supported)
SELECT 'a' COLLATE Japanese_XJIS_140_CS_AS_UTF8;

All of the above queries returned successfully.

 

The following query will list all of the UTF-8 Collations:

SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]UTF8'
ORDER BY col.[name];
-- 1553 rows, all with CodePage = 65001

OR, possibly more directly, since UTF-8 is Windows Code Page 65001:

SELECT col.[name],
       COLLATIONPROPERTY(col.[name], 'Version') AS [Version]
FROM   sys.fn_helpcollations() col
WHERE  COLLATIONPROPERTY(col.[name], 'CodePage') = 65001 -- UTF-8
ORDER BY col.[name];
-- 1553 rows

 

There are 5508 total collations in SQL Server 2019, which is 1553 more than the 3955 that exist in SQL Server 2017. Hence, the only new collations are these "_UTF8" collations.

(Note: when I first published this post, there were only 5507 total collations in SQL Server 2019, 1552 of them being UTF-8 collations. This was due to there initially being no binary UTF-8 collation. More details in the Binary Collation section below.)

The following query shows that we can indeed use VARCHAR to store Unicode characters, but only when using one of the new UTF-8 Collations:

SELECT NCHAR(27581), -- 殽
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC), -- ?
       CONVERT(VARCHAR(3), NCHAR(27581)
                           COLLATE Latin1_General_100_CI_AS_SC_UTF8); -- 殽

The following query confirms that the new UTF-8 Collations cannot be used with the TEXT datatype:

SELECT CONVERT(TEXT, N'a' COLLATE Latin1_General_100_CI_AS_SC_UTF8);

returns:

Msg 4189, Level 16, State 0, Line XXXXX
Cannot convert to text/ntext or collate to 'Latin1_General_100_CI_AS_SC_UTF8' because these legacy LOB types do not support UTF-8 or UTF-16 encodings. Use types varchar(max), nvarchar(max) or a collation which does not have the _SC or _UTF8 flags.

Of course, the last part of that error message is misstated; it is ignoring the version 140 (and someday higher) collations, all of which fully support UTF-16 yet do not have "_SC" in their names. The following query:

SELECT CONVERT(TEXT, N'a' COLLATE Japanese_XJIS_140_CI_AS);

returns the same "Msg 4189" error, yet the collation name does not have either the "_SC" or "_UTF8" flags.

The following query shows that the NVARCHAR value is the same between the UTF-8 and non-UTF-8 Collation, and the same character can be held in VARCHAR, but only when using a UTF-8 Collation, and as a different underlying byte sequence:

DECLARE @Sample NVARCHAR(1) = NCHAR(0x5010);
SELECT @Sample; -- 倐

SELECT CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- 0x1050
       CONVERT(VARBINARY(2), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 0x1050

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC), -- ?

       CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8), -- 倐
       CONVERT(VARBINARY(2), CONVERT(VARCHAR(10), @Sample
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)); -- 0xE580;

And, the following demonstrates that a UTF-8 Collation can be used at the database level:

CREATE DATABASE [UTF8] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

ALTER DATABASE [UTF8] SET RECOVERY SIMPLE;

USE [UTF8];

The USE statement is there because the remaining examples that follow in this post will be executed from within the "UTF8" database.


 

General UTF-8 Info

UTF-16 (for context)

UTF-16, which is the Unicode encoding used for NVARCHAR data (including NCHAR, NTEXT and string data within the XML datatype), uses 16-bit (i.e. 2-byte) values, known as “code units”, to represent characters. These “code units” are used individually (a 2-byte code unit) for BMP characters, or in pairs (2 two-byte code units) for Supplementary Characters, to map to individual “code points”. UCS-2, which only maps the BMP characters (i.e. the first 65,536 code points), is a fixed-length encoding because it only deals with single code units. UTF-16, which uses the exact same code units as UCS-2, can use certain combinations of two of those code units (called Surrogate Pairs) to map the non-BMP characters (i.e. code points 65,537 and above, known as Supplementary Characters).

Because these values are two bytes, they are subject to “endianness”, which refers to the order in which the individual bytes that make up the value are stored internally (in memory, on disk, etc). The hardware determines the “endianness” of how multi-byte values (including INT, SMALLINT, BIGINT, etc) are stored, and the two possibilities are “Big Endian”, which is the given / natural ordering (i.e. value 0x1234 — where 0x12 is byte #1, and 0x34 is byte #2 — is stored as 0x1234), or “Little Endian”, which refers to storing the bytes in reverse-order (i.e. value 0x1234 is stored as 0x3412).

Unicode strings in Windows, “string” in .NET, and NVARCHAR data in SQL Server, are all UTF-16 Little Endian.
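You can see the Little Endian byte order for yourself with a quick query (this works in any version of SQL Server; no UTF-8 collation is required, and the two characters chosen here are arbitrary):

SELECT CONVERT(VARBINARY(2), N'A')          AS [U+0041], -- 0x4100 (low byte stored first)
       CONVERT(VARBINARY(2), NCHAR(0x20AC)) AS [U+20AC]; -- 0xAC20 (low byte stored first)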

UTF-8

UTF-8 is another Unicode encoding which uses between one and four 8-bit (i.e. 1-byte) code units to represent characters. Even though it is Unicode, due to working 8-bits at a time, this encoding is considered VARCHAR data. This is also a variable-length encoding because it requires anywhere from 1 to 4 code units to map to any given code point. Due to these code units being single bytes (even in combinations of 2, 3, or 4), there is no concept of “endianness”.

One main advantage of UTF-8 is that it’s fully compatible with any existing data that’s comprised entirely of 7-bit ASCII characters (values 0 – 127). This is by design, but not really a reason for SQL Server to use it for storing data internally. The other main advantage of UTF-8 is that it can take up half the space while still allowing for the full range of Unicode characters. This would be a reason for SQL Server to use it internally. BUT, it only takes up half the space for a small subset of characters. Of course, if the characters in this small subset (the ones that are only 1 byte) make up the vast majority of the characters being used, then this encoding does provide significant space savings (in terms of disk space, memory, and reduced network traffic). What people tend to overlook, though, is that most Unicode characters require 3 bytes, which is 1 more byte than UTF-16 for those same characters.

You can execute the following query in SQL Server 2019 to see how many bytes each character requires for both UTF-8 and UTF-16 encodings. It returns all 65,536 BMP (Basic Multilingual Plane) characters (which is also the entire UCS-2 character set), and 3 Supplementary Characters. Since all Supplementary Characters are 4 bytes in both encodings, there is no need to return more of them, but we do need to see a few of them to see that they are a) all 4 bytes, and b) encoded slightly differently. The query will work the same regardless of whether or not the current database uses a "_UTF8" Collation. It also does not matter if the database’s default Collation is Supplementary Character-Aware or not.

;WITH nums([CodePoint]) AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY (SELECT 0)) - 1)
  FROM   [master].[sys].[columns] col
  CROSS JOIN [master].[sys].[objects] obj
), chars AS
(
  SELECT nums.[CodePoint],
         CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(nums.[CodePoint])
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   nums
  UNION ALL
  SELECT tmp.val,
         CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5), tmp.hex)
                    COLLATE Latin1_General_100_CI_AS_SC_UTF8) AS [TheChar],
         CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), CONVERT(NVARCHAR(5),
              tmp.hex) COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [UTF8]
  FROM   (VALUES (65536, 0x00D800DC), -- Linear B Syllable B008 A (U+10000)
                 (67618, 0x02D822DC), -- Cypriot Syllable Pu (U+10822)
                 (129384,0x3ED868DD)  -- Pretzel (U+1F968)
         ) tmp(val, hex)
)
SELECT chr.[CodePoint],
       COALESCE(chr.[TheChar], N'TOTALS:') AS [Character],
       chr.[UTF8] AS [UTF8_Hex],
       DATALENGTH(chr.[UTF8]) AS [UTF8_Bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 1 THEN 'x' END) AS [1-byte],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 2 THEN 'x' END) AS [2-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 3 THEN 'x' END) AS [3-bytes],
       COUNT(CASE DATALENGTH(chr.[UTF8]) WHEN 4 THEN 'x' END) AS [4-bytes],
      ---
      CONVERT(VARBINARY(4), CONVERT(NVARCHAR(3), chr.[TheChar]))
         AS [UTF16(LE)_Hex],
      DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar])) AS [UTF16_Bytes],
      ---
      ((DATALENGTH(CONVERT(NVARCHAR(3), chr.[TheChar]))) -
            (DATALENGTH(chr.[TheChar]))) AS [UTF8savingsOverUTF16]
FROM   chars chr
GROUP BY ROLLUP ((chr.[CodePoint], chr.[TheChar], chr.[UTF8]));

There isn’t enough room here to show all 65,538 rows, so here’s a summary of the info:

Code Points     Code Point Quantity   UTF-8 Size (bytes)   UTF-16 Size (bytes)   UTF-8 Savings (bytes per char)
0 – 127         128                   1                    2                     1
128 – 2047      1920                  2                    2                     0
2048 – 65535    63488                 3                    2                     -1
65536 and up    lots                  4                    4                     0

As you can see, only the first 128 Code Points offer any size savings over UTF-16. The characters in that range are (minus the first 32 which are control characters, and the last character):

[space] ! " # $ % & ' ( ) * + , - . / 0 – 9 :
; < = > ? @ A – Z [ \ ] ^ _ ` a – z { | } ~

Those characters, and their values (32 – 126), are exactly the same as the standard ASCII character set, and are the same across many code pages (certainly all code pages supported by SQL Server). Any characters not in the chart above do not offer any space savings, with most of the BMP characters actually taking up more space than they would if using UTF-16 (i.e. NVARCHAR ).
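You can verify this yourself with a few quick DATALENGTH checks (the characters here are just examples, and the explicit COLLATE clauses mean the current database’s default collation doesn’t matter):

SELECT DATALENGTH(CONVERT(VARCHAR(5), N'e'
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [e_UTF8],       -- 1 byte
       DATALENGTH(CONVERT(NVARCHAR(5), N'e'))        AS [e_UTF16],      -- 2 bytes
       DATALENGTH(CONVERT(VARCHAR(5), N'é'
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [eAcute_UTF8],  -- 2 bytes
       DATALENGTH(CONVERT(NVARCHAR(5), N'é'))        AS [eAcute_UTF16], -- 2 bytes
       DATALENGTH(CONVERT(VARCHAR(5), NCHAR(0x20AC)
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [euro_UTF8],    -- 3 bytes
       DATALENGTH(CONVERT(NVARCHAR(5), NCHAR(0x20AC))) AS [euro_UTF16]; -- 2 bytes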

Byte Sequences

The chart below shows the possible byte sequences for various ranges of code points when encoding as UTF-8. As you can see, only certain bytes are ever used as a starting byte. And, the starting byte even indicates the possible number and ranges for any possible additional bytes. Various points of interest are noted below the chart.

Code Point Range      Byte #1   Byte #2   Byte #3   Byte #4
U+0000 – U+007F       00 – 7F

U+0080 – U+07FF       C2 – DF   80 – BF

U+0800 – U+0FFF       E0        A0 – BF   80 – BF
U+1000 – U+CFFF       E1 – EC   80 – BF   80 – BF
U+D000 – U+D7FF       ED        80 – 9F   80 – BF
U+E000 – U+FFFF       EE – EF   80 – BF   80 – BF

U+10000 – U+3FFFF     F0        90 – BF   80 – BF   80 – BF
U+40000 – U+FFFFF     F1 – F3   80 – BF   80 – BF   80 – BF
U+100000 – U+10FFFF   F4        80 – 8F   80 – BF   80 – BF

Points of interest regarding the various UTF-8 byte sequences:

  1. Bytes 0x00 – 0x7F are standard ASCII and have no additional bytes.
  2. Bytes 0xC2 – 0xF4 are starting bytes for multi-byte sequences and are never used as additional bytes.
  3. Bytes 0x80 – 0xBF are only ever used as additional bytes.
  4. Bytes 0xC0, 0xC1, 0xF5 – 0xFF are never used.
  5. All two-byte code points have a starting byte in the 0xC2 – 0xDF range.
  6. All three-byte code points have a starting byte in the 0xE0 – 0xEF range.
  7. All four-byte code points (i.e. Supplementary Characters) have a starting byte in the 0xF0 – 0xF4 range.
  8. Bytes 3 and 4, when used, are always in the 0x80 – 0xBF range.
  9. Byte 2 is the only byte that sometimes deviates from the 0x80 – 0xBF range, but when it does, it is always a subset of that range.
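
A few entries in the chart above can be spot-checked directly by converting characters to a UTF-8 VARCHAR and then viewing the resulting bytes (the particular characters chosen here are arbitrary examples of 1-, 2-, 3-, and 4-byte sequences):

SELECT CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(0x0041)
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [1-byte],  -- 0x41
       CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(0x00E9)
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [2-bytes], -- 0xC3A9
       CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), NCHAR(0x20AC)
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [3-bytes], -- 0xE282AC
       CONVERT(VARBINARY(4), CONVERT(VARCHAR(4), (NCHAR(0xD83C) + NCHAR(0xDF2D))
          COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [4-bytes]; -- 0xF09F8CAD

In each case the first byte falls into the expected starting-byte range for that sequence length, and all additional bytes fall into the 0x80 – 0xBF range.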

Setting and Storing UTF-8 Values

An encoding is how the characters are represented internally. Whether the source is a non-Unicode string, a Unicode string, or a binary string (i.e. series of bytes / hex bytes), the end result will be the chosen encoding’s value for that character.

First, let’s try setting a variable, in the "[UTF8]" database, from a VARBINARY literal:

DECLARE @HotDog VARCHAR(4) = 0xF09F8CAD;
PRINT @HotDog; -- ( U+1F32D )

Returns:   🌭 (in the "Messages" tab)

As you can see, it produces the expected character from the given 4 bytes. However, those same 4 bytes do not produce the correct character when stored as NVARCHAR, which expects UTF-16 Little Endian byte sequences:

DECLARE @NotHotDog NVARCHAR(4) = 0xF09F8CAD;
PRINT @NotHotDog; -- ( U+1F32D )
-- 鿰권

Now let’s look at how the size of the (N)VARCHAR types impacts what can be stored in them:

SELECT CONVERT( VARCHAR(3), 0xF09F8CAD) AS [TooSmallForUTF8-HotDog],
     DATALENGTH(CONVERT(VARCHAR(3), 0xF09F8CAD)) AS [ByteCountForTooSmall],
       CONVERT( VARCHAR(4), 0xF09F8CAD) AS [UTF8-HotDog],
       DATALENGTH(CONVERT(VARCHAR(4), 0xF09F8CAD)) AS [ByteCountForHotDog],
       ------
       CONVERT(NVARCHAR(1), 0x3CD82DDF) AS [TooSmallForUTF16-HotDog],
       CONVERT(NVARCHAR(2), 0x3CD82DDF) AS [UTF16-HotDog];

Returns:

TooSmallForUTF8-HotDog  ByteCountForTooSmall  UTF8-HotDog  ByteCountForHotDog  TooSmallForUTF16-HotDog  UTF16-HotDog
                        0                     🌭           4                   �                        🌭

The first column appears to return nothing, not even a partial, incorrect / unknown character. The second column confirms that there is absolutely nothing resulting from attempting to store a 4-byte UTF-8 value in a VARCHAR type that is only 3 bytes. It couldn’t even take just the first 3 bytes of the 4-byte value. The third and fourth columns show that giving the type the required 4 bytes allows everything to work as expected.

On the other hand, when attempting to store the same character, using the UTF-16 4-byte sequence, into an NVARCHAR type that is too small, the result is the default Replacement Character ( U+FFFD ). This is because it was able to store the first half of the Surrogate Pair (the 0x3CD8), which is meaningless on its own, but it is still a valid byte sequence.
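You can see that lone first half by looking at the bytes of the too-small NVARCHAR conversion:

SELECT CONVERT(VARBINARY(2), CONVERT(NVARCHAR(1), 0x3CD82DDF)) AS [FirstHalfOfSurrogatePair]; -- 0x3CD8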


 

Size and Performance

{ This section is taking a while as I figure out how to best present the variety of tests and their results, but the general info is all there. }

Data Profiles

  1. Some ASCII (values 1 – 254), but mostly non-ASCII
    • [N]VARCHAR(500)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
  2. Some ASCII MAX
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 1; AVG = 14; MAX = 30
       
  3. All ASCII
    • [N]VARCHAR(500)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
  4. All ASCII MAX (all in row)
    • [N]VARCHAR(MAX)
    • 105,390 rows
    • Characters per row: MIN = 6; AVG = 24; MAX = 60
       
  5. All ASCII MAX (10k characters per row; all off-row)
    • [N]VARCHAR(MAX)
    • 10,539 rows
    • Characters per row: MIN = 10,007; AVG = 10,025; MAX = 10,061
  6. All ASCII MAX (5k characters per row; mixed in / off -row; UTF-8 = in-row, NVARCHAR = off-row)
    • [N]VARCHAR(MAX)
    • 21,078 rows
    • Characters per row: MIN = 5,007; AVG = 5,025; MAX = 5,061

Tests:

These were executed in a database which has a "_UTF8" default Collation.

SELECT REVERSE([Value])
FROM dbo.[{table}];


SELECT SUBSTRING([Value], 5, 13)
FROM dbo.[{table}];


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] = 'abcdef12345'; -- UTF16 test uses N'abcdef12345'


SELECT [Value]
FROM dbo.[{table}]
WHERE [Value] LIKE '%123%'; -- UTF16 test uses N'%123%'


SELECT MIN(LEN([Value])), AVG(LEN([Value])), MAX(LEN([Value]))
FROM   dbo.[{table}];


SELECT [Value]
FROM   dbo.[{table}];

Each test was wrapped in:

SET STATISTICS TIME ON;
...{test}...
SET STATISTICS TIME OFF;

Results

  1. Some ASCII in non-MAX types
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated better as well, but not tested.
    • Did not test performance given no space savings
  2. Some ASCII in MAX types (in-row)
    • NVARCHAR is smaller here when Compression = NONE, ROW, or PAGE. COLUMNSTORE estimated the same, but not tested.
    • Did not test performance given no space savings
       
  3. All ASCII in non-MAX types
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR)
    • UTF-8 is only slightly smaller here when Compression = ROW or PAGE (approx. 1% smaller)
    • NVARCHAR is slightly smaller here when table is COLUMNSTORE (approx. 1% smaller)
    • Performance of NVARCHAR is same or better than UTF-8
  4. All ASCII in MAX types (in-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 37% smaller than NVARCHAR; 40% for ROW and PAGE)
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 9% smaller)
    • Performance is sometimes the same, but NVARCHAR is usually better than UTF-8
  5. All ASCII in MAX types (10k; off-row)
    • UTF-8 is smaller here when Compression = NONE (approx. 46% smaller)
    • ROW and PAGE compression do not apply to off-row data
    • UTF-8 is only slightly smaller here when table is CLUSTERED COLUMNSTORE (approx. 1.3% smaller). The NVARCHAR table is still 98% smaller than it was originally.
    • With no compression, NVARCHAR is faster for all tests.
    • When table is CLUSTERED COLUMNSTORE, NVARCHAR is faster for all tests except "SUBSTRING" and "=".
  6. All ASCII in MAX types (5k; mixed on / off-row: VARCHAR = on, NVARCHAR = off)
    • UTF-8 is smaller here when Compression = NONE, ROW, or PAGE (approx. 25% smaller)
    • UTF-8 is much smaller here when table is CLUSTERED COLUMNSTORE (approx. 92% smaller). BUT, this is typically a non-issue since the NVARCHAR table is still 96% smaller than it was originally. This becomes an issue as the uncompressed data gets closer to 100 GB.
    • With no compression, NVARCHAR is faster for all tests except "LIKE '%x%'".
    • When table is CLUSTERED COLUMNSTORE, UTF-8 is faster for all tests except “REVERSE” (elapsed time was the same, but CPU was better)

Typically there is at least a slight performance hit when using UTF-8. I believe the issue is due to SQL Server still using UTF-16 to process most (but not all) operations internally. This means that there is a lot of converting back and forth between UTF-8 and UTF-16.

The area of biggest improvement of UTF-8 over NVARCHAR is when there is between 4001 and 8000 bytes of mostly standard ASCII characters. In this scenario, you would need to use NVARCHAR(MAX) to get over 4000 bytes, but then you lose the ability to use Unicode Compression. And, being over 4000 bytes, the NVARCHAR data must be stored off-row (a slight performance hit), while that same data can be stored in-row when using UTF-8. This is the only scenario where UTF-8 is better for both space and performance.
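A rough sketch of how you might see that particular difference for yourself (the table names are hypothetical, and this is only a quick illustration in a scratch database, not a full test harness):

CREATE TABLE dbo.Utf16_5k ([Value] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC);
CREATE TABLE dbo.Utf8_5k  ([Value] VARCHAR(MAX)  COLLATE Latin1_General_100_CI_AS_SC_UTF8);

-- 5000 ASCII characters: 10,000 bytes as UTF-16 (forced off-row), 5,000 bytes as UTF-8 (fits in-row)
INSERT INTO dbo.Utf16_5k ([Value]) VALUES (REPLICATE(CONVERT(NVARCHAR(MAX), N'x'), 5000));
INSERT INTO dbo.Utf8_5k  ([Value]) VALUES (REPLICATE(CONVERT(VARCHAR(MAX), 'x'), 5000));

SELECT OBJECT_NAME([object_id]) AS [Table],
       [alloc_unit_type_desc],
       [page_count]
FROM   sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED')
WHERE  [object_id] IN (OBJECT_ID(N'dbo.Utf16_5k'), OBJECT_ID(N'dbo.Utf8_5k'));

The NVARCHAR table should show its pages in the LOB_DATA allocation unit, while the UTF-8 table should keep the values in IN_ROW_DATA.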



 

Real Purpose

On December 18th, 2018, new information was provided about the UTF-8 collations, and included a reason for this new feature that had not been mentioned before:

scenarios where legacy applications require internationalization and use inline queries: the amount of changes and testing involved to convert an application and underlying database to UTF-16 can be costly

What that means is, existing applications using VARCHAR data will have that data stored in VARCHAR columns, use VARCHAR variables and parameters, and will have string literals that do not have an upper-case "N" prefix. Supporting Unicode would previously require changing the datatypes of columns, variables, and parameters to NVARCHAR, and string literals would need to be prefixed with an upper-case "N". It could take a significant amount of time to make and test those changes. By adding the UTF-8 collations, it will be possible to skip most of that work. Ok, sounds reasonable.
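As a minimal (hypothetical) illustration of that scenario, assuming the current database has a UTF-8 default collation (such as the [UTF8] database created earlier): the INSERT and SELECT below represent "legacy" code that does not change at all, yet the non-ASCII characters survive because the VARCHAR column is now UTF-8:

CREATE TABLE dbo.LegacyComments ([Comment] VARCHAR(100)); -- inherits the DB's UTF-8 collation

-- existing app code: VARCHAR column, string literal with no upper-case "N" prefix
INSERT INTO dbo.LegacyComments ([Comment]) VALUES ('Привет, мир');

SELECT [Comment] FROM dbo.LegacyComments; -- Привет, мир (not "??????, ???")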

Of course, what is not being said is that there is still a significant amount of work entailed in changing the collations of ALL string columns, especially where indexes, foreign keys, defaults, check constraints, computed columns, etc are involved. In those cases you need to drop the dependent object(s) so that the column can be updated with the new collation. While there is an undocumented short-cut to all of that work, there is still A LOT of testing to do. I’m sure some people (maybe many?) will test with only their existing data and test cases, and that will be a huge mistake. Only the first 128 characters take up a single byte. But, the whole point of switching to UTF-8 is to support Unicode, so we need to assume that there will be characters being stored that were not available in the one code page that was previously supported. Not only will all of those new characters take up more than 1 byte, but even the second 128 characters from the original code page (which previously only required a single byte) will now require either 2 or 3 bytes (please see Item 2 under “Things to Keep in Mind: Operational” below). This means that a VARCHAR(20) column for a first or last name is no longer guaranteed to always store up to 20 characters. As UTF-8, it will now store anywhere from 5 to 20 characters, depending on how many bytes each character requires.
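For example, assuming the current database has a UTF-8 default collation, the following sketch (variable names are just for illustration) shows a 20-character Greek value being silently cut in half when assigned to a VARCHAR(20):

DECLARE @Source NVARCHAR(20) = REPLICATE(NCHAR(0x03B1), 20); -- 20 × 'α' (2 bytes each in UTF-8)
DECLARE @Name   VARCHAR(20)  = @Source;                      -- inherits the DB's UTF-8 collation

SELECT LEN(@Source)      AS [SourceCharacters], -- 20
       LEN(@Name)        AS [StoredCharacters], -- 10 (silently truncated)
       DATALENGTH(@Name) AS [StoredBytes];      -- 20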

Current Assessment

Knowing that the main reason for introducing the UTF-8 collations is compatibility certainly helps. This is a great option for those who truly need it. Unfortunately, it’s quite likely that this feature will be misused more often than it is used in a way that provides an actual benefit to the project. I say this because a lot of people have heard about UTF-8, but not that many fully understand it (the topic of collations, in general, is confusing to most people). The Microsoft documentation is a perfect example: the two official documentation pages don’t even mention compatibility (it’s only mentioned in a blog post), and what is said about the potential for saving space (in two places) is misleading. For example, the blog post states:

if your dataset uses primarily Latin characters, significant storage savings may also be achieved as compared to UTF-16 data types

That statement is inaccurate: it’s only the standard ASCII characters that provide any space savings. Yes, those characters are Latin characters, but they’re not the only Latin characters. There are 128 other characters on most code pages (a few have more), and they are all 2 or 3 bytes each in UTF-8 (even the ones on code page 1252, which is Latin1_General). You can see for yourself using the example code below:

-- DROP TABLE #Space;
CREATE TABLE #Space
(
  [CodePoint] TINYINT NOT NULL,
  [CP1252] VARCHAR(5) COLLATE Latin1_General_100_CI_AS_SC NOT NULL,
  [UTF8] VARCHAR(5) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
  [UTF8bytes] AS (DATALENGTH([UTF8]))
);

;WITH cte AS
(
  SELECT TOP(128) ROW_NUMBER() OVER (ORDER BY @@SPID) + 127 AS [num]
  FROM   sys.all_columns
)
INSERT INTO #Space ([CodePoint], [CP1252])
  SELECT cte.[num], CONVERT(BINARY(1), cte.[num])
  FROM   cte;


-- SELECT * FROM #Space;

UPDATE #Space
SET [UTF8] = [CP1252];

SELECT *
FROM #Space
ORDER BY [UTF8bytes], [CodePoint];
-- 111 are 2 bytes
--  17 are 3 bytes

Of course, if your data is all standard ASCII then you will get the full extent of the space savings (often at the cost of performance), but that statement implies that it is all of code page 1252 that is eligible for the space savings, and that just isn’t true.

Then, the “What’s New in SQL Server 2019” page states:

This feature may provide significant storage savings, depending on the character set in use.

“Character set” is synonymous with “code page”. However, which code page is being used has nothing to do with the potential for saving space (when moving to UTF-8) since the only characters that are a single byte are the same 128 characters that are the same across all code pages (at least all code pages available in Windows / SQL Server).

THE POINT IS: if Microsoft cannot communicate why the UTF-8 collations exist, what they do, when they should and should not be used, etc, then what chance do most users have of figuring out if they should use them or not?

Given the great potential for misusing this feature, I would have preferred there to have been more time spent in planning and documenting before releasing these new collations. And, either don’t mention the space-saving potential of UTF-8, or at least mention the restrictions and consequences of using them. And of course, handle the saving-space issue as an entirely separate project.


 

In-memory OLTP

The documentation states that the "_UTF8" Collations do not work with In-memory OLTP. We can easily confirm that via the following tests.

This:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);

and this:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1)
        NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC -- NOT UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);
-- Success

ALTER TABLE dbo.[UTF8]
    ALTER COLUMN [Name]
    VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL;

both get the following error:

Msg 12356, Level 16, State 157, Line XXXXX
Comparison, sorting, and manipulation of character strings that use a UTF-8 collation is not supported with memory optimized tables.

And this statement (same as the first CREATE statement directly above, but including an INDEX on the [Name] column):

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL INDEX [IX_UTF8_Name] NONCLUSTERED
)
WITH (MEMORY_OPTIMIZED = ON);

gets the following error:

Msg 12357, Level 16, State 158, Line XXXXX
Indexes on character columns that use a UTF8 collation are not supported with indexes on memory optimized tables.

And, attempting to change a Database’s default Collation to be a "_UTF8" Collation will error:

ALTER DATABASE [Hekaton] COLLATE Latin1_General_100_CI_AS_SC_UTF8;

receives:

Msg 41335, Level 16, State 3, Line XXXXX
Modifying the collation of a database is not allowed when the database contains memory optimized tables or natively compiled modules.

(that error message is not entirely accurate since you can change the Collation to a non-UTF-8 Collation, though you still might need to drop natively compiled modules first.)

HOWEVER, using the "sqlservr -q" method of changing all Collations for an Instance allows us to force a UTF-8 Collation on a Database containing memory-optimized tables, and even on columns in those memory-optimized tables:

sqlservr.exe -c -m -T4022 -T3659 -s"{instance_name}" ^
-q"Latin1_General_100_CI_AS_SC_UTF8"

After this (unsupported) change, the Database generally works, but there are a few definite issues:

  1. Data corruption in VARCHAR columns for characters with values 128 and above (i.e. outside of the standard ASCII set that ends at value 127). This affects "durable" memory-optimized tables (i.e. DURABILITY = SCHEMA_AND_DATA). For example, if the original Collation is Korean_100_CI_AS, then the byte sequence of "0xB15A" will produce "켣". If you then use this method to switch to any "_UTF8" Collation, you will end up seeing "�Z" because "0xB1" is an invalid byte sequence in UTF-8, and "0x5A" is valid and is the "Z" character.
  2. Indexes on memory-optimized tables will not be dropped and recreated, and so in many cases they will be out of order (i.e. in the wrong order). For example, switching from Latin1_General_100_CI_AS_SC to Latin1_General_100_CI_AS_SC_UTF8 (only difference is adding "_UTF8" to the end of the original Collation) would cause indexes to become out of order, but only if they have key columns that are VARCHAR with data containing characters of values 128 or higher. (INCLUDE columns that are VARCHAR will have corrupted data for characters of values 128 or higher, but this does not affect the order.)
  3. Related to out-of-order indexes: a simple SELECT works, even when adding in ORDER BY or GROUP BY. Adding a WHERE clause will also work, as long as no new rows have been added. Once a new row has been added, using a WHERE clause referencing an indexed string column will result in the following error:
    Msg 9100, Level 21, State 2, Line XXXX
    Possible index corruption detected. Run DBCC CHECKDB.

    And you will be disconnected. OR, after adding a row, you might be able to successfully use a WHERE clause, but the first time you use the ORDER BY clause you will get the following error:
    Msg 701, Level 17, State 153, Line XXXXX
    There is insufficient system memory in resource pool 'default' to run this query.

    After that, you will probably get the “Possible index corruption detected” error, even when using just the WHERE clause.

Keep in mind that indexes on memory-optimized tables cannot be rebuilt or even dropped. Fixing these issues requires dropping, recreating, and reloading any affected tables. Again, this (using UTF-8 Collations with memory-optimized tables) is not a supported configuration.

Of course, using a UTF-8 collation with an NVARCHAR column does work in a Memory-Optimized table:

CREATE TABLE dbo.UTF8
(
    [ID] INT IDENTITY(1, 1) NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 400),
    [Name] NVARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON);

This makes sense as NVARCHAR is always UTF-16 regardless of the collation.

Memory-Optimized TempDB Metadata

Another really awesome feature introduced in SQL Server 2019 is the ability to use In-Memory OLTP for TempDB metadata. Doing this should greatly reduce contention on high-volume temp object creation / deletion. For full details, please see the following documentation: Memory-Optimized TempDB Metadata.

Now, the limitations noted in the documentation never mention whether the instance-level collation being a UTF-8 collation (as that determines the collation of [tempdb] and its metadata) is a problem or not. Yet, we know that we cannot use UTF-8 collations in Memory-Optimized tables. So, if we are using a UTF-8 collation at the instance (i.e. server) level, can we enable this super cool [tempdb] turbo button? Let’s find out!

We can check to see if the feature is enabled via the following query:

SELECT SERVERPROPERTY('IsTempdbMetadataMemoryOptimized');

If that returns a 0, then the feature is not enabled, and we can enable it via the following query (and a restart of the instance):

ALTER SERVER CONFIGURATION SET MEMORY_OPTIMIZED TEMPDB_METADATA = ON;

This change requires that the SQL Server instance service be restarted. Upon restart, you can execute that check query again to verify that the option has been enabled.

Using SQL Server 2019 RTM and an instance-level collation of Latin1_General_100_BIN2_UTF8, I was able to enable this feature and create a temp table (one with a VARCHAR column using a UTF-8 collation):

-- DROP TABLE #TempTableTest;
CREATE TABLE #TempTableTest ([Col1] VARCHAR(10)
                                COLLATE Latin1_General_100_CI_AS_SC_UTF8);
INSERT INTO #TempTableTest VALUES ('hello');
INSERT INTO #TempTableTest VALUES (NCHAR(0x20BE));
INSERT INTO #TempTableTest VALUES (NCHAR(0xD808) + NCHAR(0xDD10));
SELECT * FROM #TempTableTest;
/*
hello
₾
𒄐
*/

As you can see in the test above, both system data (the table itself) and the user data (data in the table) work just fine with UTF-8 collations at both levels and the “Memory-Optimized TempDB Metadata” feature enabled.

This also makes sense as the system meta-data should all be NVARCHAR and hence not impacted by the 8-bit encoding associated with any collation, UTF-8 or otherwise.


 

Binary Collation for UTF-8

CTP 2.0 did not come with a binary UTF-8 collation. Why might we need / want one? Well, here are some reasons:

  1. Microsoft’s implementation of Unicode has incomplete, and some incorrect, sort weights. One example:
    Choosing a binary collation that can differentiate between ‘ss’ and ‘ß’ for nvarchar column in Sql Server
  2. When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from the following post provides a working example: No, Binary Collations are not Case-Sensitive
  3. When storing alphanumeric codes (e.g. airline confirmation codes, phone numbers, postal codes, etc), binary sorting and comparison is much faster since it ignores all linguistic rules, which is perfect for scenarios such as these where there is no expectation of applying such rules.
  4. Finding / removing the "(null)" character U+0000 / CHAR(0) (see the quick sketch just after this list)
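
For example, a quick sketch of a couple of those scenarios (assuming a build where Latin1_General_100_BIN2_UTF8 exists, i.e. CTP 3.0 or later, and a database whose code page can represent "ß"):

-- "ß" and "ss" compare as equal under linguistic rules, but not under the binary collation:
SELECT CASE WHEN 'ss' = 'ß' COLLATE Latin1_General_100_CI_AS_SC_UTF8
            THEN 'equal' ELSE 'not equal' END AS [Linguistic], -- equal
       CASE WHEN 'ss' = 'ß' COLLATE Latin1_General_100_BIN2_UTF8
            THEN 'equal' ELSE 'not equal' END AS [Binary];     -- not equal

-- the binary collation can also locate the U+0000 / CHAR(0) character:
SELECT CHARINDEX(CHAR(0),
                 ('ab' + CHAR(0) + 'cd') COLLATE Latin1_General_100_BIN2_UTF8)
       AS [NulPosition]; -- should be 3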

So, I filed the following Feedback suggestion: There are no binary Collations that support the UTF8 encoding!!! Why not?

CTP 2.3 introduced the long-awaited BIN2 collation: UTF8_BIN2. This was a definite improvement over not having one, but there were a few issues that absolutely needed to be addressed.

Good

  1. It exists (Yay!)
  2. No BIN collation to confuse people with.
  3. Only one BIN2 collation; no BIN2 per each culture, which would have added no benefit and likely would have confused people wondering what the difference was between the various UTF8_BIN2 collations.

Not So Good

  1. Version 80 used instead of 140. This is a problem because version 80 is missing 649 uppercase / lowercase mappings that version 140 collations have. This affects the accuracy of the UPPER and LOWER functions.
  2. No version number in the name, so inconsistent with all other Windows collations.
  3. No culture name, so inconsistent with all other collations. Should be something like "Invariant" or "General", or even "Latin1_General_version#_" if need be (since the UTF8_BIN2 collation does have an LCID of 1033, which is "Latin1_General").

    This is a MAJOR problem, actually. On the surface it appears to be nothing more than a different way to name the collation. But, there is an issue that is hidden by the fact that the version 80 collations don’t include the version number (i.e. "80") in the collation name. The last major update of collations was the version 100 collations that introduced many new collations as well as used a newer version of Unicode. Newer versions of Unicode have more sort weights defined, more upper-case / lower-case mappings defined, and some fixes to previous weights and/or mappings that might not have been correct. The version 100 collations were introduced in SQL Server 2008. SQL Server 2017 did introduce some version 140 collations with an even newer version of Unicode, but that was for only the Japanese collations. At some point, Microsoft is going to need to bring the rest of the version 100 collations up to the 140 level (or beyond). When that happens, all of the new collations will have the new version number in their names. That’s fine for all collations except UTF8_BIN2. SQL Server 2019, which is version 150, didn’t introduce new collations with newer Unicode sort weights (the UTF-8 collations are just the existing collations that do a different 8-bit encoding). So, assuming the next version of SQL Server does upgrade the Unicode info, that would be version 160. Since there is already a UTF8_BIN2 collation that isn’t going to have its definition changed, a new collation will be added, and the naming convention would make it: 160_UTF8_BIN2. That might look ok, but identifiers (i.e. names of things in the system), cannot begin with a number. This will be quite the difficult situation once somebody attempts to create this new collation.

Final Resolution

CTP 3.0 finally resolved 99% of the issues related to UTF-8 binary collations. The binary collation was updated to be: Latin1_General_100_BIN2_UTF8.

  1. Version is 100 instead of 80, so many more upper-case / lower-case mappings (even if not as many as version 140)
  2. Name is now prefixed with a culture name, which is not only consistent with all other collations, but also avoids the problem of ending up with a version of the UTF-8 binary collation that begins with a number (again, that would not be a valid identifier).

The only remaining issue is that a second binary collation is needed for the Turkish locale / culture, as the upper-case / lower-case mappings for "i" and "I" are not the same as in all other cultures. I submitted a feedback suggestion for this:

UTF-8 in SQL 2019: Add Turkish_100_BIN2_UTF8 collation to handle Turkish / Azeri case mappings for "i" and “I”

Finally, there’s a rather minor issue in that the new Latin1_General_100_BIN2_UTF8 collation does not work with the NTEXT datatype. This is a problem because the new UTF-8 collations are not supposed to affect NVARCHAR / UTF-16 data, and binary collations do not fully support / recognize Supplementary Characters, yet the Msg 4189 error (about the _SC and _UTF8 flags, shown earlier) is exactly what one gets when attempting to use Latin1_General_100_BIN2_UTF8 with NTEXT. Of course, the NTEXT datatype has been deprecated since the release of SQL Server 2005, so this is likely a rather low priority. Still, I filed a bug report for it here:

UTF-8 in SQL 2019: binary collation ( Latin1_General_100_BIN2_UTF8 ) should work with NTEXT instead of getting Msg 4189, Level 16 error


 

Things to Keep in Mind

General

  1. This is a storage format; operations requiring linguistic rules and/or normalization are still handled in UTF-16 which requires converting in memory each time.
  2. Despite what you have probably read and/or heard, UTF-8 was not designed to save space. It was designed to have ASCII compatibility while also supporting Unicode. Sure, it might save space in various scenarios, but it might also hurt performance, allow for some data loss, or allow for minor truncation.
  3. Operations requiring in-memory conversion to UTF-16 are slower (generally) than using NVARCHAR.
  4. There are still a few features that do not work at all with UTF-8 collations:
    1. In-memory OLTP   (NOTE: For clarification based on testing, please see "In-memory OLTP" section)
    2. “There is currently no UI support to choose UTF-8 enabled collations in Azure Data Studio or SQL Server Data Tools (SSDT).” (according to the Release Notes)
    3. Always Encrypted when using with Secure Enclaves
    4. In SQL Server Management Studio (SSMS), prior to version 18, the UTF-8 collations are not available in drop-down lists (and this won’t be fixed; upgrade your SSMS)
  5. If the goal is simply to reduce space taken up for Unicode data, then it would be far better to expand data compression to work with NVARCHAR(MAX) (ideally including off-row data, but even if only working with in-row data) because that would:
    1. apply to many more use-cases
    2. do better data compression than the best-case for UTF-8 (reduces space for more than just standard ASCII characters)
    3. not have the performance hit
    4. not require any re-coding of apps
    5. not introduce as many bugs
    6. would not have the potential for customers to accidentally bloat their systems or degrade performance without even getting any space savings
       
      Please vote for the following suggestion:
      “unicode compression nvarchar(max)” — this would save more space than UTF-8

Operational

Here are some operational quirks that you might run into when using a UTF-8 Collation. These are most likely not bugs (if they are related to features), but either way, they are not likely to be fixed. Still, it’s possible that you have legacy systems and/or 3rd party applications (some that you cannot change) using these features.

  1. Data loss from Mixing UTF-8 string literals and/or variables (due to the current database having a UTF-8 default collation) and non-UTF-8 VARCHAR columns:

    When mixing Unicode and non-Unicode data, you usually expect the non-Unicode values to be converted to Unicode and the resulting value to be Unicode such that there is no data loss of Unicode-only characters. Prior to SQL Server 2019 introducing the UTF-8 collations, this was always the case due to Datatype Precedence converting VARCHAR into NVARCHAR (since NVARCHAR was the only way to have Unicode data). But now, thanks to the UTF-8 collations, we can store Unicode data as VARCHAR (which is the reason the UTF-8 collations were added). BUT, if we have two VARCHAR values, one Unicode and one non-Unicode, then Datatype Precedence never comes to the rescue because the datatype is the same for both values.

    Now it is only the collation that’s different, so Collation Precedence takes over. When a column is used with, or compared to, a literal or variable having a different collation, the collation of the literal or variable will be changed to match the collation of the column. And, just like with any other code page conversion, any character in the source code page (the literal or variable in this case) that is not in the target code page (the column in this case) will either be converted to a similar character IF a “best fit” mapping exists for the source character, or it will be converted into the default replacement character, which in most cases is a question mark "?".

    DECLARE @DataLoss TABLE
    (
      [CP1252] VARCHAR(50) COLLATE Latin1_General_100_CI_AS,
      [UTF8] VARCHAR(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    );
    
    DECLARE @Test VARCHAR(5) = NCHAR(0xD861) + NCHAR(0xDD77);-- 𨕷 (U+28577)
    
    INSERT INTO @DataLoss ([CP1252], [UTF8]) VALUES ('??', @Test);
    
    SELECT [UTF8], [UTF8] + '𨕷' FROM @DataLoss WHERE [UTF8] = @Test;
    -- 𨕷    𨕷𨕷
    
    SELECT [CP1252], [CP1252] + '𨕷' FROM @DataLoss WHERE [CP1252] = @Test;
    -- ??    ????
    

    As you can see above, the variable compared to, and concatenated with, the Code Page 1252 column produces two question marks (two instead of one due to the character being a Supplementary Character, which is encoded as a Surrogate Pair in UTF-16, hence the two NCHAR() calls). You can even see this in the execution plan for the second SELECT statement, as it shows the second field with the concatenated Supplementary Character already converted into two question marks:

    <ScalarOperator ScalarString="[CP1252]+'??'">
    

     
    I reported this here. (To be fair, this behavior can also happen with Double-Byte Character Sets, so is not new with the UTF-8 collations)
     

  2. Minor truncation from Mixing non-UTF-8 string literals and/or variables with UTF-8 columns (and, in some cases, variables):

    Truncation occurs when characters being converted to UTF-8 require more bytes in UTF-8 than they did in their source encoding, and the target VARCHAR column / variable / literal does not have enough available space to contain all of those bytes. This can happen with Extended ASCII characters (values 128 – 255), two-byte sequences in Double-Byte Character Sets, and most Unicode BMP characters (code points U+0800 / 2048 – U+FFFF / 65535).

    In some instances it’s possible to have values truncated when moving from NVARCHAR to UTF-8 VARCHAR. Typically this is not a problem because Datatype Precedence will convert the VARCHAR into NVARCHAR. However, this cannot happen if the value is being stored in a VARCHAR column or variable.

    The following example needs to be executed in a database that has a UTF-8 default collation:

    -- Code Point U+20BE requires either NVARCHAR(1), or VARCHAR(3) for UTF-8
    DECLARE @UTF8 VARCHAR(5) = NCHAR(0x20BE) + N'1234',
            @UTF16 NVARCHAR(5) = NCHAR(0x20BE) + N'1234';
    
    SELECT @UTF8 AS [UTF-8], @UTF16 AS [UTF-16];
    /*
    UTF-8    UTF-16
    ₾12      ₾1234
    */
    

    The next few examples need to be executed in a database that is not using a UTF-8 default collation, so we will create the DB first. The main issue here is that string literals are assumed to have a max length equal to their current length in bytes. For most Code Pages, that’s one byte per character. This doesn’t leave any extra space for the additional bytes that might be required for the same character(s) when converting that data to UTF-8:

    -- USE [master]; DROP DATABASE [CodePage1255];
    CREATE DATABASE [CodePage1255] COLLATE Hebrew_100_CI_AS;
    GO
    USE [CodePage1255];
    
    --------------------------
    
    SELECT '§x' COLLATE Hebrew_100_CI_AS_SC_UTF8;
    -- § (2-byte character fits in the two allocated bytes, leaving no room for 2nd char)
    
    SELECT 'x§' COLLATE Hebrew_100_CI_AS_SC_UTF8;
    -- x (1-byte character fits in the two allocated bytes, leaving only 1 byte left;
    -- not enough room for 2-byte character so 2nd byte is left empty)
    
    --------------------------
    
    DECLARE @TooSmall VARCHAR(2);
    
    SET @TooSmall = '§x';
    SELECT @TooSmall, @TooSmall COLLATE Hebrew_100_CI_AS_SC_UTF8;
    -- §x    §
    
    --------------------------
    
    -- DROP TABLE #Test;
    CREATE TABLE #Test
    (
        [UTF8] VARCHAR(5) COLLATE Hebrew_100_CI_AS_SC_UTF8
    );
    
    INSERT INTO #Test ([UTF8]) VALUES (NCHAR(0x20BE)); -- 3 bytes in UTF-8, 2 bytes in UTF-16
    INSERT INTO #Test ([UTF8]) VALUES (NCHAR(0x20BE) + NCHAR(0x20BE)); -- 6 bytes in UTF-8
    /*
    Msg 2628, Level 16, State 1, Line XXXXX
    String or binary data would be truncated in table 'tempdb.dbo.#Test_..._000000000003',
        column 'UTF8'. Truncated value: '₾'.
    */
    
    INSERT INTO #Test ([UTF8])
      SELECT CHAR(224) + CHAR(225) + CHAR(226) + CHAR(227) + CHAR(228); -- אבגדה
    
    SELECT * FROM #Test;
    /*
    ₾
    אב
    */
    
    
    SELECT [UTF8] + '-אבגדה' AS [Truncated]
    FROM   #Test;
    /*
    ₾-אב
    אב-אב
    */
    

     
    I reported this here. (To be fair, this behavior can also happen with Double-Byte Character Sets, so is not new with the UTF-8 collations)
     

  3. Invalid byte sequences in UTF-8 can throw an error instead of returning the default replacement character "�":

    All non-UTF-8 encodings supported by SQL Server handle invalid byte sequences by returning either nothing (for incomplete multi-byte sequences), a non-printable (i.e. invisible) character, a question mark ("?"), or the Unicode default replacement character ("�"). For example:

    DECLARE @Invalid TABLE
    (
      [CP1252]       VARCHAR(10) COLLATE Latin1_General_100_CI_AS,
      [CP1252-Bytes] AS (CONVERT(VARBINARY(10), [CP1252])),
      [CP949]        VARCHAR(10) COLLATE Korean_100_CI_AS, -- Double-Byte
      [CP949-Bytes]  AS (CONVERT(VARBINARY(10), [CP949])),
      [UTF16]        NVARCHAR(10) -- collation is irrelevant in this case
    );
    
    INSERT INTO @Invalid ([CP1252], [CP949], [UTF16])
    VALUES (0x90, 0x90, 0xFFFF), -- 0x90 = incomplete CP949 2-byte sequence
           (0x8F, 0xAFAF, 0xDDDDDDDD),
           (0x9D, 0x9D5D, 0xDDDD5D); -- 0x5D = incomplete UTF-16 sequence
    
    SELECT * FROM @Invalid;
    /*
    CP1252    CP1252-Bytes    CP949    CP949-Bytes    UTF16
              0x90                     0x             □
              0x8F            ?        0xAFAF         ��
              0x9D            ?        0x9D5D         �]
    */
    

    The UTF-8 collations, however, are not as friendly, and can throw a hard error if the string contains certain invalid UTF-8 byte sequences. Other invalid sequences will usually result in either nothing (for incomplete multi-byte sequences) or the Unicode default replacement character ("�"). For example:

    DECLARE @InvalidUTF8 TABLE
    (
      [ID]    INT NOT NULL,
      UTF8    VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
      [Bytes] AS (CONVERT(VARBINARY(50), [UTF8]))
    );
    
    -- End with incomplete two-byte sequence:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (1, 0x4546C2);
    -- End with one "additional" byte:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (2, 0x454680);
    -- End with two invalid bytes:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (3, 0x4546C0C0);
    -- End with valid byte after two invalid bytes:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (4, 0x4546C0C047);
    -- End with three "additional" bytes:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (5, 0x4546808080);
    -- End with valid byte after three "additional" bytes (invalid):
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (6, 0x454680808047);
    -- End with incomplete three-byte sequence:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (7, 0x45E182);
    -- End with valid byte after incomplete three-byte sequence:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (8, 0x45E18246);
    -- End with four "additional" bytes:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (9, 0x4580808080);
    /*
    Msg 9833, Level 16, State 2, Line XXXXX
    Invalid data for UTF8-encoded characters
    */
    -- Invalid byte:
    INSERT INTO @InvalidUTF8 ([ID], [UTF8]) VALUES (10, 0x80);
    /*
    Msg 9833, Level 16, State 2, Line XXXXX
    Invalid data for UTF8-encoded characters
    */
    
    
    SELECT * FROM @InvalidUTF8 ORDER BY [ID];
    /*
    ID    UTF8      Bytes
    1     EF        0x4546
    2     E         0x45
    3     E         0x45
    4     EF��G     0x4546C0C047
    5     E         0x45
    6     EF���G    0x454680808047
    7     E         0x45
    8     E�F       0x45E18246
    */
    

    As you can see in the results above, rows 1, 2, 3, 5, and 7 succeed, though the invalid sequences have been fully removed. Yet, rows 4, 6, and 8 show those same invalid sequences being returned as one or more "�" characters. AND, rows 9 and 10 never get inserted at all due to producing errors.
     
    I reported this here.
     

  4. TEXT / NTEXT datatypes in temporary tables (when Instance-level Collation is "_UTF8") and table variables (when Database-level Collation is "_UTF8"):

      CREATE TABLE [#D'oh] (col1 NTEXT);
      /*
      Msg 4188, Level 16, State 1, Line XXXXX
      Column or parameter 'col1' has type 'ntext' and collation
         'Japanese_XJIS_140_CI_AI_VSS_UTF8'. The legacy LOB types do not
         support UTF-8 or UTF-16 encodings. Use types varchar(max),
         nvarchar(max) or a collation which does not have the _SC or
         _UTF8 flags.
      */
      

    Of course, there is an easy work-around for these errors: for columns using either of these datatypes, use the COLLATE keyword to specify a non-UTF-8, non-"_SC" Collation (e.g. "COLLATE Latin1_General_100_CI_AS").
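
    For example, here is a minimal sketch of that work-around applied to the temporary table shown above:

      CREATE TABLE [#D'oh] (col1 NTEXT COLLATE Latin1_General_100_CI_AS); -- succeeds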

  5. sys.fn_trace_gettable: The "TextData" column of the result set uses the NTEXT datatype. This should only be an issue when using a UTF-8 Collation at the Instance level. There does not seem to be a work-around for this because, according to:

    EXEC sp_helptext N'sys.fn_trace_gettable';
    

    it’s selecting from the system function: OpenRowset(TABLE TRCTABLE, @filename, @numfiles) . Of course, the documentation does state that this function is deprecated and that you should instead be using Extended Events.

  6. sys.fn_get_sql: The "text" column of the result set uses the TEXT datatype (which was wrong even before this function was deprecated since the datatype should have been NTEXT 😲 ):

      SELECT * FROM sys.fn_get_sql(0x03);
      /*
      Msg 4189, Level 16, State 0, Procedure fn_get_sql,
         Line XXXXX [Batch Start Line YYYYY]
      Cannot convert to text/ntext or collate to
         'Japanese_XJIS_140_CI_AI_VSS_UTF8' because these legacy LOB
         types do not support UTF-8 or UTF-16 encodings. Use types
         varchar(max), nvarchar(max) or a collation which does not
         have the _SC or _UTF8 flags.
      Msg 4413, Level 16, State 1, Line XXXXX
      Could not use view or function 'fn_get_sql' because of binding errors.
      */
      

    This should only be an issue when using a UTF-8 Collation at the Instance level. Fortunately, there’s a fairly easy work-around for this one: switch to using sys.dm_exec_sql_text() (which is all that sys.fn_get_sql() does anyway, according to: EXEC sp_helptext N'sys.fn_get_sql'; ).
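
    For example, here is a quick sketch of that replacement (joining through sys.dm_exec_requests is just one way to obtain a sql_handle; the "text" column returned by sys.dm_exec_sql_text() is NVARCHAR(MAX), so the legacy LOB error does not apply):

      SELECT [req].[session_id],
             [txt].[text]
      FROM   sys.dm_exec_requests [req]
      CROSS APPLY sys.dm_exec_sql_text([req].[sql_handle]) [txt];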

You can use the following query to check your system to see if there are any columns using either the TEXT or NTEXT datatypes. It does not check stored procedure / trigger / function / SQL Agent T-SQL job step content to see if these datatypes are being used in temporary tables, table variables, or permanent tables being created or altered, nor does it check T-SQL code submitted by anything outside of SQL Server (such as your app code). So be sure to test, test, test (which you were going to do anyway, right?).

Now, to be fair, these same quirks exist when using any Supplementary Character-Aware Collation (i.e. one with either "_SC" or "_140_" in its name) at the Instance or Database level.

Please note that the following query only reports objects for the database in which it’s being executed (also, the two deprecated functions mentioned above will show up in all databases as they are global).

SELECT OBJECT_SCHEMA_NAME(ac.[object_id])
                 COLLATE Latin1_General_100_CI_AS_SC AS [SchemaName],
       ao.[name] COLLATE Latin1_General_100_CI_AS_SC AS [ObjectName],
       ao.[type_desc] AS [ObjectType],
       ac.[name] COLLATE Latin1_General_100_CI_AS_SC AS [ColumnName],
       ac.[system_type_id],
       ac.[collation_name]
FROM   sys.all_columns ac
INNER JOIN sys.all_objects ao
        ON ao.[object_id] = ac.[object_id]
WHERE  ac.[system_type_id] IN (35, 99) -- 35 = TEXT ; 99 = NTEXT
ORDER BY [SchemaName] ASC,
         [ObjectName] ASC,
         [ColumnName] ASC;


 

Recommended Uses / Guidance

The UTF-8 encoding, being a variable-length encoding, can be a huge benefit in some scenarios, but it can also make things worse in others. While it is required for projects that need to support Unicode while maintaining ASCII compatibility, its usefulness as a means of saving space is rather limited given that Data Compression and Clustered Columnstore Indexes are available across all editions of SQL Server. The only scenario that truly benefits from a UTF-8 encoding for the purpose of saving space is one in which all of the following conditions are true:

  1. Data is mostly standard ASCII (values 0 – 127), but either has, or might have, a small amount of a varying range of Unicode characters (more than would be found on a single 8-bit Code Page, or might not exist on any 8-bit Code Page)
  2. Column is currently (or otherwise would be) NVARCHAR(MAX) (meaning, data won’t fit into NVARCHAR(4000))
  3. There is a lot of data for this column or set of columns (1 GB or more when stored in NVARCHAR)
  4. Performance would be negatively impacted by making the table a Clustered Columnstore table (due to how the table is used) OR data is typically < 8000 bytes
  5. There is no desire to make the column VARBINARY(MAX), use COMPRESS() for INSERT and UPDATE operations, and use DECOMPRESS() for SELECT queries (no need to worry about not being able to index the VARBINARY value since it is MAX data that cannot be indexed anyway). Keep in mind that the Gzipped value will be much smaller than even the UTF-8 version of the string, though it would require decompressing before the value could be filtered on (outside of "=") or manipulated (a sketch of this approach follows the list).
  6. The benefits of reducing the size of backups, the time it takes to back up and restore, and the impact on the buffer pool outweigh the cost of the likely negative impact on query performance (for both CPU and elapsed times). Just keep in mind that Backup Compression (available in Enterprise and Standard Editions) might help here.
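
Here is a minimal sketch of the COMPRESS / DECOMPRESS approach mentioned in item # 5 above (the table, column, and sample value are purely illustrative):

CREATE TABLE #CompressedDemo
(
    [ID]   INT IDENTITY(1, 1) NOT NULL,
    [Body] VARBINARY(MAX) NULL -- holds the Gzipped bytes of the NVARCHAR value
);

INSERT INTO #CompressedDemo ([Body])
VALUES (COMPRESS(N'Mostly ASCII text with the occasional 𨕷 character.'));

SELECT [ID],
       CAST(DECOMPRESS([Body]) AS NVARCHAR(MAX)) AS [Body]
FROM   #CompressedDemo;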

Storing HTML pages is a good example of a scenario that fits this description. UTF-8 is, of course, the preferred encoding for the interwebs precisely due to it using the minimal space for the most common characters while still allowing for the full range of Unicode characters.

Additional points to consider:

  • If you are going to use a UTF-8 Collation on one or more columns, make no assumptions about size reduction or better performance! Be sure to capture current size and performance stats while the column is NVARCHAR, and then compare both after changing to UTF-8 (one way to capture the "before" size is sketched after this list).
     
  • If you are going to use UTF-8 on some columns, you should probably use it on all columns and as the database’s default collation in order to avoid the possibility (even if only minor) of truncation. At the very least, be sure to thoroughly test all code paths in your application to make sure that none of them exhibit the truncation behavior.
     
  • If you are going to use UTF-8 as the database’s default collation, you should probably use it on all VARCHAR columns in order to avoid the possibility of data loss. Or else, at the very least, be sure to thoroughly test all code paths in your application to make sure that none of them exhibit the data-loss behavior.
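
One possible way to capture the "before" size mentioned in the first bullet above (dbo.YourTable is a placeholder name):

EXEC sys.sp_spaceused N'dbo.YourTable';

-- Or, for per-index detail:
SELECT [ps].[index_id],
       SUM([ps].[used_page_count]) * 8 AS [UsedKB],
       SUM([ps].[row_count])           AS [Rows]
FROM   sys.dm_db_partition_stats [ps]
WHERE  [ps].[object_id] = OBJECT_ID(N'dbo.YourTable')
GROUP BY [ps].[index_id];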


 

Conclusion

It depends (HA!). UTF-8 can be really good in some scenarios, but also quite bad in others. The success or failure of this feature (i.e. the UTF-8 collations) is determined by how it is used.

Savior

There are two use-cases that the UTF-8 encoding is a very good match for:

  1. Existing app code where all of the following conditions are true:
    1. uses 8-bit strings / VARCHAR
    2. needs to now support Unicode
    3. app code cannot be changed (3rd party, legacy, would take too long to convert app and DB to NVARCHAR)
  2. App code (new or existing) that needs to:
    1. support multiple RDBMSs.
    2. not rely on NVARCHAR datatypes and "N"-prefixed string literals.

False Prophet

Being the preferred encoding for the web, and even the default encoding for Linux / Unix (at least some flavors), does not imply that UTF-8 is truly useful in all scenarios. UTF-8 is widely misunderstood as a means of saving space. That was not a design goal of UTF-8, so it should not be entirely surprising if using it doesn’t save space and/or hurts performance a little. Of course, UTF-8 might work great for you, but that is not something you should assume.

Within the context of SQL Server, UTF-8 has the potential for helping a very specific, but not unimportant, use case: NVARCHAR(MAX) columns, where Unicode Compression doesn’t work. If your goal is purely space-savings (and not ASCII compatibility), then please vote for the following two suggestions to let Microsoft know that they should prioritize them:

  • getting Unicode Compression (part of Data Compression) to work on at least in-row values of NVARCHAR(MAX), if not also off-row values (please vote for: “unicode compression nvarchar(max)” — this would save more space than UTF-8)
  • getting Data Compression to work for off-row LOB / Row-Overflow values (please vote for: “Page Compression for Out-of-Row LOB Storage” — this would save space for more than just Unicode strings: it would also apply to VARCHAR, VARBINARY, and XML)

 

Finally, while this is of little immediate consequence, we should not ignore the long-term impact of adding "_UTF8" to the collation names. Adding the UTF-8 collations was a 39% increase in the number of collations. Now, if another option is to be added later, there could be up to 3982 collation names to duplicate, depending on which collations would get the new flag.


 

Importing and Exporting

Please note that all of the following options were available at least two versions prior to SQL Server 2019.

sqlcmd Utility

SQLCMD.EXE is a command-line utility mainly used to execute T-SQL scripts and ad hoc queries, but can also be used to import and export data. You specify UTF-8 via the "-f" switch (e.g. -f 65001 ).

UTF-8 encoded output files will include the three-byte Byte Order Mark (BOM).

sqlcmd -Q "SELECT NCHAR(0xD83C) + NCHAR(0xDF2D) AS [Hot Dog (U+1F32D)];" -o C:\temp\_sqlcmd_out.txt -f 65001

I’m not entirely sure when support for reading from, and writing to, UTF-8 encoded files was added. I have been able to export UTF-8 using bcp version 11.0.2270.0, published on 2013-01-14.

bcp Utility

BCP.EXE (short for “Bulk CoPy”) is a command-line utility used to import and export data. Reading from, and writing to, UTF-8 encoded files (code page 65001) is handled via the "-C" switch (e.g. -C 65001 ).

UTF-8 encoded output files will not include the three-byte Byte Order Mark (BOM).

bcp "SELECT NCHAR(0xD83C) + NCHAR(0xDF2D) AS [Hot Dog (U+1F32D)];" queryout C:\temp\_bcp_out.txt -T -c -C 65001

bcp initially added support for UTF-8 in SQL Server 2016. Then, it was also added to SQL Server 2014, Service Pack 2 (SP2). It’s possible that it got back-ported even farther as I have been able to export UTF-8 using bcp version 11.0.2270.0 .

BULK INSERT

BULK INSERT is a T-SQL command that is used to import data (not export). It is very similar to importing via the command-line bcp utility, but from within SQL Server. Specify UTF-8 using WITH (CODEPAGE = 65001).

BULK INSERT [SchemaName].[TableName] FROM 'C:\temp\_bulk_insert.txt' WITH (CODEPAGE = 65001);

Support for UTF-8 in BULK INSERT was initially added in SQL Server 2016. Then, it was also added to SQL Server 2014, Service Pack 2 (SP2).

XML

The XML datatype was introduced in SQL Server 2005, and it allows for converting many encodings, including UTF-8, into UTF-16. Please note that this “trick” only works for converting from various source encodings into UTF-16 Little Endian (as that’s how the XML datatype in SQL Server stores strings internally). This method cannot be used to convert UTF-16 into UTF-8 or any other encoding.

The steps to do this are:

  1. Append UTF-8 encoded bytes to the following string literal:
        '<?xml version="1.0" encoding="UTF-8"?>'
        
  2. The string literal (directly above) and the resulting string (whether a literal or variable) must be VARCHAR. Using an NVARCHAR variable, or prefixing any string involved in the concatenation with an upper-case "N", will cause the resulting string to be NVARCHAR, and that will cause an encoding error (since NVARCHAR is UTF-16, yet in the XML declaration we are stating that the encoding is UTF-8).
  3. Convert the resulting string into XML.
  4. Convert the XML value into NVARCHAR.

Here is an example of this method being used:
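
The following is a minimal sketch of those four steps; the byte sequence 0xC3A9 is simply the UTF-8 encoding of "é" (U+00E9), chosen purely for illustration:

DECLARE @Utf8Bytes VARBINARY(100) = 0xC3A9; -- UTF-8 bytes for "é" (U+00E9)

DECLARE @Result NVARCHAR(MAX) =
    CONVERT(NVARCHAR(MAX),                           -- step 4: XML value -> NVARCHAR
        CONVERT(XML,                                 -- step 3: VARCHAR string -> XML
            '<?xml version="1.0" encoding="UTF-8"?>' -- step 1: the XML declaration
              + CONVERT(VARCHAR(MAX), @Utf8Bytes))); -- step 2: keep everything VARCHAR

SELECT @Result AS [ConvertedToUTF16]; -- é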






 

BUG / RESTRICTION / ISSUE TRACKING

The following is a consolidated list of the various issues surrounding the UTF-8 collations:

  • In-memory OLTP [Open 😿]
  • UTF-8 in SQL 2019: Add Turkish_100_BIN2_UTF8 collation to handle Turkish / Azeri case mappings for "i" and “I” [Open 😿]
  • Inconsistent handling of invalid UTF-8 byte sequences [Open? 😿]
  • SSMS Always Encrypted wizard makes incorrect collation change for UTF8 columns [Open or fixed in RC1? 😿]
  • SqlClient incorrectly decrypted and encrypted Always Encrypted data for UTF-8 collations [Open or fixed in RC1? 😿]
  • UTF-8 in SQL 2019: binary collation ( Latin1_General_100_BIN2_UTF8 ) should work with NTEXT instead of getting Msg 4189, Level 16 error [Open 😿]
  • Collation Precedence allows for Unicode character data loss concatenating UTF8 with higher precedence non-UTF8 VARCHAR [Won't Fix 🙀]
  • Collation Precedence allows for silent truncation (data loss) implicitly converting Extended ASCII characters into UTF8 [Won't Fix 🙀]
  • CHAR() function broken in databases with “_UTF8” default Collation [Won't Fix 🙀]
  • Can’t specify UTF-8 collation at the instance (server) level [Fixed in: CTP 2.1]
  • NVARCHAR to VARCHAR(MAX) can get Msg 8152 "String or binary data would be truncated" [Fixed in: CTP 2.1]
  • Invalid UTF-8 bytes get (Msg 682, Level 22, State 148) Internal error. Buffer provided to read column value is too small. OR get mysterious, changing 3rd byte of 2-byte value [Fixed in: CTP 2.1]
  • “An error occurred while executing batch. Error message is: Internal connection fatal error. Error state: 15, Token : 2” (disconnect) [Fixed in: CTP 2.1]
  • UTF-8 Collations don’t work with Replication [Fixed in: CTP 2.2]
  • CHARACTER_SET_NAME of INFORMATION_SCHEMA.COLUMNS is NULL when UTF-8 is used [Fixed in: CTP 2.3]
  • Install fails with "Could not find the Database Engine startup handle." if a Unicode-only Collation is selected [Fixed in: CTP 2.5]
  • There are no binary Collations that support the UTF8 encoding!!! Why not? [Fixed in: CTP 3.0]
  • Linked Servers [Fixed in: CTP 3.2]
  • Two collation problems with external tables in SQL 2019 [Fixed in: RC1]


 

Resources


  1. According to Unicode Technical Note #14: A SURVEY OF UNICODE COMPRESSION (the linked PDF in this link):
    Because of the 1-byte representation of ASCII, many English-speaking users have come to regard UTF-8 as a compression format. This notion is puzzling to speakers of French and other Latin-script languages that use many non-ASCII characters, and ridiculous to users of almost every other script on Earth, for which UTF-8 uses at least one byte per character more than legacy encodings.
     
    Although UTF-8 is indeed “not extravagant” in size, there is a big difference between avoiding wastefulness and achieving compression. The primary reason to use UTF-8 should be to maintain ASCII transparency, not to achieve compression.
     
  2. According to Unicode Technical Note #12: UTF-16 FOR PROCESSING, Section 3: “UTF-8: Good for Compatibility”:
    UTF-8 was mainly designed to store Unicode filenames in an ASCII-friendly way. It is suitable for processing, but it is significantly more complex to process than UTF-16…
     
    …UTF-8 stores Latin text in a compact form compared to UTF-16 but does not provide any advantages for other scripts or even uses more space. There are other Unicode charsets (for example, SCSU and BOCU-1) that are more efficient for storage and data exchange.

    Please note that “SCSU” is the “Standard Compression Scheme for Unicode”, and “BOCU-1” is the “MIME-compatible application of the Binary Ordered Compression for Unicode (BOCU) algorithm”.
      

11 thoughts on “Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?”

  1. As someone who has dedicated hours of personal research into understanding the differences between Unicode and UTF-8 in general, and with Sql Server in particular, your opening paragraphs are fantastically clear and well explained. Much appreciated. Working with federal data that is English only, we too have been plagued with the issue where 99% of strings are varchar but occasionally some esoteric Hawaiian characters not covered by the Latin collation show up. Here’s hoping the Sql Server team addresses the issues you found and gets this right before final release.

  2. It’s worth mentioning that ISO has some conventions where their encodings, such as metric units, have to use the basic Latin alphabet and a set of punctuation marks, which are roughly UTF–8. Unicode is required to support this little subset in all of its languages, so that standard encodings can be used anywhere on earth.

    1. Hi Joe. I think you are misstating / misunderstanding what is going on here. I thought I had pointed out how the 1-byte range of UTF-8 (which is really only half of what 1 byte can represent) is exactly the same as 7-bit US-ASCII. It seems that I did not include that info before, but I have updated in a few places to be clearer about that association. That range, especially values 32 – 126 (see the chart just above the “Setting and Storing UTF-8 Values” section), are common across most code pages / character sets. That commonality across code pages is what made those particular characters become the only allowable characters across many standards and RFCs, many of which were published prior to the existence of UTF-8, which came out circa 1996. For example, the RFC for SMTP, published in 1982, was updated in 2008 and still refers to US-ASCII: RFC for SMTP (search for “ascii”).

      One of the design requirements of UTF-8 was that it work seamlessly on existing systems (UNIX in particular), without needing to convert any existing text files. This is why characters encoded using just the first 7 bits of UTF-8 match exactly to the 7-bit US-ASCII encoding. So it’s not that the subset is “roughly UTF-8”, since UTF-8 can encode the entire 1.1+ million character range of Unicode; it’s that a subset of UTF-8 is exactly standard ASCII.

      Also, Unicode doesn’t have languages; it has encodings. And there are only three (mainly): UTF-8, UTF-16, and UTF-32. But, due to endianness, both UTF-16 and UTF-32 have two different physical representations, making it 5 encodings in practice. BUT, it is only UTF-8 that has the ability to encode that subset of characters to be physically the same as 7-bit standard ASCII. For example, Latin Upper-case “A” is 0x41 in both US-ASCII and UTF-8. Yet it is either 0x0041 or 0x4100 in UTF-16, depending on endianness. While the non-zero byte of UTF-16 is the same as the only byte in UTF-8 and US-ASCII, it’s still 2 bytes vs 1 byte, and that makes a big difference when parsing a series of bytes.

      Finally, it is not the encodings that are standard (since there are 5 of them). It is the character set of Unicode that is truly standard, as it is the same across all 5 encodings. For example, “Upside-Down Face” is Code Point U+1F643 across all five encodings (see here – scroll down to the “Encodings” section).
