All New Collations in SQL Server 2017 Implicitly Support Supplementary Characters

As I was testing whether or not the new “Variation Selector Sensitive” ( _VSS ) Collations in SQL Server 2017 would assist in the following question on DBA.StackExchange:

SELECT query: filter unable to distinguish between two different characters

I ran into what initially looked to be an unfortunate situation. I wanted to see the full list of newly added Collations, so I ran the following query:

SELECT col.[name]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]140[_]%';

That query returns the following list of 68 Collations:

 
Japanese_Bushu_Kakusu_140_BIN Japanese_Bushu_Kakusu_140_BIN2 Japanese_Bushu_Kakusu_140_CI_AI Japanese_Bushu_Kakusu_140_CI_AI_KS Japanese_Bushu_Kakusu_140_CI_AI_KS_VSS Japanese_Bushu_Kakusu_140_CI_AI_KS_WS Japanese_Bushu_Kakusu_140_CI_AI_KS_WS_VSS Japanese_Bushu_Kakusu_140_CI_AI_VSS Japanese_Bushu_Kakusu_140_CI_AI_WS Japanese_Bushu_Kakusu_140_CI_AI_WS_VSS Japanese_Bushu_Kakusu_140_CI_AS Japanese_Bushu_Kakusu_140_CI_AS_KS Japanese_Bushu_Kakusu_140_CI_AS_KS_VSS Japanese_Bushu_Kakusu_140_CI_AS_KS_WS Japanese_Bushu_Kakusu_140_CI_AS_KS_WS_VSS Japanese_Bushu_Kakusu_140_CI_AS_VSS Japanese_Bushu_Kakusu_140_CI_AS_WS Japanese_Bushu_Kakusu_140_CI_AS_WS_VSS Japanese_Bushu_Kakusu_140_CS_AI Japanese_Bushu_Kakusu_140_CS_AI_KS Japanese_Bushu_Kakusu_140_CS_AI_KS_VSS Japanese_Bushu_Kakusu_140_CS_AI_KS_WS Japanese_Bushu_Kakusu_140_CS_AI_KS_WS_VSS Japanese_Bushu_Kakusu_140_CS_AI_VSS Japanese_Bushu_Kakusu_140_CS_AI_WS Japanese_Bushu_Kakusu_140_CS_AI_WS_VSS Japanese_Bushu_Kakusu_140_CS_AS Japanese_Bushu_Kakusu_140_CS_AS_KS Japanese_Bushu_Kakusu_140_CS_AS_KS_VSS Japanese_Bushu_Kakusu_140_CS_AS_KS_WS Japanese_Bushu_Kakusu_140_CS_AS_KS_WS_VSS Japanese_Bushu_Kakusu_140_CS_AS_VSS Japanese_Bushu_Kakusu_140_CS_AS_WS Japanese_Bushu_Kakusu_140_CS_AS_WS_VSS Japanese_XJIS_140_BIN Japanese_XJIS_140_BIN2 Japanese_XJIS_140_CI_AI Japanese_XJIS_140_CI_AI_KS Japanese_XJIS_140_CI_AI_KS_VSS Japanese_XJIS_140_CI_AI_KS_WS Japanese_XJIS_140_CI_AI_KS_WS_VSS Japanese_XJIS_140_CI_AI_VSS Japanese_XJIS_140_CI_AI_WS Japanese_XJIS_140_CI_AI_WS_VSS Japanese_XJIS_140_CI_AS Japanese_XJIS_140_CI_AS_KS Japanese_XJIS_140_CI_AS_KS_VSS Japanese_XJIS_140_CI_AS_KS_WS Japanese_XJIS_140_CI_AS_KS_WS_VSS Japanese_XJIS_140_CI_AS_VSS Japanese_XJIS_140_CI_AS_WS Japanese_XJIS_140_CI_AS_WS_VSS Japanese_XJIS_140_CS_AI Japanese_XJIS_140_CS_AI_KS Japanese_XJIS_140_CS_AI_KS_VSS Japanese_XJIS_140_CS_AI_KS_WS Japanese_XJIS_140_CS_AI_KS_WS_VSS Japanese_XJIS_140_CS_AI_VSS Japanese_XJIS_140_CS_AI_WS Japanese_XJIS_140_CS_AI_WS_VSS Japanese_XJIS_140_CS_AS Japanese_XJIS_140_CS_AS_KS Japanese_XJIS_140_CS_AS_KS_VSS Japanese_XJIS_140_CS_AS_KS_WS Japanese_XJIS_140_CS_AS_KS_WS_VSS Japanese_XJIS_140_CS_AS_VSS Japanese_XJIS_140_CS_AS_WS Japanese_XJIS_140_CS_AS_WS_VSS
 

As you scroll through the list, does anything appear to be missing? Go ahead, take another look. I’ll wait :wink:. Figure it out? Yep, that’s right: none of those Collations end in “_SC“. Collations ending in “_SC” were added in SQL Server 2012 and support the full UTF-16 character set. Without the “_SC” ending, built-in functions and comparisons / sorting only fully support the base UCS-2 character set (i.e. the first 65,536 Unicode Code Points). You can, of course, store and view all Unicode Code Points, even Supplementary Characters, in non-“_SC” Collations, but they will be interpreted as being two separate “unknown” characters instead of as being one single character.

To illustrate this point, the following query shows: the two surrogate Code Points (not actual characters by themselves, but when used in pairs of any of the assigned combinations, produce a single character), the resulting Supplementary Character, and how the LEN function interprets that sequence in both non-“_SC” and “_SC” Collations.

SELECT NCHAR(0xD83D) AS [HighSurrogate],
       NCHAR(0xDC31) AS [LowSurrogate],
       NCHAR(0xD83D) + NCHAR(0xDC31) AS [Supplementary
 ͨ  ͪ  ͣ  ͬ  ͣ  ͨ  ͭ  ͤ  ͬ],
       LEN(NCHAR(0xD83D) + NCHAR(0xDC31)
              COLLATE Latin1_General_100_CI_AS) AS [Non-SC-Aware],
       LEN(NCHAR(0xD83D) + NCHAR(0xDC31)
              COLLATE Latin1_General_100_CI_AS_SC) AS [SC-Aware];

As you can see below, the individual surrogate Code Points display as the question mark in a dark diamond, yet combined they display as the “Cat Face” emoji ( U+1F431 ): 🐱 . You can also see that a non-“_SC” Collation counts the Surrogate Pair as 2 characters instead of 1 like the “_SC” Collation does.

SupplementaryCharacterResultSet2-650x71

With that in mind, I created the following query to test each Collation that is neither a SQL Server Collation nor one of the “_SC” Collations:

DECLARE @SQL NVARCHAR(MAX) = N'';

SELECT @SQL += N'IF (LEN(NCHAR(0xD83D) + NCHAR(0xDC31) COLLATE '
            + col.[name] + N') = 1) PRINT ''' + col.[name] + N''';'
            + NCHAR(0x000D) + NCHAR(0x000A)
FROM sys.fn_helpcollations() col
WHERE col.[name] NOT LIKE N'SQL[_]%'
AND col.[name] NOT LIKE N'%[_]SC'

--EXEC SQL#.Util_Print @SQL

EXEC (@SQL);

That query concatenates each Collation name into an IF statement that prints the name of the Collation only if the LEN function returns “1” when supplied the “cat face” character. The first three generated lines are:

IF (LEN(NCHAR(0xD83D) + NCHAR(0xDC31) COLLATE Albanian_BIN) = 1) PRINT 'Albanian_BIN';
IF (LEN(NCHAR(0xD83D) + NCHAR(0xDC31) COLLATE Albanian_BIN2) = 1) PRINT 'Albanian_BIN2';
IF (LEN(NCHAR(0xD83D) + NCHAR(0xDC31) COLLATE Albanian_CI_AI) = 1) PRINT 'Albanian_CI_AI';
...

The results of that query (the one that produced all of those IF statements) were 64 out of the 68 new Collations. The only Collations not returned were the four binary Collations (ending in either “_BIN” or “_BIN2“), and they weren’t expected to return since they, by their very nature, do not support Supplementary Characters.

Conclusion

While there is no mention of this behavior either on the Collation and Unicode Support page (under “Supplementary Characters” or “Japanese Collations added in SQL Server 2017”) or even the What’s new in Database Engine – SQL Server 2017 page, it appears that Microsoft is taking the approach that all new Collations inherently support Supplementary Characters. This is actually very much appreciated as there is no reason to ever not support Supplementary Characters. It just would have been nice to have had that noted somewhere in the official documentation.

I suppose I can submit an edit for the “Collation and Unicode Support” page.


UPDATE (2017-11-06)

As of 2017-10-31, edits that I submitted via Pull Request (PR) #175 to correct the two documentation pages noted above (in the Conclusion) have been merged into the main branch and are viewable to the public. If you want to see the specific changes I made to each page, click on the link for the PR and then go to the “Files changed” tab.

1 thought on “All New Collations in SQL Server 2017 Implicitly Support Supplementary Characters”

Thoughts? Questions? Comments? Suggestions? Words of praise?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s