Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)


(last updated: 2019-06-14 @ 00:03 EST / 2019-06-14 @ 04:03 UTC)

Binary collations are, in many ways, simpler to understand than non-binary collations because they don’t have any complex, culture-based linguistic rules. They just work with the character values and there is little room for mystery: all characters have underlying values; if two characters have the same underlying value then they are equal, else they are not; and the underlying value is the order, right?

Still, there are some complexities related to binary collations that you might not be aware of. To figure out what they are, we need to look at why there are so many binary collations in the first place. I mean, binary collations work on the underlying values of the characters, and comparing numbers doesn’t change between cultures or versions: 12 = 12, 12 > 11, and 12 < 13, always. So, then what is the difference between:

  • Latin1_General_100_BIN2 and either Hebrew_100_BIN2 or Turkish_100_BIN2 (only the culture is different), or
  • Latin1_General_100_BIN2 and Latin1_General_BIN2 (only the version is different), or
  • Latin1_General_100_BIN2 and Latin1_General_100_BIN (only the binary comparison type is different), or
  • Latin1_General_100_BIN2 and Latin1_General_100_BIN2_UTF8 (only the 8-bit encoding is different)

Let’s take a closer look at each of those four cases.

 

Different Cultures

From the names alone we can see that Latin1_General_100_BIN2, Hebrew_100_BIN2, and Turkish_100_BIN2 are all “_100_BIN2” collations, and the difference between them is just “Latin1_General” vs “Hebrew” vs “Turkish” (i.e. the culture). Any differences in behavior related to the culture depend on if we are working with Unicode or non-Unicode data.

 

NVARCHAR Data

For NVARCHAR data, a difference in culture is almost entirely meaningless: with very few exceptions, the behavior of all binary collations of the same version and same binary comparison type is exactly the same. However, it just so happens that two cultures / languages, Turkish and Azerbaijani / Azeri, have two separate “i” characters — dotted and dotless (like the Sneetches, but without the bigotry) — each having both uppercase and lowercase forms. In most Latin-based languages, lowercase “i” (dotted) and uppercase “I” (dotless) are two forms of the same character. But, in Turkish and Azerbaijani (“Azeri” in terms of SQL Server collation names):

  • lowercase “i” (dotted) has an uppercase form of “İ” (dotted; U+0130)
  • uppercase “I” (dotless) has a lowercase form of “ı” (dotless; U+0131)

Now, the default uppercase mapping of “i” is “I”, and the default lowercase mapping for “I” is “i”. This is the behavior for all cultures that do not override the default. Cultures that don’t use Latin-based characters don’t override these defaults, which is why they behave this same way in languages such as Hebrew, Arabic, Chinese, Japanese, etc. But Turkish and Azerbaijani (which has collations in SQL Server for both Cyrillic and Latin scripts) do override the default behavior, and so are able to work as expected for those cultures / languages.
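As a quick illustration of that override (a minimal example; the fuller test queries are below), we can compare the uppercase mapping of “i” under a default-behavior collation and a Turkish collation:

SELECT UPPER(N'i' COLLATE Latin1_General_100_BIN2) AS [Default],
       UNICODE(UPPER(N'i' COLLATE Latin1_General_100_BIN2)) AS [DefaultValue],
       UPPER(N'i' COLLATE Turkish_100_BIN2) AS [Turkish],
       UNICODE(UPPER(N'i' COLLATE Turkish_100_BIN2)) AS [TurkishValue];
/*
Default    DefaultValue    Turkish    TurkishValue
I          73              İ          304
*/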

We can test this behavior using the following two queries. Each one returns the name of any binary collation that does not return the default lowercase or uppercase value, and what the actual returned value was. Please keep in mind that the character values returned are in decimal form: 305 is 0x0131 in hex, and 304 is 0x0130 in hex (matching the code point values shown above).
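(As a quick aside, the decimal / hex equivalence itself is easy to verify:)

SELECT CONVERT(BINARY(2), 305) AS [Decimal305],
       CONVERT(INT, 0x0130) AS [Hex0130];
-- 0x0131    304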

Check for lowercase dotless “i”

DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1,
        @Lower INT;
';

-- Append one test block per binary collation:
SELECT @SQL += REPLACE(N'
  SET @Lower = UNICODE(LOWER(N''I'' COLLATE {Name}));
  IF(@Lower <> UNICODE(''i''))
  BEGIN
    RAISERROR(N''%4d.  %-55s (%d)'', 10, 1, @Counter, N''{Name}'', @Lower)
      WITH NOWAIT;
    SET @Counter += 1;
  END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]BIN%'
ORDER BY col.[name];

--PRINT @SQL;
EXEC (@SQL);
/*
   1.  Azeri_Cyrillic_100_BIN                                  (305)
   2.  Azeri_Cyrillic_100_BIN2                                 (305)
   3.  Azeri_Latin_100_BIN                                     (305)
   4.  Azeri_Latin_100_BIN2                                    (305)
   5.  Turkish_100_BIN                                         (305)
   6.  Turkish_100_BIN2                                        (305)
   7.  Turkish_BIN                                             (305)
   8.  Turkish_BIN2                                            (305)
*/
GO

Check for uppercase dotted “I”

DECLARE @SQL NVARCHAR(MAX) = N'DECLARE @Counter INT = 1,
        @Upper INT;
';

-- Append one test block per binary collation:
SELECT @SQL += REPLACE(N'
  SET @Upper = UNICODE(UPPER(N''i'' COLLATE {Name}));
  IF(@Upper <> UNICODE(''I''))
  BEGIN
    RAISERROR(N''%4d.  %-55s (%d)'', 10, 1, @Counter, N''{Name}'', @Upper)
      WITH NOWAIT;
    SET @Counter += 1;
  END;
', N'{Name}', col.[name]) + NCHAR(13) + NCHAR(10)
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]BIN%'
ORDER BY col.[name];

--PRINT @SQL;
EXEC (@SQL);
/*
   1.  Azeri_Cyrillic_100_BIN                                  (304)
   2.  Azeri_Cyrillic_100_BIN2                                 (304)
   3.  Azeri_Latin_100_BIN                                     (304)
   4.  Azeri_Latin_100_BIN2                                    (304)
   5.  Turkish_100_BIN                                         (304)
   6.  Turkish_100_BIN2                                        (304)
   7.  Turkish_BIN                                             (304)
   8.  Turkish_BIN2                                            (304)
*/
GO

There should not be any other differences between cultures for NVARCHAR data.
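If you want a quick sanity check of that claim, here is a small sketch: sorting a few characters under two different-culture “_BIN2” collations of the same version should produce identical orderings (the two rank columns should match on every row; the row order of the result set itself may vary):

SELECT v.[TheChar],
       ROW_NUMBER() OVER (ORDER BY v.[TheChar]
          COLLATE Latin1_General_100_BIN2) AS [Latin1Order],
       ROW_NUMBER() OVER (ORDER BY v.[TheChar]
          COLLATE Hebrew_100_BIN2) AS [HebrewOrder]
FROM   (VALUES (N'a'), (N'Ω'), (N'ש')) v([TheChar]);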

 

VARCHAR Data

For VARCHAR data, however, a difference in culture has a far greater impact. All collations have an associated code page that is the character set used for 8-bit (i.e. VARCHAR) data. Even if that code page is 0, meaning “no code page”, it still determines what characters can be stored in 8-bit datatypes. And in that case, it would be “none”, which is why these collations are called “Unicode-only”, and why they don’t work with VARCHAR data:

SELECT COLLATIONPROPERTY(N'Syriac_100_BIN2', 'CodePage');
-- 0

SELECT 'not gonna work, trust me ;-)' COLLATE Syriac_100_BIN2;
/*
Msg 459, Level 16, State 1, Line XXXXX
Collation 'Syriac_100_BIN2' is supported on Unicode data types
only and cannot be applied to char, varchar or text data types.
*/

No code page means that there is no 8-bit (i.e. VARCHAR) character set. Of course, the difference in culture will be more obvious if we look at a few non-zero code pages:

-- Which 8-bit code page is used for each of these collations:
SELECT COLLATIONPROPERTY(N'Latin1_General_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Hebrew_100_BIN2', 'CodePage'),
       COLLATIONPROPERTY(N'Korean_100_BIN2', 'CodePage');
-- 1252    1255    949

Now we will inject the same underlying values into each of those three code pages. Please note that we need to use values in the range of 128 – 255 (0x80 – 0xFF) since the 0 – 127 range (0x00 – 0x7F) is the same across all of the code pages that can be represented in SQL Server, including Unicode data (both UTF-16 via NVARCHAR and, starting in SQL Server 2019, UTF-8 via VARCHAR).

DECLARE @Data TABLE
(
  [CodePage-1252]  VARCHAR(50) COLLATE Latin1_General_100_BIN2,
  [CodePage-1255]  VARCHAR(50) COLLATE Hebrew_100_BIN2,
  [CodePage-949]   VARCHAR(50) COLLATE Korean_100_BIN2
);

INSERT INTO @Data VALUES (0xE1FA, 0xE1FA, 0xE1FA);
INSERT INTO @Data VALUES (0xE4E8, 0xE4E8, 0xE4E8);

SELECT * FROM @Data;
/*
CodePage-1252    CodePage-1255    CodePage-949
áú               בת               守
äè               הט               怏
*/
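And just to prove that all three columns really do contain the same underlying bytes (this must run in the same batch as the statements above since @Data is a table variable):

SELECT CONVERT(VARBINARY(10), [CodePage-1252]) AS [Bytes-1252],
       CONVERT(VARBINARY(10), [CodePage-1255]) AS [Bytes-1255],
       CONVERT(VARBINARY(10), [CodePage-949]) AS [Bytes-949]
FROM   @Data;
/*
Bytes-1252    Bytes-1255    Bytes-949
0xE1FA        0xE1FA        0xE1FA
0xE4E8        0xE4E8        0xE4E8
*/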

As you can see in the results, the exact same two bytes — 0xE1 and 0xFA — produce very different characters due to the code pages being different (code page 949 is a Double-Byte Character Set (DBCS) which is why there is only a single character returned for the Korean collation). And the same is true for the 0xE4 and 0xE8 bytes. This is the only reason for having more than two “_BIN2” collations for any given version. If not for needing to control the Code Page for VARCHAR data, all that would have been required is the following:

  1. A single, non-Turkish culture collation (Latin1_General would be fine) to handle the most common uppercase/lowercase mappings for dotted “i” and dotless “I”
  2. A single, Turkish culture collation (Turkish would be fine) to handle the less common uppercase/lowercase mappings for dotted “i” and dotless “I” found only in the Turkish, Azeri_Cyrillic, and Azeri_Latin cultures.

As it is, Microsoft could have simply provided a single UnicodeOnly_90_BIN2 collation and a single UnicodeOnly_100_BIN2 collation, instead of the 15 that exist across both of those versions, since the Code Page is the same across all of these:

SELECT ROW_NUMBER() OVER (
  PARTITION BY CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END
  ORDER BY CASE WHEN col.[name] LIKE N'%[_]90[_]%' THEN 1 ELSE 2 END,
           col.[name]
                         ),
       col.[name]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'%[_]BIN2'
AND    COLLATIONPROPERTY(col.[name], 'CodePage') = 0;

/*
1      Divehi_90_BIN2
2      Indic_General_90_BIN2
3      Syriac_90_BIN2

1      Assamese_100_BIN2
2      Bengali_100_BIN2
3      Divehi_100_BIN2
4      Indic_General_100_BIN2
5      Khmer_100_BIN2
6      Lao_100_BIN2
7      Maltese_100_BIN2
8      Maori_100_BIN2
9      Nepali_100_BIN2
10     Pashto_100_BIN2
11     Syriac_100_BIN2
12     Tibetan_100_BIN2
*/

This is also why the new set of UTF-8 collations, starting in SQL Server 2019, only has a single “_BIN2” collation: the Code Page is 65001 across all cultures (remember, Unicode is a singular, all-inclusive character set). And, there is only potential need for one additional UTF-8 “_BIN2” collation to handle the special Turkish / Azeri uppercase/lowercase mappings for “i” and “I”.
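We can confirm that single UTF-8 Code Page using the same method shown earlier (this requires SQL Server 2019 CTP 3.0 or newer, where the collation exists):

SELECT COLLATIONPROPERTY(N'Latin1_General_100_BIN2_UTF8', 'CodePage');
-- 65001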

 

Different Versions

From the names alone we can see that Latin1_General_100_BIN2 and Latin1_General_BIN2 are both “Latin1_General_*_BIN2” collations, and the difference between them is just “{no number ; implied 80}” vs “100” (i.e. the version). For VARCHAR data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.

For Unicode data (mainly NVARCHAR, but starting in SQL Server 2019 CTP 2.3, also VARCHAR if using the new UTF8_BIN2 collation, renamed to Latin1_General_100_BIN2_UTF8 in CTP 3.0), however, there is an important difference. All collations map code points to various properties such as: sort weights (used for sorting and comparisons), numerical values for “numbers”, uppercase and lowercase forms, etc. For example:

SELECT 1 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[3-5]';
-- {no rows returned}
SELECT 2 WHERE N'½' COLLATE Latin1_General_100_CI_AS LIKE N'[0-1]';
-- 2

But, it is those uppercase and lowercase mappings that are important here. There are quite a few code points that are missing these mappings, but as each new version of collations is introduced, more and more of the code points are having their mappings added. For example, the following shows a code point that is missing its uppercase mapping in both version 80 and 90 collations (if it’s missing in any particular version, then it’s missing in all prior versions as well, which is why I don’t need a field for the version 80 collation):

SELECT NCHAR(0x1FF3) AS [TheChar],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_90_BIN2) AS [v90],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x1FF3) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ῳ          ῳ      ῼ       ῼ
*/

And, the following shows a code point that is missing its uppercase mapping in version 80, 90, and 100 collations:

SELECT NCHAR(0x0250) AS [TheChar],
       UPPER(NCHAR(0x0250) COLLATE Japanese_90_BIN2) AS [v90],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_100_BIN2) AS [v100],
       UPPER(NCHAR(0x0250) COLLATE Japanese_XJIS_140_BIN2) AS [v140];
/*
TheChar    v90    v100    v140
ɐ          ɐ      ɐ       Ɐ
*/

Ok, so how many of these mappings are missing from each version? I’m glad you asked 😉

The only way we have to determine which mappings are missing is by looking at the differences between the oldest version (highest number of missing mappings) and newest version (least number of missing mappings) collations. Logically, this means that since there are four versions — 80, 90, 100, and 140 — the version 80 data will naturally be “all missing” while the version 140 data will naturally be “none missing”. There might be additional mappings missing from all versions, but a) we have no way to determine that now as that requires cross-referencing the official Unicode Character Database, which is possible but way out of scope, and b) any potential missing mappings aren’t relevant to the question of “what are the differences between the existing collations available in SQL Server”.

The query below finds all code points for which the LOWER function returns the same character that was passed in when using a version 80 collation, yet returns a different character when using a version 140 collation. For the version 90 and 100 collations, if the LOWER function returns a character that is different from what was passed in, the lowercase character is displayed, else it will be NULL. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no lowercase mapping for those characters.

;WITH nums AS
(
  SELECT TOP (65534) ROW_NUMBER() OVER(ORDER BY @@SPID) AS [num]
  FROM   master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2
), chars AS
(
SELECT nums.[num] AS [CodePoint],
       CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
       NCHAR(nums.[num]) AS [ TheChar ],
       LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
       '    ' AS [----],
       IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2)
             <> NCHAR(nums.[num]),
           LOWER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2),
           NULL) AS [v90],
       IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2)
             <> NCHAR(nums.[num]),
           LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2),
           NULL) AS [v100],
       IIF(LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
             <> NCHAR(nums.[num]),
           LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2),
           NULL) AS [v140]
FROM   nums
WHERE  LOWER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
         = NCHAR(nums.[num]) -- mapping not in version 80 collations
AND    LOWER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
         <> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:', 
       CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
       '    ' AS [----],
       CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;

/*
CodePoint    CodePointHex    TheChar    v80    --    v90    v100    v140
...
502          0x01F6          Ƕ          Ƕ            NULL   ƕ       ƕ
...
8498         0x2132          Ⅎ          Ⅎ            NULL   ⅎ       ⅎ
...
11374        0x2C6E          Ɱ          Ɱ            NULL   NULL    ɱ
...
42820        0xA744          Ꝅ          Ꝅ            NULL   NULL    ꝅ
...
NULL         NULL            TOTALS:    305          0      200     305
*/

I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 305 characters missing their lowercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 200 mappings, and the version 140 collations add another 105, for a total of 305 mappings that are missing from the version 80 and 90 collations.

Interesting, but we aren’t done yet. The lowercase mappings are just one direction. We need to include the uppercase mappings as well. And, while it’s tempting, we should not assume that the missing uppercase mappings are merely the uppercase versions of the characters that are missing their lowercase mappings. Each code point has an uppercase or lowercase mapping (or can have), and these are completely independent, so just because one entry was added doesn’t mean that any other mapping has been added, even if it seems like it should have been.

The query below finds all code points for which the UPPER function returns the same character that was passed in when using a version 80 collation, yet returns a different character when using a version 140 collation. For the version 90 and 100 collations, if the UPPER function returns a character that is different from what was passed in, the uppercase character is displayed, else it will be NULL. The final row is a summary line showing the totals. A field for the version 80 collations is included just to show clearly that there is no uppercase mapping for those characters.

;WITH nums AS
(
  SELECT TOP (65534) ROW_NUMBER() OVER(ORDER BY @@SPID) AS [num]
  FROM   master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2
), chars AS
(
SELECT nums.[num] AS [CodePoint],
       CONVERT(BINARY(2), nums.[num]) AS [CodePointHex],
       NCHAR(nums.[num]) AS [ TheChar ],
       UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2) AS [v80],
       '    ' AS [----],
       IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2)
             <> NCHAR(nums.[num]),
           UPPER(NCHAR(nums.[num]) COLLATE Japanese_90_BIN2),
           NULL) AS [v90],
       IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2)
             <> NCHAR(nums.[num]),
           UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_100_BIN2),
           NULL) AS [v100],
       IIF(UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
             <> NCHAR(nums.[num]),
           UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2),
           NULL) AS [v140]
FROM   nums
WHERE  UPPER(NCHAR(nums.[num]) COLLATE Japanese_BIN2)
         = NCHAR(nums.[num]) -- mapping not in version 80 collations
AND    UPPER(NCHAR(nums.[num]) COLLATE Japanese_XJIS_140_BIN2)
         <> NCHAR(nums.[num]) -- mapping in version 140 collations
)
SELECT * FROM chars
UNION ALL
SELECT NULL, NULL, N'TOTALS:', 
       CONVERT(NVARCHAR(5), COUNT(chars.[v80])),
       '    ' AS [----],
       CONVERT(NVARCHAR(5), COUNT(chars.[v90])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v100])),
       CONVERT(NVARCHAR(5), COUNT(chars.[v140]))
FROM chars;

/*
CodePoint    CodePointHex    TheChar    v80    --    v90    v100    v140
...
1165         0x048D          ҍ          ҍ            NULL   Ҍ       Ҍ
...
8097         0x1FA1          ᾡ          ᾡ            NULL   ᾩ       ᾩ
...
42805        0xA735          ꜵ          ꜵ            NULL   NULL    Ꜵ
...
42849        0xA761          ꝡ          ꝡ            NULL   NULL    Ꝡ
...
NULL         NULL            TOTALS:    344          0      238     344
*/

I have included a random(ish) sampling of 4 rows from the result set above, plus the final / summary line. The summary shows that there are 344 characters missing their uppercase mapping in the version 80 collations. The version 90 collations add no mappings (sad). But, the version 100 collations add 238 mappings, and the version 140 collations add another 106, for a total of 344 mappings that are missing from the version 80 and 90 collations.

Putting the data from both queries together, the combined totals are:

  • Version 80: missing 649 mappings (305 + 344)
  • Version 90: missing 649 mappings (305 + 344)
  • Version 100: added 438 mappings (200 + 238) over versions 80 and 90 ; only missing 211 mappings (105 + 106, or 649 – 438)
  • Version 140: added 211 mappings (105 + 106) over version 100 ; added all 649 over versions 80 and 90

This data shows why:

  1. if possible, it is important to use the newest version available (version 100 was introduced in SQL Server 2008, 140 was introduced in SQL Server 2017)! Please keep in mind that these mappings are just one difference. There are also many thousands of sort weights missing from the earlier versions (almost 20k missing from version 80, almost 13k missing from version 90, and almost 5k missing from version 100).
  2. the Latin1_General_100_BIN2_UTF8 collation, new in CTP 3.0 of SQL Server 2019, is better than the version 80 UTF8_BIN2 collation that it replaces (UTF8_BIN2 was introduced in CTP 2.3), and is much, much better than having no binary UTF-8 collation at all (which was the case prior to CTP 2.3). However, even being a version 100 collation, it’s still missing 211 uppercase / lowercase mappings 😿 (though that’s certainly better than missing all 649 of them!)

 

Different Binary Comparison Types

From the names alone we can see that Latin1_General_100_BIN2 and Latin1_General_100_BIN are both “Latin1_General_100” collations, and the difference between them is just “_BIN” vs “_BIN2” (i.e. the binary comparison type). For non-UTF-8 VARCHAR data this difference is entirely meaningless: the behavior of all binary collations of the same culture is exactly the same.

For Unicode data (mainly NVARCHAR, but starting in SQL Server 2019 CTP 2.3, also VARCHAR if using the new UTF8_BIN2 collation, renamed to Latin1_General_100_BIN2_UTF8 in CTP 3.0), however, there is an important difference. When comparing values, the behavior of all binary collations is exactly the same. So, what we are interested in is sorting.

According to the Microsoft documentation page for Collation and Unicode Support:

In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)

That means: the behavior is exactly the same for the first character only. So, if you are testing the difference between “_BIN” and “_BIN2” by sorting a list of single characters, then you will never see a difference. In order to see the difference in behavior, you need to skip the first character. The following test shows this by providing both a field with a single character, and a field with a prefix character that is the same for both rows, cancelling out the effect of the first character on the sorting:

SELECT tab.[ID],
       tab.[col] AS [TheChar],
       N'_' + tab.[col] AS [PrefixedChar],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
          COLLATE Latin1_General_100_BIN) AS [TheCharBIN],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
          COLLATE Latin1_General_100_BIN) AS [PrefixedBIN],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
          COLLATE Latin1_General_100_BIN2) AS [TheCharBIN2],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
          COLLATE Latin1_General_100_BIN2) AS [PrefixedBIN2]
FROM   (VALUES (1, NCHAR(0x1225)), (2, NCHAR(0x2113))) tab([ID], [col])
ORDER BY tab.[ID];
/*
ID   TheChar   PrefixedChar   --   TheCharBIN   PrefixedBIN   --   TheCharBIN2   PrefixedBIN2
1    ሥ         _ሥ             --   1            2             --   1             1
2    ℓ         _ℓ             --   2            1             --   2             2
*/

The “PrefixedBIN” field shows a difference because starting at byte #3, the “_BIN” collation sees 0x25 and 0x13 (due to the byte-swapping of the Little Endian encoding), while the “_BIN2” collation sees 0x12 and 0x21. If that is not clear, maybe the following chart will help. And keep in mind we are working with the “Prefixed” value that effectively skips the first character since there is no sorting difference with that first position. The chart below shows the character and its underlying bytes starting at character position / index #2.

CodePoint   Character   UTF-16 LE   BIN ordering     BIN2 ordering
U+1225      ሥ           0x2512      0x25 then 0x12   0x12 then 0x25
U+2113      ℓ           0x1321      0x13 then 0x21   0x21 then 0x13
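If you want to see those bytes for yourself, converting the prefixed values to VARBINARY exposes the UTF-16 LE byte sequences that the “_BIN” comparison walks through after the first character (a quick sanity check):

SELECT CONVERT(VARBINARY(8), N'_' + NCHAR(0x1225)) AS [Prefixed-1225],
       CONVERT(VARBINARY(8), N'_' + NCHAR(0x2113)) AS [Prefixed-2113];
-- 0x5F002512    0x5F002113   (byte #3 is 0x25 vs 0x13, matching the "_BIN" order above)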

 

Supplementary Characters

This next query shows how Supplementary Characters are handled across “_BIN” and “_BIN2” collations, as well as non-binary collations of version 90 or higher:

SELECT tab.[ID],
       tab.[col] AS [TheChar],
       N'_' + tab.[col] AS [PrefixedChar],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
          COLLATE Latin1_General_100_BIN) AS [Char_BIN],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
          COLLATE Latin1_General_100_BIN) AS [Prefixed_BIN],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
          COLLATE Latin1_General_100_BIN2) AS [Char_BIN2],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
          COLLATE Latin1_General_100_BIN2) AS [Prefixed_BIN2],
       '---' AS [---],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
          COLLATE Japanese_90_CI_AS) AS [Char_CI_AS],
       ROW_NUMBER() OVER (ORDER BY N'_' + tab.[col]
          COLLATE Japanese_90_CI_AS) AS [Prefixed_CI_AS]
FROM   (VALUES (1, NCHAR(0xFB06)), (2, NCHAR(0xD802) + NCHAR(0xDC2C)),
               (3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];

The query above returns the following 3 rows:

ID   TheChar   PrefixedChar   ---   Char_BIN   Prefixed_BIN   ---   Char_BIN2   Prefixed_BIN2   ---   Char_CI_AS   Prefixed_CI_AS
1    ﬆ         _ﬆ             ---   3          2              ---   3           3               ---   1            1
2    𐠬         _𐠬             ---   1          1              ---   1           1               ---   2            2
3    🙃        _🙃            ---   2          3              ---   2           2               ---   3            3

The reason for including this query is to highlight two related nuances of working with UTF-16 (i.e. the Unicode encoding used by NVARCHAR data): one of terminology, and the other of behavior.

Looking at the order number columns (the six right-most columns), we can observe the following:

  1. Both the prefixed and non-prefixed columns for “_BIN2” sorted in the same order.
  2. The non-prefixed characters for both “_BIN” and “_BIN2” sorted in the same order.
  3. The non-prefixed and prefixed characters for “_BIN” did not sort in the same order.

So far, all of those behaviors are expected and consistent with what we have seen previously in this post.

But, the non-binary, version 90 collation columns (the right-most two columns) show an ordering that is both:

  • the same between the non-prefixed and prefixed columns, yet
  • different than either of the two other sort orders seen in the columns to the left

This difference in sort order exposes a technicality that is not often mentioned, probably because most of the time there is little practical benefit in being this technical. By now you are probably aware that the underlying values used in Unicode to uniquely identify each character are called “code points”. A code point can be encoded in various ways. The Unicode specification provides five options for encoding: UTF-8, UTF-16 Big Endian, UTF-16 Little Endian, UTF-32 Big Endian, and UTF-32 Little Endian (I am leaving out the non-Endian variations of UTF-16 and UTF-32 since they merely imply Big Endian). UTF-8 encodes each code point as a single entity of between 1 and 4 bytes. UTF-32 encodes each code point as a single entity of 4 bytes.

In contrast, UTF-16 encodes each code point as either one or two 2-byte entities. For the first 65,536 code points (the total number of combinations of 2 bytes: 0x0000 – 0xFFFF), everything said so far still holds true. But, things get a little confusing when dealing with Supplementary Characters (i.e. all characters outside of the initial 65,536). Supplementary Characters, like all other characters, each have a unique code point. In UTF-8 and UTF-32 there is no difference, but in UTF-16 each Supplementary Character is comprised of two non-Supplementary Characters (i.e. a “surrogate pair”). So what does the term “code point” refer to? Does it refer to the actual Supplementary Character, or does it refer to each of the two-byte pieces that make up the Supplementary Character? It can’t be both. And it isn’t.

In UTF-16, the two-byte entities are actually known as “code units”. And, it just so happens that the first 65,536 code points are each made up of a single code unit having the same value (e.g. code point U+FB06 is code unit FB06). Supplementary Characters are code points comprised of two code units (e.g. code point U+1F643 is comprised of code units D83D and DE43).
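A small demonstration of code units in practice (a sketch: DATALENGTH counts bytes, and converting NVARCHAR to VARBINARY exposes the UTF-16 LE storage):

SELECT DATALENGTH(NCHAR(0xD83D) + NCHAR(0xDE43)) AS [TotalBytes],
       CONVERT(VARBINARY(4), NCHAR(0xD83D) + NCHAR(0xDE43)) AS [UTF16LE];
-- 4    0x3DD843DE   (two 2-byte code units, each byte-swapped on disk)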

Looking back at the Microsoft documentation quote at the beginning of this section, we can now see that the statement, “In a BIN2 collation all characters are sorted according to their code points”, is not entirely accurate. Sure, for non-Supplementary Characters that is true, but dealing with Supplementary Characters shows us that the definition of BIN2 collations is technically incorrect. The BIN2 collations, when dealing with NVARCHAR data, sort by code unit, not by code point.

Am I being unfair, or overly nit-picky? No, because the query and results above show that there are three ways of sorting these values, and it is not possible to explain this behavior without understanding the difference between code unit and code point. Hopefully the chart below will help visualize the difference in these three sorting methods. This chart shows the same three characters in the same progression as the results above: BIN, then BIN2, and then non-binary-version-90-or-newer. But, this time I’m showing how the sorting algorithm saw each character, which will help explain why each approach did something different.

  • Binary (_BIN): this algorithm goes byte-by-byte (after the first character, so after the first two bytes). And, since we are working on a Little Endian platform, the bytes of each base entity are reversed. A UTF-16 entity (i.e. code unit) is always 2 bytes. The value in parentheses in the “Binary” column is the reversed first code unit, which is why the first byte in the first row is 0x02 instead of 0xD8, and the first byte in the second row is 0x06 instead of 0xFB.
  • Code Unit (_BIN2): this algorithm goes code unit-by-code unit. A code unit’s value stays the same no matter how it is encoded, so the number of bytes and their physical order do not matter. There is no need to show a different representation in parentheses in the “CodeUnit” column because D802 is D802, even if that code unit is stored as 0x02D8 on disk.
  • Code Point (non-binary-version-90-or-newer): this algorithm goes code point-by-code point (for Supplementary Characters only; otherwise non-binary collations use culture-specific linguistic rules). The value in parentheses in the “CodePoint” column is the actual code point, and that is the value used for sorting (not the bytes or the code units).
Binary (_BIN)       CodeUnit (_BIN2)   CodePoint (_90_CI_AS)
0xD802DC2C (02D8)   0xD802DC2C         0xFB06 (U+FB06)
0xFB06 (06FB)       0xD83DDE43         0xD802DC2C (U+1082C)
0xD83DDE43 (3DD8)   0xFB06             0xD83DDE43 (U+1F643)

Given that the “_BIN2” collations are described as being “code point” ordered yet are actually “code unit” ordered, which behavior is correct? According to the following quote from the official Unicode standard (i.e. the specification) in Chapter 2: General Structure / Section 2.5: Encoding Forms (page 39):

All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order.

it would appear that the behavior is correct, and the documentation should probably be updated.

 

Different Unicode Encodings

Starting in SQL Server 2019 it’s possible to use the UTF-8 encoding. This difference in encoding is actually a special case because it relates to two of the previously mentioned sections:

  1. Different Cultures
  2. Different Binary Comparison Types

As of SQL Server 2019 CTP 3.0, the single binary UTF-8 collation is: Latin1_General_100_BIN2_UTF8. The reason that there’s only one binary UTF-8 collation is that the Code Page is the same across all UTF-8 collations: 65001. Without a change in Code Page value across cultures, there is absolutely no difference between binary collations of different cultures, with the one exception being the uppercase/lowercase mappings for “i” and “I” for the Turkish and Azeri cultures. Someday there might be a Turkish_100_BIN2_UTF8 collation to handle those less common mappings, but there wouldn’t need to be any more than those two binary collations for UTF-8.

The other interesting aspect of the binary UTF-8 collation relates to sorting. In terms of the three different sorting styles noted just above (i.e. “Binary”, “Code Unit”, and “Code Point”), Latin1_General_100_BIN2_UTF8 is, as its name indicates, a “_BIN2” collation. However, UTF-8 does not have surrogate pairs like UTF-16. Hence, it can do true “Code Point” sorting across all 1.1 million Unicode code points.

The Unicode specification (same section / page as quoted directly above) states:

A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings.

We can test with the following query:

SELECT tab.[ID],
       tab.[col] AS [TheCharNVC],
       CONVERT(VARCHAR(10), tab.[col] COLLATE Latin1_General_100_BIN2_UTF8)
         AS [TheCharVC],
       '--' AS [--],
       ROW_NUMBER() OVER (ORDER BY tab.[col]
         COLLATE Latin1_General_100_BIN2_UTF8) AS [CharNVC_BIN2_UTF8],
       ROW_NUMBER() OVER (ORDER BY CONVERT(VARCHAR(10), tab.[col]
         COLLATE Latin1_General_100_BIN2_UTF8)) AS [CharVC_BIN2_UTF8]
FROM   (VALUES (1, NCHAR(0xFB06)),
               (2, NCHAR(0xD802) + NCHAR(0xDC2C)),
               (3, NCHAR(0xD83D) + NCHAR(0xDE43))) tab([ID], [col])
ORDER BY tab.[ID];

which returns:

ID   TheCharNVC   TheCharVC   --   CharNVC_BIN2_UTF8   CharVC_BIN2_UTF8
     (UTF-16)     (UTF-8)          (UTF-16)            (UTF-8)
1    ﬆ            ﬆ           --   3                   1
2    𐠬            𐠬           --   1                   2
3    🙃           🙃          --   2                   3

The NVARCHAR columns show us that we are dealing with the same three characters between the UTF-8 and UTF-16 columns, that the ordering is consistent with what we saw previously for NVARCHAR data, and that the ordering is different than the UTF-8 column. We are now interested in UTF-8 behavior, and that can only be found when working with VARCHAR data (and using a UTF8 collation, of course). The chart above shows that yes, the Latin1_General_100_BIN2_UTF8 collation (as of SQL Server 2019 CTP 3.0) does indeed use the code point value for ordering.
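For completeness, we can view the actual UTF-8 bytes (using the same conversion pattern as in the query above) to confirm that byte order here really is code point order:

SELECT CONVERT(VARBINARY(4), CONVERT(VARCHAR(10),
         NCHAR(0xFB06) COLLATE Latin1_General_100_BIN2_UTF8)) AS [U+FB06],
       CONVERT(VARBINARY(4), CONVERT(VARCHAR(10),
         (NCHAR(0xD802) + NCHAR(0xDC2C)) COLLATE Latin1_General_100_BIN2_UTF8)) AS [U+1082C],
       CONVERT(VARBINARY(4), CONVERT(VARCHAR(10),
         (NCHAR(0xD83D) + NCHAR(0xDE43)) COLLATE Latin1_General_100_BIN2_UTF8)) AS [U+1F643];
-- 0xEFAC86    0xF090A0AC    0xF09F9983   (ascending byte order = ascending code point order)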
