No, Binary Collations are not Case-Sensitive

(last updated: 2019-06-03 @ 16:00 EST / 2019-06-03 @ 20:00 UTC )

Quite often people will use, or will recommend using, a binary Collation (one ending in “_BIN” or “_BIN2“) when wanting to do a case-sensitive operation. While in many cases it appears to behave as expected, it is best to not use a binary Collation for this purpose. The problem with using binary Collations to achieve case-sensitivity is that they have no concept of linguistic rules and cannot equate different versions of characters that should be considered equal. And the reason why using a binary Collation often appears to work correctly is simply the result of working with a set of characters that has no accents or other versions. One such character set (a common one, hence the confusion), is US English (i.e. “A” – “Z” and “a” – “z”; values 65 – 90 and 97 – 122, respectively). However, there are a few areas where binary collations don’t behave as many (most, perhaps?) people expect them to.

Combining Characters

Combining characters are marks added to base characters for various reasons, such as adding accent marks. For more information on these, and to see groups of them, please see:

Unicode includes both the ability to create accented characters as well as the original accented characters from the various 8-bit code pages where the character is a pre-combined base letter with the accent mark. The query below shows that Unicode Code Point U+00DC (decimal value = 220) is the pre-combined character, whereas Code Point U+0308 is merely the two dots that go above some letters (a.k.a. diaeresis) that combines with a regular, non-accented “U” to form the same accented character represented by Code Point U+00DC.

SELECT NCHAR(252) AS [NCHAR(252)], -- 0x00FC         -- ü
       NCHAR(220) AS [NCHAR(220)], -- 0x00DC         -- Ü
       NCHAR(0x0308) AS [NCHAR(0x0308)],             -- ̈ (combining diaeresis)
       N'U' + NCHAR(0x0308) AS [U + NCHAR(0x0308)];  -- Ü

ResultsGrid-TestCharacters

Outside of the slight font difference in the above output, the two capital “Ü”s really are the same character. Users do not care if that accented “U” is a single Code Point or two Code Points or 32 Code Points (if that were even possible). If a user is searching for text and enters in Code Point U+00DC, then if a string containing “U” followed by Code Point U+0308 isn’t returned, that would appear to be a bug. If the filter is supposed to be case-insensitive, then the user submitting a pre-combined lower-case “u” with the accent needs to also return that same “U” followed by Code Point U+0308, as the following example shows:

SELECT NCHAR(0x00FC) + N' vs. U' + NCHAR(0x0308) AS [U+00FC vs "U" + U+0308],
       CASE
         WHEN NCHAR(0x00FC) = N'U' + NCHAR(0x0308)
                                COLLATE Latin1_General_100_CI_AS THEN 'Same!'
         ELSE 'Different'
       END AS [Case-INsensitive];

ResultsGrid-Combining-CaseInsensitive

But, if the filter is case-sensitive, then that pre-combined lower-case “u” with the accent shouldn’t match a “U” followed by Code Point U+0308, as the following example shows:

SELECT NCHAR(0x00FC) + N' vs. U' + NCHAR(0x0308) AS [U+00FC vs "U" + U+0308],
       CASE
         WHEN NCHAR(0x00FC) = N'U' + NCHAR(0x0308)
                                COLLATE Latin1_General_100_CS_AS THEN 'Same!'
         ELSE 'Different'
       END AS [Case-Sensitive];

ResultsGrid-Combining-CaseSensitive-Different

Of course, if the filter is case-sensitive then the user submitting the pre-combined upper-case “U” with the accent needs to match a “U” followed by Code Point U+0308, as the following example shows:

SELECT NCHAR(0x00DC) + N' vs. U' + NCHAR(0x0308) AS [U+00DC vs "U" + U+0308],
       CASE
         WHEN NCHAR(0x00DC) = N'U' + NCHAR(0x0308)
                                COLLATE Latin1_General_100_CS_AS THEN 'Same!'
         ELSE 'Different'
       END AS [Case-Sensitive];

ResultsGrid-Combining-CaseSensitive-Same

And that brings us to one of the problems with binary Collations: they naturally only look at each individual byte (or Code Point) and have no concept that some combinations of Code Points really should equate to one or more other Code Points. The following example shows that was previously considered the same is now seen as different:

SELECT NCHAR(0x00DC) + N' vs. U' + NCHAR(0x0308) AS [U+00DC vs "U" + U+0308],
       CASE
         WHEN NCHAR(0x00DC) = N'U' + NCHAR(0x0308)
                                COLLATE Latin1_General_100_BIN2 THEN 'Same!'
         ELSE 'Different'
       END AS [Binary];

ResultsGrid-Combining-Binary

Fullwidth Characters

In Unicode and at least some (maybe all) double-byte character sets, the US English alphabet (26 letters, both upper- and lower- case) is duplicated, along with 0 – 9 and some punctuation, so that each one can take up twice as much space (side-to-side; height is still the same) and thereby align better with certain East Asian characters. For example:

SELECT tab.[col], LEN(tab.[col]) AS [Length]
FROM   (VALUES (N'=oo=') /* 2 standard "o"s */,
               (N'=o=') /* 1 full width "o" */) tab(col);

returns:

ResultsGrid-FullWidth-TestCharacters

As you can see, the fullwidth “o” takes up twice as much horizontal space as the regular / normal / standard “o”. You can read more about them and see the entire Unicode block here: Halfwidth and Fullwidth Forms.

One note before proceeding: not all fonts have the fullwidth characters, and some that do have these characters do not have the combining characters. The following list of fonts are the ones that I found to have both.

  • MS Gothic
  • Unifont
  • Yu Gothic
  • Yu Gothic UI
  • Yu Mincho

The Unifont font had to be downloaded and installed, but the others, I assume, came with Windows 10. For the fullwidth characters, I am using the “Yu Gothic” font. The font used for all other examples on this page used the “Noto Sans” font, but it didn’t contain the fullwidth characters so I needed something else. Unifont, while a very cool project to represent all assigned Code Points in the Unicode Basic Multilingual Plane (BMP), has a rather poor implementation of the fullwidth characters.

The following query shows that a lower-case regular “o” also matches the fullwidth lower-case “o”, but not the upper-case regular or fullwidth “O”s as it is a case-sensitive Collation:

SELECT tab.[CodePoint], NCHAR(tab.[CodePoint]) AS [Character],
       N'b'  + NCHAR(tab.[CodePoint]) + N'b' AS [BOB]
FROM   (VALUES (0x004F) /* O */, (0xFF2F) /* O */, (0x006F) /* o */,
               (0xFF4F) /* o */) tab(CodePoint)
WHERE  NCHAR(tab.[CodePoint]) = NCHAR(0x006F) COLLATE Latin1_General_100_CS_AS;

ResultsGrid-FullWidth-CaseSensitive

The following query shows that a lower-case regular “o” only matches itself, and not the fullwidth lower-case “o”, when using a binary Collation, since they are two different Code Points (even if that difference is only for display purposes):

SELECT tab.[CodePoint], NCHAR(tab.[CodePoint]) AS [Character],
       N'b'  + NCHAR(tab.[CodePoint]) + N'b' AS [BOB]
FROM   (VALUES (0x004F) /* O */, (0xFF2F) /* O */, (0x006F) /* o */,
               (0xFF4F) /* o */) tab(CodePoint)
WHERE  NCHAR(tab.[CodePoint]) = NCHAR(0x006F) COLLATE Latin1_General_100_BIN2;

ResultsGrid-FullWidth-Binary

Of course, there are times when one needs to distinguish between regular and fullwidth characters. In such situations, you should first consider using the width-sensitive version of your Collation (i.e. it will have _WS in its name). Given that we are already using Latin1_General_100_CS_AS, we can just change that to be Latin1_General_100_CS_AS_WS and it will, for the simple examples above, return the same single row that the binary Collation does. However, unlike the binary Collation, it can still work correctly with combining characters, and it will work in the case-insensitive versions of that collation (e.g. “o” matches “o” and “O” but not the fullwidth versions of them).

Accent Insensitivity

In SQL Server it is possible to use a case-sensitive yet accent-insensitive Collation. The following query shows that when using such a Collation, several characters match a lower-case “o” yet the upper-case versions of them do not:

SELECT tab.[CodePoint], NCHAR(tab.[CodePoint]) AS [Character]
FROM   (VALUES (0x006F) /* o */, (0x00F4) /* ô */, (0x00F8) /* ø */,
               (0x014F) /* ŏ */, (0x01A3) /* ƣ */, (0x022D) /* ȭ */,
               (0x0277) /* ɷ */, (0x1E53) /* ṓ */, (0x24AA) /* ⒪ */,
               (0x24DE) /* ⓞ */, (0x00D4) /* Ô */) tab(CodePoint)
WHERE  NCHAR(tab.[CodePoint]) = NCHAR(0x006F) COLLATE Latin1_General_100_CS_AI;

ResultsGrid-AccentInsensitivity-CSAI

The following query shows that the binary Collation does not allow for such matches:

SELECT tab.[CodePoint], NCHAR(tab.[CodePoint]) AS [Character]
FROM   (VALUES (0x006F) /* o */, (0x00F4) /* ô */, (0x00F8) /* ø */,
               (0x014F) /* ŏ */, (0x01A3) /* ƣ */, (0x022D) /* ȭ */,
               (0x0277) /* ɷ */, (0x1E53) /* ṓ */, (0x24AA) /* ⒪ */,
               (0x24DE) /* ⓞ */, (0x00D4) /* Ô */) tab(CodePoint)
WHERE  NCHAR(tab.[CodePoint]) = NCHAR(0x006F) COLLATE Latin1_General_100_BIN2;

ResultsGrid-AccentInsensitivity-BIN2

Sorting

Binary collations are used typically used for comparisons and not so much for sorting, and the following examples illustrate why.

The following query, without any ORDER BY, most likely has the same order as it is defined in the VALUES list:

SELECT tab.col AS [Character], ASCII(tab.col) AS [Value]
FROM   (VALUES ('a'), ('B'), ('Z'), ('y'), ('Y'), ('ñ'),
        ('A'), ('z'), ('~'), ('b')) tab(col);

ResultsGrid-Sort-NoOrder

The following query uses a case-sensitive Windows Collation:

SELECT tab.col AS [Character], ASCII(tab.col) AS [Value]
FROM   (VALUES ('a'), ('B'), ('Z'), ('y'), ('Y'), ('ñ'),
        ('A'), ('z'), ('~'), ('b')) tab(col)
ORDER BY tab.[col] COLLATE Latin1_General_100_CS_AS;

ResultsGrid-Sort-CaseSensitive-Windows

Please take note of the following from the results above:

  1. The tilde “~” sorts first.
  2. The letters — upper-case and lower-case versions — are grouped together.
  3. The lower-case versions of each letter sort before the upper-case versions.
  4. The “ñ” sorts in the middle.

The following query uses a case-sensitive SQL Server Collation, which behaves slightly differently than the Windows Collation:

SELECT tab.col AS [Character], ASCII(tab.col) AS [Value]
FROM   (VALUES ('a'), ('B'), ('Z'), ('y'), ('Y'), ('ñ'),
        ('A'), ('z'), ('~'), ('b')) tab(col)
ORDER BY tab.[col] COLLATE SQL_Latin1_General_CP1_CS_AS;

ResultsGrid-Sort-CaseSensitive-SQLServer

Please take note of the following from the results above:

  1. The tilde “~” sorts first.
  2. The letters — upper-case and lower-case versions — are grouped together.
  3. The upper-case versions of each letter sort before the lower-case versions.
  4. The “ñ” sorts in the middle.

#3 is different between the SQL Server Collation and the Windows Collation.

The following query uses a binary Collation, which again behaves differently, but even more so than the last one:

SELECT tab.col AS [Character], ASCII(tab.col) AS [Value]
FROM   (VALUES ('a'), ('B'), ('Z'), ('y'), ('Y'), ('ñ'),
        ('A'), ('z'), ('~'), ('b')) tab(col)
ORDER BY tab.[col] COLLATE Latin1_General_100_BIN2;

ResultsGrid-Sort-Binary-Windows

Please take note of the following from the results above:

  1. Unlike before, the tilde “~” now sorts after the US English letters, but before any other letters.
  2. Unlike before, the upper-case and lower-case versions are not grouped together.
  3. Unlike before, the “ñ” sorts last, not even mixed in with other letters of the same case.

All of the differences noted above are due to the ordering in a binary Collation being based entirely upon the value and nothing else.

Expansions and Contractions

Expansions

“Expansion” is when a single character (Code Point) can be broken-out into, and equate to, multiple other characters, such as “æ” into “a” and “e”. In the example below, the all lowercase “n” and “j” do not match the single character “Nj” (which has an uppercase “N”). But, the uppercase “N” followed by a lowercase “j” does match as it’s the same casing and this is a case-sensitive collation. That same casing, however, cannot match when using a binary collation as the underlying bytes are different.

Please note that in the example below, the “N” and “j” are concatenated only to more clearly show that they are indeed two separate characters. It is not a special syntax, and is functionally the same as “Nj” together in a single string literal.

SELECT NCHAR(0x01CB)
WHERE NCHAR(0x01CB) = N'n' + N'j' COLLATE Latin1_General_100_CS_AS;
-- empty result set / 0 rows returned

SELECT NCHAR(0x01CB)
WHERE NCHAR(0x01CB) = N'N' + N'j' COLLATE Latin1_General_100_CS_AS;
-- Nj

SELECT NCHAR(0x01CB)
WHERE NCHAR(0x01CB) = N'N' + N'j' COLLATE Latin1_General_100_BIN2;
-- empty result set / 0 rows returned

Contractions

“Contraction” is when a multi-character sequence is treated as being a separate, individual entity. For example, the “CH” and “LL” combinations in Spanish (Traditional, not Modern):

SELECT tab.*
FROM   (VALUES (1, N'CG'), (2, N'CH'), (3, N'CI'), (4, N'CZ'),
            (5, N'DA'), (6, N'DG'), (7, N'DH'), (8, N'DI')) tab(ord, col)
ORDER BY tab.[col] COLLATE Traditional_Spanish_100_CS_AS
/*
ord    col
1      CG
3      CI
4      CZ
2      CH  <- out of sequence for English, but IN sequence for Spanish
5      DA
6      DG
7      DH
8      DI
*/


SELECT tab.*
FROM   (VALUES (1, N'CG'), (2, N'CH'), (3, N'CI'), (4, N'CZ'),
            (5, N'DA'), (6, N'DG'), (7, N'DH'), (8, N'DI')) tab(ord, col)
ORDER BY tab.[col] COLLATE Traditional_Spanish_100_BIN2
/*
ord    col
1      CG
2      CH <- in byte / code unit order
3      CI
4      CZ
5      DA
6      DG
7      DH
8      DI
*/

Not only do contractions sort differently than their individual characters would, but they also compare as combinations and not as their individual characters. For example, in Vietnamese collations, the “p” and “h” characters individually behave as separate characters (as you would expect). But, when “p” is followed by “h”, then they are no longer “p” or “h”, but only the combination of “ph”:

SELECT CHARINDEX(N'p', N'E.T. phone home.' COLLATE Vietnamese_100_CS_AS);
-- 0

SELECT CHARINDEX(N'h', N'E.T. phone home.' COLLATE Vietnamese_100_CS_AS);
-- 12

SELECT CHARINDEX(N'ph', N'E.T. phone home.' COLLATE Vietnamese_100_CS_AS);
-- 6

SELECT CHARINDEX(N'PH', N'E.T. phone home.' COLLATE Vietnamese_100_CS_AS);
-- 0



SELECT CHARINDEX(N'p', N'E.T. phone home.' COLLATE Vietnamese_100_BIN2);
-- 6

SELECT CHARINDEX(N'h', N'E.T. phone home.' COLLATE Vietnamese_100_BIN2);
-- 7

 

Conclusion

While there are certainly scenarios in which a binary Collation is appropriate, or even necessary, if the desire is to have case-sensitivity, then your first attempt should be to use a case-sensitive Collation. Only use a binary Collation to achieve case-sensitivity if you encounter one of the few edge cases where a case-sensitive Collation doesn’t filter out everything that it should.

3 thoughts on “No, Binary Collations are not Case-Sensitive”

  1. […] When needing to know if two values are 100% identical, you need a binary Collation (such as when auditing for changes, etc). Even if all characters had correct sort weights, you still wouldn’t be able to differentiate between a base character plus a combining diacritic and the pre-composed single character equivalent. Nor would you be able to differentiate between a base character plus multiple combining diacritics and the same characters with a different order for the diacritics. In both of those cases, the visual representation of the “character” is the same, but the number or order of the underlying Code Points would be different. The first item from this post provides a working example: No, Binary Collations are not Case-Sensitive […]

Leave a Reply to Binary Collation Case-Sensitivity – Curated SQLCancel reply