(last updated: 2021-06-04 @ 14:55 EST / 2021-06-04 @ 18:55 UTC )

The Problem

I do a fair bit of research and writing about collations, encodings, Unicode, etc and have found that in order to do thorough research, I often need to make use of non-standard-ASCII characters. Sometimes this is as easy as doing:

SELECT NCHAR(0xHHHH);

where HHHH are 4 hex digits (i.e. 0-9 and A-F). Of course, while easy, it does take some playing around with different combinations before finding something interesting to work with.

This gets even more difficult with Supplementary Characters. These are characters with code points above the U+FFFF / 65,535 range (meaning, they are beyond the max 2-byte range). Ideally, you would just pass in that higher value to the NCHAR() function. Unicode supports mapping up to 1,114,112 (17 “planes” of 65,536) code points (i.e. characters), though only around 144k ¹ have been allocated so far (as of version 13.0, published on 2020-03-10). So, just pick a value of 65536 (first code point is 0, not 1) or higher, right? Well, there’s a catch: SQL Server’s Unicode support was built around UCS-2 ², which maps the exact same 65,536 initial code points as UTF-16. The difference is that UTF-16 has supplementary characters (the remaining 1,048,576 code points), which are comprised of two code points from the initial / lower 65,536 code points. These combinations are known as surrogate pairs, and there are 2048 code points reserved for this purpose. Both UCS-2 and UTF-16 contain the same set of 2048 surrogate code points, but UTF-16 actually uses them while in UCS-2 they are merely reserved for future use.

Getting back to the question… Why can’t we just do:

SELECT NCHAR(0x12345);

? It all depends on collation. The first several versions of collations operate from a UCS-2 perspective where there are no supplementary characters or surrogate pairs. Thankfully, collations were added in SQL Server 2012 that support supplementary characters, and their names end with "_SC" (“SC” for “Supplementary Characters”, if that wasn’t obvious). More collations were added in SQL Server 2017, and all of them except for the binary collations support supplementary characters. But, their names don’t end in "_SC" as it’s no longer a separate option that needs to be selected; starting with these newest collations, support for supplementary characters is assumed. Now you just look for "_140_" in the name (again, excluding the binary collations).

Problem solved? Not quite. For most of the built-in string functions you can use the COLLATE clause to force a supplementary character-aware collation:

DECLARE @Thingy NVARCHAR(10) = N'𒅑';
SELECT @Thingy AS [U+12151],
       LEN(@Thingy COLLATE Latin1_General_100_CI_AS) AS [Latin1_non-SC],
       LEN(@Thingy COLLATE Latin1_General_100_CI_AS_SC) AS [Latin1_SC],
       LEN(@Thingy COLLATE Japanese_XJIS_140_CI_AS) AS [140_non-binary],
       LEN(@Thingy COLLATE Japanese_XJIS_140_BIN2) AS [140_binary];
/*
U+12151    Latin1_non-SC    Latin1_SC    140_non-binary    140_binary
𒅑      2               1            1                  2
*/

As you can see in the output shown above, the supplementary character-aware collations return a length of 1, while the other collations return a length of 2 because they are only able to see the surrogate pair as two separate code points. The character itself is stored correctly and displays correctly regardless of the collation.

But, you can only attach the COLLATE clause to strings, and you don’t pass a string into the NCHAR() function. So, the only way to get the NCHAR() function to create a supplementary character is to execute it in a database that has a supplementary character-aware default collation. Well, that’s not always possible (though it should be; there’s no good reason for the NCHAR() function to be collation-sensitive, so please vote for: NCHAR() function should always return Supplementary Character for values 0x10000 – 0x10FFFF regardless of active database's default collation).

Fortunately, there is another way to do this. Since supplementary characters are made up of surrogate pairs, we can just create the two surrogate code points. Or, we can just convert the UTF-16LE byte sequence into NVARCHAR. For example:

SELECT NCHAR(0xD808) + NCHAR(0xDD51) AS [SurrogatePair],
       CONVERT(NVARCHAR(10), 0x08D851DD) AS [UTF-16LE_bytes];
/*
SurrogatePair    UTF-16LE_bytes
𒅑            𒅑
*/

Since neither of those methods requires a supplementary character-aware collation, both can be used on versions of SQL Server prior to 2012 (even going all the way back to SQL Server 7.0)!!

Well, we solved one problem but immediately ran into another: how to find either the surrogate pair or the UTF-16LE byte sequence for any given supplementary character?

Solutions: Old and New

I used to go to the Emoticons (Emoji) 1F600—1F64F page of unicode-table.com to copy and paste characters, code points, or check the encoding chart at the bottom of each character page (the “hex” column of both “UTF-16BE” and “UTF-16LE” rows have proven most useful).

But not anymore. Now, I just hit: Ctrl + 0 . When I do that, I get a list of 188,657 code points. Each row contains the official code point value ("U+[H][H]HHHH"), the integer value, the hex value ("0x[H][H]HHHH"), the character itself, the UTF-16 Little Endian byte sequence (how it is actually stored in SQL Server / Windows / .NET, and what you get if you convert an NVARCHAR value to VARBINARY), the surrogate pair values, the T-SQL notation (which does not require using an "_SC" or "_140_" collation), the HTML (and XML) notation ("&#x[H][H]HHHH;"), the style common to many app languages ("\uHHHH" ; used for C++ / C# / F# / Java / JavaScript / Julia / etc), and finally the slightly newer, other common style that can handle all code points, not just BMP ("\UHHHHHHHH" ; used for C / C++ / C# / F# / Julia / etc). I can copy and paste any of those values and use them in queries, emails, blog posts, .NET code, and so on.

Here are a few rows of what I get back:

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(0) cp
WHERE  cp.[CdPntINT] IN (0x006D, 0x0478, 0x0959, 0x10A26);

Code Point	CdPnt INT	CdPnt BIN	Char	UTF-16LE	UTF-8	HighSrgt INT	LowSrgt INT	HighSrgt BIN	LowSrgt BIN	T-SQL	HTML/XML	C#/F#/C++/Julia/ Java[Script]/?	C#/F#/ Java[Script]/?	C#/F#/ C/C++/Julia/?
U+006D	109	0x006D	m	0x6D00	0x6D	NULL	NULL	NULL	NULL	NCHAR(0x006D)	m	\u006D	NULL	\U0000006D
U+0478	1144	0x0478	Ѹ	0x7804	0xD1B8	NULL	NULL	NULL	NULL	NCHAR(0x0478)	Ѹ	\u0478	NULL	\U00000478
U+0959	2393	0x0959	ख़	0x5909	0xE0A599	NULL	NULL	NULL	NULL	NCHAR(0x0959)	ख़	\u0959	NULL	\U00000959
U+10A26	68134	0x10A26	𐨦	0x02D826DE	0xF090A8A6	55298	56870	0xD802	0xDE26	NCHAR(0xD802) + NCHAR(0xDE26)	𐨦	NULL	\uD802\uDE26	\U00010A26

The Code

For Viewing All Code Points / Filtering / Sorting on Something Besides the Code Point

The code below creates an inline table-valued function (iTVF) that returns a list of Unicode code points along with various ways of looking at the value, and notations for creating the code points in various languages.

NOTE: this function returns 188,657 rows, which is more than the current (ver 13.0) number of assigned, non-private use code points. This is done to handle the fact that every year, Unicode assigns an additional small number of code points. Your ability to see the newly assigned code points is determined by the font(s) that you have loaded, not by SQL Server. Hence, it seemed better to return rows that are not assigned (yet) but could be assigned in the near future (and they would display correctly with the current code and an updated font), as opposed to only returning the currently assigned code points which would hide code points assigned in the future until this code is updated (if someone even knew to do that). And, just in case you were wondering: no, there is no way to dynamically / internally verify if a code point has been assigned. Even using SQLCLR to check the “General Category” of each code point wouldn’t help because it likely wouldn’t have the most recent Unicode data. But, even if it did have current data, there’s still no pure T-SQL method of determining which code points have been assigned.

The nums CTE simply generates the maximum addressable number of code points, starting at 0. I used @@MICROSOFTVERSION, a system global variable, for the ORDER BY since it returns an INT and never changes. Some people use @@SPID, which is shorter, but changes with every session. That probably doesn’t matter, but I was just wanting something as close to a constant as possible (constants are not allowed), for reasons that could very well be irrational (since I don’t know for certain that a differing value would cause any additional optimization time).

The chars CTE passes along the generated number, calculates the two code points used to create supplementary characters in UTF-16, known as a surrogate pair. Surrogate pairs can only be created from combinations of a specific 2048 code points, and those 2048 code points have no other purpose (meaning, they are not valid in any way if used individually, or even out of sequence, as in a reversed pair, since some can only be first in the pair, while the others can only be second in the pair). This is why the two "…SurrogateINT" expressions return NULL for code points under 65,536 / U+10000.

Also, the chars CTE filters out most of the unallocated code points since there is no reason to return 925,455 rows of code points that can’t possibly be used. And yes, there are some / several code points in the 0x00000 – 0x14700 and 0x16800 – 0x30000 ranges that are unallocated, but it didn’t seem like it was worth adding lots of AND conditions to filter several small ranges. Still, I might come back later and do that if I find that there are one or two ranges that have a significant number of unallocated rows.

The main query just puts the pieces from the chars CTE together in different ways. The reason for having the CASE WHEN n.[num] > 65535 THEN ... pattern in several places is for formatting. Code points are expressed in hex digits (e.g. "U+1234") and are often created using hex digits: "\u1234" for one of the most common styles across various application languages, "NCHAR(0x1234)" in T-SQL, "ሴ" in HTML / XML (yes, T-SQL and HTML / XML can also be expressed as integer values, but I find it easier to use the hex as it is more consistent, and hence makes it more likely that I will start to remember some of them as I see the same pattern of hex digits ; although, I have yet to remember the code point or surrogate pair for 💩 ). Anywho, it’s easy enough to get the hex representation in T-SQL from the original row number coming from the nums CTE, simply by doing CONVERT(BINARY(n), ...). And when dealing with the main BMP range (i.e. the lower 65,536 code points that are the most commonly used anyway), there are no formatting issues since code points are expressed with a minimum of 4 hex digits (e.g. "U+0042"). The problem can be seen when going above that range and working with supplementary characters. The issue is that when doing the conversion in T-SQL, any leading 0 will be included. And with the highest value for the third byte (i.e. the "xy" in "U+xy1234") being 10, most of the time that "x" of the "xy" will be 0. That would render the code point for the “pile of poo” emoji as "01F4A9" instead of "1F4A9". There is no functional problem here, the leading 0 doesn’t hurt anything, but it looks much nicer (and conforms to how code points are supposed to be expressed, according to the official Unicode specification) to remove the leading 0.

For the "UTF-16LE", "UTF-8", "T-SQL", and "HTML/XML" column aliases: the extra spaces to the right of the word (e.g. [UTF-8{extra spaces}]) are there for a reason. Those extra spaces enforce a minimum width for the column in the results so that as you scroll down, you won’t have to resize any of them just because the first 65,536 rows for the "T-SQL" column are "NCHAR(0xHHHH)" while the remaining 123,121 rows are "NCHAR(0xHHHH) + NCHAR(0xHHHH)", which is just over twice as wide as those initial rows. You may need to adjust (add or remove spaces) since the width is controlled by the font used for Grid Results.

Regarding the "UTF-8" column:

The first version of this function required a "_UTF8" collation, which requires SQL Server 2019, in order to have the optional "UTF-8" column. That is no longer the case. The current version of this function calculates the UTF-8 encoding directly, thus there is no dependency on SQL Server 2019, and that column now works on any version of SQL Server.
There is an input parameter, @EncodeSurrogateCodePointsInUTF8 BIT, for this function that determines how the 2048 surrogate code points are encoded in the UTF-8 column. Officially, those code points are invalid in UTF-8 and are encoded as “unknown” (U+FFFD) in the "_UTF8" collations. Passing in a value of 0 or DEFAULT will emulate that behavior. However, if you ever need the true UTF-8 byte sequences for those surrogate code points, just pass in a value of 1 for @EncodeSurrogateCodePointsInUTF8. The second example in the TESTS section below shows the difference between these two options.
For more info on UTF-8, please see the following posts:
1. Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?
2. How Many Bytes Per Character in SQL Server: a Completely Complete Guide

I added extra newlines after each field in the main select so that it would be easier to read.

The script to create this function, along with the tests, can be found on PasteBin.com: T-SQL Function (iTVF) to List All Unicode Code Points.

GO
CREATE FUNCTION dbo.ListAllUnicodeCodePoints(
                                 @EncodeSurrogateCodePointsInUTF8 BIT = 0)
RETURNS TABLE
AS RETURN
WITH nums AS
(
  SELECT TOP (1114111) (ROW_NUMBER() OVER
                        (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
  FROM   [master].[sys].[all_columns] ac1
  CROSS JOIN [master].[sys].[all_columns] ac2
), chars AS
(
  SELECT n.[num], -- pass-through
         RIGHT(CONVERT(CHAR(6), CONVERT(BINARY(3), n.[num]), 2),
               CASE
                 WHEN n.[num] > 65535 THEN 5
                 ELSE 4
               END) AS [CodePointHex],
         CONVERT(INT, CASE
                        WHEN n.[num] > 65535 THEN 55232 + (n.[num] / 1024)
                      END) AS [HighSurrogateINT],
         CONVERT(INT, CASE
                        WHEN n.[num] > 65535 THEN 56320 + (n.[num] % 1024)
                      END) AS [LowSurrogateINT]
  FROM   nums n
  WHERE  n.[num] BETWEEN 0x000000 AND 0x014700 -- filter out 925,455 
  OR     n.[num] BETWEEN 0x016800 AND 0x030000 -- unmapped code
  OR     n.[num] BETWEEN 0x0E0001 AND 0x0E01EF -- points
)
SELECT
       'U+' + c.[CodePointHex] AS [CodePoint],

       c.[num] AS [CdPntINT],

       '0x' + c.[CodePointHex] AS [CdPntBIN],

       CASE
         WHEN c.[num] > 65535 THEN NCHAR(c.[HighSurrogateINT])
           + NCHAR(c.[LowSurrogateINT])
         ELSE NCHAR(c.[num])
       END AS [Char],

       CASE
         WHEN c.[num] > 65535 THEN CONVERT(CHAR(10), CONVERT(BINARY(4),
           NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])), 1)
         ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), NCHAR(c.[num])), 1)
       END AS [UTF-16LE       ],

       '0x' + CASE
                WHEN c.[num] < 128
                  THEN CONVERT(CHAR(4), CONVERT(BINARY(1), c.[num]), 2)
                WHEN c.[num] BETWEEN 128 AND 2047
                  THEN CONVERT(CHAR(2), CONVERT(BINARY(1),
                                                ((c.[num] / 64) + 192)), 2)
                  +    CONVERT(CHAR(2), CONVERT(BINARY(1),
                                                ((c.[num] % 64) + 128)), 2)
                WHEN (@EncodeSurrogateCodePointsInUTF8 = 0)
                       AND (c.[num] BETWEEN 55296 AND 57343)
                  THEN 'EFBFBD'--Replacement (U+FFFD) Surrogate Code Points
                WHEN c.[num] BETWEEN 2048 AND 65535
                  THEN CONVERT(CHAR(2), CONVERT(BINARY(1),
                                 (((c.[num] / 64) / 64) + 224)), 2)
                  +    CONVERT(CHAR(2),
                     CONVERT(BINARY(1), (((c.[num] / 64) % 64) + 128)), 2)
                  +    CONVERT(CHAR(2),
                     CONVERT(BINARY(1), ((c.[num] % 64) + 128)), 2)
                WHEN c.[num] BETWEEN 65536 AND 1114111
                  THEN CONVERT(CHAR(2), CONVERT(BINARY(1),
                                 ((((c.[num] / 64) / 64) / 64) + 240)), 2)
                  +    CONVERT(CHAR(2), CONVERT(BINARY(1),
                                 ((((c.[num] / 64) / 64) % 64) + 128)), 2)
                  +    CONVERT(CHAR(2),
                     CONVERT(BINARY(1), (((c.[num] / 64) % 64) + 128)), 2)
                  +    CONVERT(CHAR(2),
                     CONVERT(BINARY(1), ((c.[num] % 64) + 128)), 2)
                ELSE CONVERT(VARCHAR(15), NULL)
              END AS [UTF-8          ],

       c.[HighSurrogateINT] AS [HighSrgtINT],

       c.[LowSurrogateINT] AS [LowSrgtINT],

       CONVERT(BINARY(2), c.[HighSurrogateINT]) AS [HighSrgtBIN],

       CONVERT(BINARY(2), c.[LowSurrogateINT]) AS [LowSrgtBIN],

       'NCHAR(' + CASE
                    WHEN c.[num] > 65535 THEN CONVERT(CHAR(6), 
                      CONVERT(BINARY(2), c.[HighSurrogateINT]), 1)
                      + ') + NCHAR(' + CONVERT(CHAR(6),
                      CONVERT(BINARY(2), c.[LowSurrogateINT]), 1)
                    ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), c.[num]), 1)
                  END + ')'
         AS [T-SQL                                                  ],

       '&#x' + c.[CodePointHex] + ';' AS [HTML/XML    ],

       CASE
         WHEN c.[num] < 65536 THEN '\u' + CONVERT(CHAR(4),
                                            CONVERT(BINARY(2), c.[num]), 2)
         ELSE CONVERT(VARCHAR(10), NULL)
       END AS [C#/F#/C++/Java[Script]]/Julia/?],

       CASE
         WHEN c.[num] > 65535 THEN '\u' + CONVERT(CHAR(4),
                               CONVERT(BINARY(2), c.[HighSurrogateINT]), 2)
           + '\u' + CONVERT(CHAR(4), CONVERT(BINARY(2),
                                               c.[LowSurrogateINT]), 2)
         ELSE CONVERT(VARCHAR(15), NULL)
       END AS [C#/F#/Java[Script]]/?],

       '\U' + CONVERT(CHAR(8), CONVERT(BINARY(4), c.[num]), 2)
         AS [C#/F#/C/C++/Julia/?]

FROM   chars c;
GO

Tests

List all 188,657 code points:

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(DEFAULT) cp; -- DEFAULT is same as 0

List surrogate code points to show difference in UTF-8 encoding options (Surrogate code points are invalid in UTF-8 and ideally should not be encoded):

SELECT enc.[CodePoint], enc.[CdPntINT], enc.[CdPntBIN], enc.[Char],
       no_enc.[UTF-8          ] AS [UTF-8 conforming],
       enc.[UTF-8          ] AS [UTF-8 encoded]
FROM   dbo.ListAllUnicodeCodePoints(0) no_enc -- 0 is same as DEFAULT
INNER JOIN dbo.ListAllUnicodeCodePoints(1) enc
        ON enc.[CdPntINT] = no_enc.[CdPntINT]
WHERE  no_enc.[CdPntINT] BETWEEN 0xD800 AND 0xDFFF;

List some emoji:

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(0) cp
WHERE  cp.[CdPntINT] BETWEEN 0x1F000 AND 0x1F9FF;

List the Tibetan characters, sorted naturally for that language:

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(0) cp
WHERE  cp.[CdPntINT] BETWEEN 0x0F00 AND 0x0FFF -- Tibetan
ORDER BY  cp.[Char] COLLATE Nepali_100_CS_AS;

List characters that are considered the same as “E” (when using Latin1_General_100_CI_AI):

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(0) cp
WHERE  cp.[Char] = N'E' COLLATE Latin1_General_100_CI_AI
ORDER BY  cp.[CdPntINT];
-- 94 rows!!

List characters that have a numeric value between 0 and 10,000 (for pre-SQL Server 2017, use Latin1_General_100_CI_AI):

SELECT cp.*
FROM   dbo.ListAllUnicodeCodePoints(0) cp
WHERE  cp.[Char] LIKE N'%[0-' + NCHAR(0x2182) + N']%'
                   COLLATE Japanese_XJIS_140_CI_AI--Latin1_General_100_CI_AI
ORDER BY cp.[Char] COLLATE Japanese_XJIS_140_CI_AI--Latin1_General_100_CI_AI
-- 752 rows!! (for Japanese_XJIS_140_CI_AI)
-- 550 rows!! (for Latin1_General_100_CI_AI)

Use directly as a Keyboard Query shortcut

A query shortcut allows you to execute code simply by hitting Ctrl and a number key. Cool. But first, there are a few things to know about query shortcuts:

The window in SSMS where you configure the query shortcuts has the text field for the shortcut labeled as “Stored Procedure”, which is misleading because you can specify a query. You can even specify multiple queries.
Whatever you specify needs to be a single line: no newlines / CRLFs. Any text past the first return will be truncated.
When using the keyboard query shortcuts, if nothing is highlighted then only the code in the shortcut is executed. But, if any T-SQL is highlighted when you execute the shortcut, then the highlighted code is executed after the code stored in the shortcut finishes. Adding a RETURN; at the end of the shortcut simply stops the processing after the code stored in the shortcut finishes. BUT, if anything is highlighted when you execute a query shortcut, it is still parsed, even if not executed. Hence, you can still get parse errors even with the RETURN; added at the end.

Calling a Function

You can create the iTVF shown above in a utility / admin / common database, and then reference that in the shortcut as follows:

SELECT cp.* FROM [your_Utility_DB].dbo.ListAllUnicodeCodePoints(0) cp; RETURN;

You can add filtering and/or sorting as well. Just remember to keep everything on a single line:

SELECT cp.* FROM [your_Utility_DB].dbo.ListAllUnicodeCodePoints(0) cp WHERE cp.[CdPntINT] BETWEEN 0x1F000 AND 0x1F9FF; RETURN;

To store any variation of that in a query shortcut, do the following (in SSMS, of course):

Select the entire line from one of the boxes above, or from the query editor in SSMS
Hit Ctrl + C to copy the query to the clipboard
Switch to SSMS (if not already there)
Go to the “Tools” menu
Go to “Options…”
Go to “Environment”
Go to “Keyboard”
Go to “Query shortcuts”
Scroll to the end of the “Query shortcuts:” list
Click the cursor in the “Stored Procedure” column of the “Ctrl+0” row (or any row you want, really)
Hit Ctrl + V to paste the query into the text field
Change “your_Utility_DB” to the name of the database where you created the function
Click the “OK” button
PLEASE NOTE: the change will not affect any currently open tab. You will be able to use the shortcut starting with the next tab you open. To be clear: you do not need to restart SSMS for this change to take effect (like you do when changing the Grid Results font, for example)

No Function Needed

There might be situations in which you cannot guarantee the existence of a particular function or even database (such as your utility database). In that case, the query to return this list can be reduced to a single line so that it will work by itself as a query shortcut. By doing this, the query will not be dependent on the server you are connected to in order to function. Of course, this also means that if a change is to be made to the query and there are several people at the same place, all with this stand-alone query, then that change needs to be made in each person’s SSMS, as opposed to making the change one time to one function.

The following is a one-line representation of the query we placed into the ListAllUnicodeCodePoints function, plus the RETURN;:

DECLARE @EncodeSurrogateCodePointsInUTF8 BIT = 0; WITH nums AS (SELECT TOP (1114111) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num] FROM [master].[sys].[all_columns] ac1 CROSS JOIN [master].[sys].[all_columns] ac2), chars AS (SELECT CONVERT(INT, n.[num]) AS [num], RIGHT(CONVERT(CHAR(6), CONVERT(BINARY(3), n.[num]), 2), CASE WHEN n.[num] > 65535 THEN 5 ELSE 4 END) AS [CodePointHex], CONVERT(INT, CASE WHEN n.[num] > 65535 THEN 55232 + (n.[num] / 1024) END) AS [HighSurrogateINT], CONVERT(INT, CASE WHEN n.[num] > 65535 THEN 56320 + (n.[num] % 1024) END) AS [LowSurrogateINT] FROM nums n WHERE n.[num] BETWEEN 0x000000 AND 0x014700 OR n.[num] BETWEEN 0x016800 AND 0x030000 OR n.[num] BETWEEN 0x0E0001 AND 0x0E01EF) SELECT 'U+' + c.[CodePointHex] AS [CodePoint], c.[num] AS [CdPntINT], '0x' + c.[CodePointHex] AS [CdPntBIN], CASE WHEN c.[num] > 65535 THEN NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT]) ELSE NCHAR(c.[num]) END AS [Char], CASE WHEN c.[num] > 65535 THEN CONVERT(CHAR(10), CONVERT(BINARY(4), NCHAR(c.[HighSurrogateINT]) + NCHAR(c.[LowSurrogateINT])), 1) ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), NCHAR(c.[num])), 1) END AS [UTF-16LE ], '0x' + CASE WHEN c.[num] < 128 THEN CONVERT(CHAR(4), CONVERT(BINARY(1), c.[num]), 2) WHEN c.[num] BETWEEN 128 AND 2047 THEN CONVERT(CHAR(2), CONVERT(BINARY(1), ((c.[num] / 64) + 192)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), ((c.[num] % 64) + 128)), 2) WHEN (@EncodeSurrogateCodePointsInUTF8 = 0) AND (c.[num] BETWEEN 55296 AND 57343) THEN 'EFBFBD' WHEN c.[num] BETWEEN 2048 AND 65535 THEN CONVERT(CHAR(2), CONVERT(BINARY(1), (((c.[num] / 64) / 64) + 224)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), (((c.[num] / 64) % 64) + 128)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), ((c.[num] % 64) + 128)), 2) WHEN c.[num] BETWEEN 65536 AND 1114111 THEN CONVERT(CHAR(2), CONVERT(BINARY(1), ((((c.[num] / 64) / 64) / 64) + 240)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), ((((c.[num] / 64) / 64) % 64) + 128)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), (((c.[num] / 64) % 64) + 128)), 2) + CONVERT(CHAR(2), CONVERT(BINARY(1), ((c.[num] % 64) + 128)), 2) ELSE CONVERT(VARCHAR(15), NULL) END AS [UTF-8 ], c.[HighSurrogateINT] AS [HighSrgtINT], c.[LowSurrogateINT] AS [LowSrgtINT], CONVERT(BINARY(2), c.[HighSurrogateINT]) AS [HighSrgtBIN], CONVERT(BINARY(2), c.[LowSurrogateINT]) AS [LowSrgtBIN], 'NCHAR(' + CASE WHEN c.[num] > 65535 THEN CONVERT(CHAR(6), CONVERT(BINARY(2), c.[HighSurrogateINT]), 1) + ') + NCHAR(' + CONVERT(CHAR(6), CONVERT(BINARY(2), c.[LowSurrogateINT]), 1) ELSE CONVERT(CHAR(6), CONVERT(BINARY(2), c.[num]), 1) END + ')' AS [T-SQL ], '&#x' + c.[CodePointHex] + ';' AS [HTML/XML ], CASE WHEN c.[num] < 65536 THEN '\u' + CONVERT(CHAR(4), CONVERT(BINARY(2), c.[num]), 2) ELSE CONVERT(VARCHAR(10), NULL) END AS [C#/F#/C++/Java[Script]]/Julia/?], CASE WHEN c.[num] > 65535 THEN '\u' + CONVERT(CHAR(4), CONVERT(BINARY(2), c.[HighSurrogateINT]), 2) + '\u' + CONVERT(CHAR(4), CONVERT(BINARY(2), c.[LowSurrogateINT]), 2) ELSE CONVERT(VARCHAR(15), NULL) END AS [C#/F#/Java[Script]]/?], '\U' + CONVERT(CHAR(8), CONVERT(BINARY(4), c.[num]), 2) AS [C#/F#/C/C++/Julia/?] FROM chars c; RETURN;

To store that in a query shortcut, do the following (in SSMS, of course):

Copy the query shown above:
- Click the “Copy Query to Clipboard” button
  
  OR
- Select the entire query in the box above (triple-click should select all)
- Hit Ctrl + C to copy the query to the clipboard
Switch to SSMS
Go to the “Tools” menu
Go to “Options…”
Go to “Environment”
Go to “Keyboard”
Go to “Query shortcuts”
Scroll to the end of the “Query shortcuts:” list
Click the cursor in the “Stored Procedure” column of the “Ctrl+0” row (or any row you want, really)
Hit Ctrl + V to paste the query into the text field
Click the “OK” button
PLEASE NOTE: the change will not affect any currently open tab. You will be able to use the shortcut starting with the next tab you open. To be clear: you do not need to restart SSMS for this change to take effect (like you do when changing the Grid Results font, for example)

Viewing Unicode Characters

Working with Unicode can sometimes be a little confusing because it is not all-or-none. Unicode is a specification, not software, so how it behaves depends on how a vendor implemented (or did not implement) any part of the specification. On top of that, there are different versions, and newer versions not only include new characters, but they sometimes make changes / correct sort weights and categorizations from previous versions. So, here are some things to keep in mind when looking through the list generated by ListAllUnicodeCodePoints().

How a code point behaves and how it appears are controlled by different systems. Behavior — comparison, sorting, uppercase / lowercase mappings — is controlled by SQL Server, in most cases by a collation. Newer versions of collations generally contain more definitions to guide those behaviors. The categorization of characters that are and are not valid for identifiers (i.e. names of any object in the database) is based on Unicode 3.2 and does not change across versions of SQL Server, or collation, or anything. (please see: “The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers”, Part 1 and Part 2).

However, how a code point appears is controlled by fonts, and has nothing to do with SQL Server. Fonts get updated, and new fonts get created, much more frequently than new collation versions get released. Many code points have been added, and some updated, since any collation version was released, and there is no way to access any of those new or updated definitions until a new collation version is released. But, you are still able to see newly defined code points as long as the font you are using has incorporated the new definition(s). Or, at least there is the possibility of seeing new code points.

The other tricky part of this puzzle is that fonts don’t contain definitions for all code points. I don’t believe it is even possible that they could. Fonts, I believe, are limited to 65,536 characters, and there are at least twice as many code points defined now. Some fonts have very few characters defined. To get a better sense of this, take a look at this page on fontspace.com that lists Unicode characters (page 313). On the right side it shows how many fonts include each particular character. The range on that page appears to be 2 – 19.

What all of this means is that, for any code point, if you see a character and not just a square box, then it is clearly defined. For any code point that appears as a square box, it might be undefined (by Unicode), or it might simply be missing from the font that you are using. And keep in mind that sometimes fonts don’t define certain characters correctly, or sometimes they might appear correct, but they might have combining properties that do not behave correctly, and thus do not render as expected when used with other characters.

That being said, I have had the best luck with the Code2003 font by James Kass. That font claims to map over 65,000 characters, and they look a lot better than the GNU Unifont characters. I use that font in my Grid Results:

Download the font
Install the font
Switch to SSMS
Go to the “Tools” menu
Go to “Options…”
Go to “Environment”
Go to “Fonts and Colors”
In the “Show settings for:” drop-down, select “Grid Results”
In the “Font” drop-down, select “Code2003”
Click the “OK” button
This change does require exiting and restarting SSMS before you will see the change.

Keep in mind, if you see a character that you want to use, and then copy it to the clipboard and paste it somewhere (such as the query editor, or an email, or Notepad++, etc) and it shows up as a square box, that only means that the character isn’t in the font used by that other program. But, the character is still the character that you had copied, even if you can’t see it; the byte sequence is correctly (assuming that you are in a place that can handle Unicode and isn’t 8-bit / ASCII Extended only, although there you would probably just get a "?" instead).

End Notes

The script to create this function, along with the tests, can be found on PasteBin.com: T-SQL Function (iTVF) to List All Unicode Code Points.

This is just a temporary Phase 1 of a much larger project that I am working on. This function will eventually be replaced by something even better.

But for now, enjoy…. 😎 🙃

P.S. For more info (including additional languages) regarding Unicode escape sequences, please see:
Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)

Regarding the number of code points that have been allocated, there are several numbers to choose from, depending on what you consider to be “allocated”. There are code points that are regular characters, some are formatting, some are control characters, some are surrogates (which are only used to encode Supplementary Characters in UTF-16), some are private use (never to be assigned meaning by Unicode so you can assign your own meaning to them), etc.
As of Unicode version 13.0 (published 2020-03-10), there are 283,506 “Total Designated” code points, across all categorizations / usages. Of those, 137,468 are considered “Private Use”. For my estimated number of allocated code points, I simply deducted the “Private Use” total from the “Total Designated” (since you will rarely, if ever, encounter the private use code points). That gives us: 146,038. I kept in the surrogate code points, control characters, etc as those are code points you are likely to deal with. Basically, there are about 146k code points that have been assigned meaning by Unicode. ↩
Unicode support was introduced in SQL Server 7.0, which was released on 1998-11-27. At this time, Unicode had defined UTF-16 (in version 2.0, published in July of 1996), but no supplementary characters were defined until version 3.1, which was published two and a half years later on 2001-05-16. Hence, there was really no way to incorporate support for UTF-16 / supplementary characters as there were no supplementary characters to test with. And, this was still the case when SQL Server 2000 was released. ↩

6 thoughts on “SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis 😸)”

Fun with Emoji in SSMS – Curated SQL March 27th, 20198:00 AM Reply

[…] Solomon Rutzky shares a method to generate any Unicode character in SQL Server Management Studio: […]

Loading...
Unicode, Emojis and Databases Oh My! | SQL RNNR January 5th, 20207:06 PM Reply

[…] lifting for this when he wrote a function that will do exactly what I was looking to do – here. The function that Solomon created will return all 188,657 code points that are usable in SQL […]

Loading...
John Nelson #2 April 30th, 202111:00 AM Reply

When I download the entirety of unicode characters, including Emoji, from https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip I get 143927 characters. Your function returns 188657 characters…Where are the additional 44730 coming from?

Loading...
1. Solomon Rutzky June 4th, 20213:04 PM Reply
  
  Hello John. Good question. I just added a paragraph (2nd in the “For Viewing All Code Points / Filtering / Sorting on Something Besides the Code Point” section) to explain those extra rows / characters / code points. Please let me know if that explanation does or doesn’t make sense. Thanks and take care, Solomon…
  
  Loading...

Loving this article. I keep coming back to it for reference over the years.

But now it’s 2022 and I’m seeing these additional entries in the UCD 15.0

CODE_POINT	NAME	CATEGORY
F0000	“Plane 15 Private Use, First”	Co
FFFFD	“Plane 15 Private Use, Last”	Co
100000	“Plane 16 Private Use, First”	Co
10FFFD	“Plane 16 Private Use, Last”	Co

So those don’t show up in this function because…they aren’t actual glyphs yet? They just exist in the standard but aren’t real “things” or?

Solomon Rutzky September 27th, 20225:23 PM Reply

Hi John. Good to hear from you again. Thanks for kind words and feedback, both are appreciated 😺.

As far as the Private Use ranges for planes 15 and 16 go, they have actually been defined since Unicode version 3.0.1, published in 2000. Those code points are filtered out of the code in this post as I figured that for most people / use-cases, those would be undefined and hence only contribute a lot of always-empty rows in the result set that would make it take longer to generate, use up more memory, and make it harder to scroll through the rows with defined code points. However, if you have a need/desire to see them, then by all means please adjust the WHERE clause accordingly. Over time I will have to update it anyway as more code points are defined, but I think I will still default to filtering out the private use ranges.

Loading...

SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis 😸)

The Problem

Solutions: Old and New

The Code

For Viewing All Code Points / Filtering / Sorting on Something Besides the Code Point

Tests

Use directly as a Keyboard Query shortcut

Calling a Function

No Function Needed

Viewing Unicode Characters

End Notes

Like this:

6 thoughts on “SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis 😸)”

Leave a ReplyCancel reply

The Problem

Solutions: Old and New

The Code

For Viewing All Code Points / Filtering / Sorting on Something Besides the Code Point

Tests

Use directly as a Keyboard Query shortcut

Calling a Function

No Function Needed

Viewing Unicode Characters

End Notes

Share this awesome post:

Like this:

6 thoughts on “SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis 😸)”

Leave a ReplyCancel reply

Discover more from Sql Quantum Leap