Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)


(last updated: 2020-01-10 @ 14:15 EST / 2020-01-10 @ 19:15 UTC )

ATTENTION SQL SERVER CENTRAL READERS:
If the formatting below does not look correct, then please view the original post at:
https://SqlQuantumLeap.com/2019/06/26/unicode-escape-sequences-across-various-languages-and-platforms-including-supplementary-characters/
 
 
 
For convenience, you can easily navigate to this page using the following short URL:

https://bit.ly/UnicodeEscapeSequences

 

I often need to include Unicode-only characters in my scripts, posts, etc., and have found that including such characters directly can sometimes lead to problems when there are encoding “issues”. So, as much as possible, I try to escape all Code Points above U+007F (value 127 in decimal), leaving me with a highly transportable / mostly risk-free document. But this means that I need to know how to escape Unicode characters in various languages. After looking through the documentation for a number of languages and platforms, I have noticed that the descriptions can sometimes be misleading or at least unclear, and the examples, if any are provided, nearly always show standard ASCII characters such as an uppercase US English “A”. Very few show Unicode-only BMP Code Points, and even fewer show how to escape Supplementary Characters. That is a problem because Supplementary Characters can be trickier to escape, especially if the documentation is incomplete or misleading.

The purpose of this post is to correct the overall lack of examples. All of the examples shown below are actual working examples that create both a Unicode-only BMP character (meaning a non-Supplementary Character that would require Unicode) and a Supplementary Character. Most examples include a link to an online demo, either on db<>fiddle (for database demos) or IDE One (for non-database demos), both very cool and handy sites.

I use the same two characters across all examples to hopefully make them all easier to understand. Those two characters are:

Unicode-only BMP Character

Tibetan Mark Gter Yig Mgo -Um Rnam Bcad Ma ( U+0F02 ) ༂

Encoding                                     | Binary / hex value | Integer / decimal value
UTF-8 {Bytes}                                | E0,BC,82           | 224,188,130
UTF-16 LE (Little Endian) {Bytes}            | 020F               | N / A
UTF-16 / UTF-16 BE (Big Endian) {Code Units} | 0F02               | 3842
UTF-32 {Code Point}                          | 000F02             | 3842

Supplementary Character

Alien Monster ( U+1F47E ) 👾

Encoding                                     | Binary / hex value | Integer / decimal value
UTF-8 {Bytes}                                | F0,9F,91,BE        | 240,159,145,190
UTF-16 LE (Little Endian) {Bytes}            | 3DD8,7EDC          | N / A
UTF-16 / UTF-16 BE (Big Endian) {Code Units} | D83D,DC7E          | 55357,56446
UTF-32 {Code Point}                          | 01F47E             | 128126
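
The values listed above for the two characters can be double-checked with a few lines of Python. This is only a sketch for verifying the numbers (Python is not one of the platforms covered here yet), and the helper name `describe` is made up for illustration:

```python
def describe(ch: str) -> dict:
    """Return the encoded forms of a single character (hypothetical helper)."""
    utf16_be = ch.encode("utf-16-be")
    # Split the big-endian bytes into 16-bit UTF-16 code units.
    code_units = [int.from_bytes(utf16_be[i:i + 2], "big")
                  for i in range(0, len(utf16_be), 2)]
    return {
        "code_point": ord(ch),                                   # UTF-32 value
        "utf8_bytes": ch.encode("utf-8").hex(",").upper(),       # needs Python 3.8+
        "utf16_le_bytes": ch.encode("utf-16-le").hex().upper(),
        "utf16_code_units": code_units,
    }


for ch in ("\u0F02", "\U0001F47E"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The surrogate pair for U+1F47E shows up as two entries in `utf16_code_units`, while the BMP character has only one.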

This post will be updated in the near future to include additional platforms and languages, such as: Oracle, DB2, R, Python, and VB.NET.






 

HTML, XHTML, and XML

  • &#DD; for Code Points in decimal notation (“DD” is a decimal value between 1 and 1114111 )
  • &#xHHHHHH; for Code Points in hex notation (“HHHHHH” is a hex value between 1 and 10FFFF )
    • In XML, the “x” is required to be lower-case (e.g. &#X123; is invalid in XML, but valid in HTML).
Decimal: &#3842;
Hex:     &#x0F02;


Decimal: &#128126;
Hex:     &#x1F47E;
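
If you want to generate these references rather than type them by hand, the following is a minimal sketch in Python. The function name `to_ncr` is hypothetical; `html.unescape()` from the standard library is used only to confirm the round trip:

```python
import html


def to_ncr(text: str, use_hex: bool = True) -> str:
    """Replace every Code Point above U+007F with a numeric character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)               # plain ASCII is left as-is
        elif use_hex:
            out.append(f"&#x{cp:04X};")  # lower-case "x" keeps the result valid in XML
        else:
            out.append(f"&#{cp};")
    return "".join(out)


print(to_ncr("\u0F02 \U0001F47E"))                 # hex notation
print(to_ncr("\u0F02 \U0001F47E", use_hex=False))  # decimal notation
print(html.unescape("&#3842; &#x1F47E;") == "\u0F02 \U0001F47E")
```

Because Python iterates over strings by Code Point, a Supplementary Character produces a single reference instead of two surrogate references.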



 

Microsoft SQL Server (T-SQL)

SQL Server technically does not have character escape sequences, but you can still create characters from either byte sequences or Code Points by using the CHAR() and NCHAR() functions. We are only concerned with Unicode here, so we will only be using NCHAR().

  • All versions:
    • NCHAR(0 - 65535) for BMP Code Points (using an int/decimal value)
    • NCHAR(0x0 - 0xFFFF) for BMP Code Points (using a binary/hex value)
    • NCHAR(0 - 65535) + NCHAR(0 - 65535) for a Surrogate Pair / Two UTF-16 Code Units
    • NCHAR(0x0 - 0xFFFF) + NCHAR(0x0 - 0xFFFF) for a Surrogate Pair / Two UTF-16 Code Units
    • CONVERT(NVARCHAR(size), 0xHHHH) for one or more characters in UTF-16 Little Endian (“HHHH” is 1 or more sets of 4 hex digits)
  • Starting in SQL Server 2012:
    • If the database’s default collation supports Supplementary Characters (collation name ends in _SC; or, starting in SQL Server 2017, name contains _140_ but does not end in _BIN*; or, starting in SQL Server 2019, name ends in _UTF8 but does not contain _BIN2), then NCHAR() can be given Supplementary Character Code Points:
      • decimal value can go up to 1114111
      • hex value can go up to 0x10FFFF
  • Starting in SQL Server 2019:
    • “_UTF8” collations enable CHAR and VARCHAR data to use the UTF-8 encoding:
      • CONVERT(VARCHAR(size), 0xHH) for one or more characters in UTF-8 (“HH” is 1 or more sets of 2 hex digits)
      • NOTE: The CHAR() function does not work for this purpose. It can only produce a single byte, and UTF-8 is only a single byte for values 0 – 127 / 0x00 – 0x7F.

All versions of SQL Server (at least since 2005, if not earlier):

SELECT N'T' + NCHAR(9) + N'A' + NCHAR(0x9) + N'B' AS [Single Decimal or Hex Digit],

       NCHAR(0xF02) AS [Code Point (from hex)],
       NCHAR(3842) AS [Code Point (from decimal)],

       -- We are passing in "values", _not_ "escape sequences"
       NCHAR(0x0000000000000000000000F02) AS [BINARY / hex "value"],
       NCHAR(0003842.999999999) AS [INT / decimal "value"];



-- The following syntaxes work regardless of the database's collation:
SELECT NCHAR(0xD83D) + NCHAR(0xDC7E) AS [UTF-16 Surrogate Pair (BINARY/hex)],
       NCHAR(55357) + NCHAR(56446) AS [UTF-16 Surrogate Pair (INT/decimal)],
       CONVERT(NVARCHAR(10), 0x3DD87EDC) AS [UTF-16LE bytes];

Starting with SQL Server 2012:

-- The following syntax only works if the database's default collation
--   supports Supplementary Characters (starting in SQL 2012), else the
--   NCHAR() function returns NULL:
SELECT NCHAR(0x1F47E) AS [UTF-32 (BINARY / hex)],
       NCHAR(128126) AS [UTF-32 (INT / decimal)];

Starting with SQL Server 2019:

-- Works if current database has a "_UTF8" default collation:
SELECT CONVERT(VARCHAR(10), 0xF09F91BE); -- UTF-8 bytes

-- Works regardless of database's default collation:
DECLARE @Temp TABLE
(
  [TheValue] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);

INSERT INTO @Temp ([TheValue]) VALUES (0xF09F91BE); -- UTF-8 bytes

SELECT * FROM @Temp;

See SQL Server 2017 demo on db<>fiddle


See SQL Server 2019 / UTF-8 demo on db<>fiddle
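
The Surrogate Pair values passed to NCHAR() above do not need to be looked up; they can be computed from the Code Point. Here is a minimal sketch in Python of the standard UTF-16 calculation. The function name `surrogate_pair` is made up for illustration, and the script simply prints T-SQL expressions matching the ones shown above:

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a Supplementary Character Code Point into its UTF-16 Surrogate Pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a Supplementary Character Code Point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F47E)
print(f"NCHAR(0x{high:04X}) + NCHAR(0x{low:04X})")
print(f"NCHAR({high}) + NCHAR({low})")

# UTF-16 Little Endian bytes for the CONVERT(NVARCHAR(size), 0x...) form:
le_hex = "\U0001F47E".encode("utf-16-le").hex().upper()
print(f"CONVERT(NVARCHAR(10), 0x{le_hex})")
```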





 

MySQL

There is no Unicode character escape according to the “Special Character Escape Sequences” section of the String Literals documentation. And I did try the usual ones: \x, \X, \u, \U, and \U{}.

However, you could just use a hex literal. The Hexadecimal Literals documentation states:

  • Values written using X'val' notation must contain an even number of digits or a syntax error occurs. To correct the problem, pad the value with a leading zero
  • Values written using 0xval notation that contain an odd number of digits are treated as having an extra leading 0. For example, 0xaaa is interpreted as 0x0aaa.

The other option is the CHAR() function, which has an optional USING clause for specifying the encoding.

  • _utf8mb4 0xHH for UTF-8 bytes (“HH” is 1 or more hex digits)
  • _utf8mb4 X'HH' (“HH” is an even number of hex digits)
  • _utf32 0xHH for Code Point / UTF-32 (“HH” is 1 or more hex digits)
  • _utf16 0xHH for UTF-16 (implied Big Endian ; “HH” is 1 or more hex digits)
  • _utf16le 0xHH for UTF-16 Little Endian (“HH” is 1 or more hex digits)
  • CHAR(0xHH USING encoding) (encoding name is not prefixed with an underscore “_” here!)
  • The “utf8” encoding can only handle BMP characters (i.e. 1 – 3 bytes per character)
  • The “utf8mb4” encoding can handle all Unicode characters, both BMP and Supplementary Characters (i.e. 1 – 4 bytes per character)
  • The 0xHH notation seems more convenient since it assumes leading zeros, so you can specify 0x1F47E instead of 0x01F47E, and it’s more consistent with most other languages / platforms.
  • The options shown here are not true escape sequences. They are series of bytes, allowing you to specify multiple characters in a single sequence. For example, the following all produce two characters, “AB”:
    • _utf8 0x4142
    • _utf16 0x00410042
    • CHAR(0x4142 USING utf8)
    • CHAR(0x00410042 USING utf16)
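
To see why these are byte sequences rather than escape sequences, it can help to decode the same hex digits outside of MySQL. The following is only a sketch in Python: the helper name `decode_hex` is hypothetical, and Python's codec names (`utf-8`, `utf-16-be`, etc.) stand in for the MySQL character sets:

```python
def decode_hex(hex_digits: str, encoding: str) -> str:
    """Decode a run of hex digits as bytes in the given encoding."""
    # Mimic the documented 0xval rule: an odd number of digits gets a leading zero.
    if len(hex_digits) % 2:
        hex_digits = "0" + hex_digits
    return bytes.fromhex(hex_digits).decode(encoding)


print(decode_hex("4142", "utf-8"))          # two 1-byte characters
print(decode_hex("00410042", "utf-16-be"))  # two 2-byte code units
print(decode_hex("F02", "utf-16-be"))       # padded to 0F02
print(decode_hex("3DD87EDC", "utf-16-le"))  # Surrogate Pair, Little Endian
print(decode_hex("0001F47E", "utf-32-be"))  # Code Point, written with all 8 digits
```

The first two calls both yield “AB”, matching the MySQL examples in the list above.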

Two different HEX notations:

SELECT _utf8mb4 0xF09F91BE AS "UTF-8 bytes in 0x notation",
       _utf8mb4 X'F09F91BE' AS "UTF-8 bytes in X'' notation",

       _utf32 0x1F47E AS "Code Point in 0x notation",
       _utf32 X'01F47E' AS "Code Point in X'' notation";

Introducers:

# BMP Character ( U+0F02  ):
SELECT _utf8 0xE0BC82,    # 3-byte (BMP-only) UTF-8
       _utf8mb4 0xE0BC82, # Full UTF-8
       _utf16 0xF02,      # UTF-16 (implied Big Endian)
       _utf16le 0x020F,   # UTF-16 Little Endian
       _utf32 0xF02;      # Code Point / UTF-32


# Supplementary Character ( U+1F47E ):
SELECT _utf16 0xD83DDC7E,   # UTF-16 (implied Big Endian) Surrogate Pair
       _utf16le 0x3DD87EDC, # UTF-16 Little Endian Surrogate Pair
       _utf32 0x1F47E;      # Code Point / UTF-32

CHAR() function:

# CHAR(0xHEX USING encoding) function:
SELECT CHAR(0xF09F91BE USING utf8mb4), # UTF-8 bytes
       CHAR(0xD83DDC7E USING utf16),   # UTF-16 (Big Endian) Surrogate Pair
       CHAR(0x3DD87EDC USING utf16le), # UTF-16 Little Endian Surrogate Pair
       CHAR(0x0001F47E USING utf32),   # Code Point / UTF-32
       CHAR(0x1F47E USING utf32);      # Code Point (implied leading zeros)

See MySQL 8.0 demo on db<>fiddle

See request to add capability of using U&'' escape syntax (same as what PostgreSQL uses): WL#3529: Unicode Escape Sequences (original request linked at the bottom of the “High Level Architecture” tab, BUG 10199)




 

PostgreSQL

  • \xHH (“HH” is 1 – 2 hex digits: \xH, \xHH; value between 1 and FF )
  • \uHHHH for a BMP Code Point (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
  • \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units
  • \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
  • U&'\HHHH' for a BMP Code Point (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
  • U&'\+HHHHHH' for any Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )

Also, the “String Constants With C-Style Escapes” and “String Constants With Unicode Escapes” sections of the Lexical Structure documentation state:

  • The Unicode escape syntax works fully only when the server encoding is UTF8.
  • When surrogate pairs are used when the server encoding is UTF8, they are first combined into a single code point that is then encoded in UTF-8.
  • Also, the U&'\xxxx' Unicode escape syntax for string constants only works when the configuration parameter standard_conforming_strings is turned on… If the parameter is set to off, this syntax will be rejected with an error message.
SELECT E'TAB\x9TAB' AS "Single Byte", E'\xF0\x9F\x91\xBE' AS "UTF-8 bytes";

SELECT E'\u0F02' AS "Code Point",
       E'\uD83D\uDC7E' AS "UTF-16 Surrogate Pair",
       E'\U0000D83D\U0000DC7E' AS "UTF-16 Surrogate Pair via UTF-32",
       E'\U0001F47E' AS "UTF-32";

SELECT E'\U0010FFFF' AS "Highest UTF-32 Code Point";

SELECT U&'\0F02' AS "Code Point",
       U&'\D83D\DC7E' AS "UTF-16 Surrogate Pair",
       U&'\+00D83D\+00DC7E' AS "UTF-16 Surrogate Pair via UTF-32",
       U&'\+01F47E' AS "UTF-32";

See PostgreSQL 11 demo on db<>fiddle
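
For longer strings, the U&'' form can be generated instead of written by hand. The following is a minimal sketch in Python, assuming the default backslash escape character (no UESCAPE clause); the function name `pg_unicode_literal` is made up for illustration:

```python
def pg_unicode_literal(text: str) -> str:
    """Build a U&'...' string constant, escaping every Code Point above U+007F."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            parts.append("\\\\")           # the escape character itself is written twice
        elif ch == "'":
            parts.append("''")             # standard SQL quote doubling
        elif cp <= 0x7F:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append(f"\\{cp:04X}")    # \HHHH    (always 4 hex digits)
        else:
            parts.append(f"\\+{cp:06X}")   # \+HHHHHH (always 6 hex digits)
    return "U&'" + "".join(parts) + "'"


print("SELECT " + pg_unicode_literal("\u0F02 and \U0001F47E") + ";")
```

As noted above, the resulting statement still depends on the server encoding being UTF8 and on standard_conforming_strings being on.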




 

C#

C# is a Microsoft .NET language.

The “String Escape Sequences” section of the Strings (C# Programming Guide) documentation states:

  • \xHHHH (“HHHH” is 1 – 4 hex digits: \xH, \xHH, \xHHH, or \xHHHH; value between 1 and FFFF )
    • WARNING: be careful when specifying fewer than 4 hex digits. If the characters that immediately follow the escape sequence are valid hex digits, they will be interpreted as being part of the escape sequence. Meaning, \xA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \xA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying all 4 hex digits (e.g. \x00A1 ) would solve the problem. See “Warning” example block below.
  • \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
  • \uHHHH\uHHHH or \xHHHH\xHHHH or \uHHHH\xHHHH or \xHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create a Supplementary Character
  • \U00HHHHHH for Code Point / UTF-32 (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
    • The documentation incorrectly states that this syntax is for Surrogate Pairs. I will submit a correction for that. (please see “Documentation Improvements and/or Corrections” just below example code)
    • In creating a test case to prove that the \U escape does not handle surrogate pairs, I found a bug in Mono: if the first 4 hex digits are in the range of 0x8000 – 0xFFFF then they are completely ignored and the last 4 hex digits are processed as if the first four digits were specified as being 0x0000 (i.e. as a regular UTF-16 code unit). I submitted an issue for this: “\U” Unicode escape sequence for strings accepts invalid value instead of raising error #15456.
Console.WriteLine(
    "One to Four hex digits via \\x: W\x9W, X\x09X, Y\x009Y, Z\x0009Z");
Console.WriteLine("");
Console.WriteLine("Always four hex digits via \\u: TAB\u0009TAB");
Console.WriteLine("");

Console.WriteLine("Unicode-only BMP character: (\\x) \x0F02  (\\u) \u0F02");
Console.WriteLine("");

Console.WriteLine(
    "Two UTF-16 Code Units (i.e. Surrogate Pair) via \\x: \xD83D\xDC7E");
Console.WriteLine(
    "Two UTF-16 Code Units (i.e. Surrogate Pair) via \\u: \uD83D\uDC7E");
Console.WriteLine("");

Console.WriteLine("Code Point / UTF-32 via \\U: \U00000F02");
Console.WriteLine("Code Point / UTF-32 via \\U: \U0001F47E");
Console.WriteLine("");

Console.WriteLine("Highest Code Point / UTF-32 via \\U: \U0010FFFF");

WARNING: be careful when using \x with fewer than 4 hex digits:

Console.WriteLine("-------------------");

Console.WriteLine("\\xA1 followed by a ...");
Console.WriteLine("..non-alphanumeric character ([space]): \xA1 A");
Console.WriteLine("..non-hex digit (Z): \xA1Z");
Console.WriteLine(
    "..hex digit, but intended to be used as itself (A): \xA1Ay, caramba!");
// \xA1Ay returns "ਚy" instead of "¡Ay" because \xA1A produces U+0A1A

Console.WriteLine(
    "\\x00A1 followed by a hex digit (A): \x00A1Aye aye, Captain!");

See C# demo on “IDE One”
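
The variable-length behavior of \x is easy to model. The sketch below, in Python, uses a greedy regular expression that takes 1 to 4 hex digits. It only imitates the rule described in the warning above and is not how the C# compiler is actually implemented; the function name `expand_x_escapes` is hypothetical:

```python
import re

# Greedy match: "\x" followed by as many hex digits as possible, up to four.
X_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,4})")


def expand_x_escapes(source: str) -> str:
    """Expand \\x escapes the way the warning above describes (model only)."""
    return X_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


# Raw strings keep the backslashes so that our own function does the expanding.
print(expand_x_escapes(r"\xA1 A"))               # space ends the escape
print(expand_x_escapes(r"\xA1Z"))                # "Z" is not a hex digit
print(expand_x_escapes(r"\xA1Ay, caramba!"))     # "A" is swallowed: U+0A1A then "y"
print(expand_x_escapes(r"\x00A1Ay, caramba!"))   # all 4 digits given: U+00A1 then "Ay"
```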

Documentation Improvements and/or Corrections:




 

F#

F# is a Microsoft .NET language.

See the “Remarks” section of the Strings documentation.

  • \DDD for decimal byte notation (“DDD” is always 3 decimal digits; value between 000 and 255 )
    • This escape is effectively ISO-8859-1 (first 256 characters are the same as Unicode)
    • Technically, value can go up to 999, but resulting character is determined by DDD % 256 (where % is modulus operator)
  • \xHH for hex byte notation (“HH” is always 2 hex digits; value between 01 and FF )
    • NOTE: this escape is not documented. Not sure if that is oversight or intentional. (please see “Documentation Improvements and/or Corrections” just below example code)
    • This escape is effectively ISO-8859-1 (first 256 characters are the same as Unicode)
    • Output is still UTF-16 (leading “00” is implied: \x41 is really \u0041)
  • \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
  • \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create a Supplementary Character
  • \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
    • The documentation (for “Literals”) incorrectly states that this syntax is for Surrogate Pairs. I will submit a correction for that. (please see “Documentation Improvements and/or Corrections” just below example code)
printfn "UNDOCUMENTED Decimal (NOT Octal) \\DDD requires 3 digits: TAB\9TAB\09TAB\009TAB";
printfn "\\DDD notation is ISO-8859-1 (U+0000 - U+00FF): {\128-\129-\144-\152-\160-\161}";
printfn "CHAR for \\DDD = (DDD %% 256); Max = \\999 (U+00E7): {\365-\621-\6210-\176-\100-\999-\1000}";
printfn "---------------------";

printfn "UNDOCUMENTED \\x only works with two hex digits: TAB\x9TAB\x090TAB";
printfn "\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F";
printfn "\\x is _not_ creating UTF-8: \xE0\xBC\x82"; // UTF-8 bytes for U+0F02
printfn "---------------------";

printfn "UTF-16 via \\u: \u0F02"; // U+0F02
printfn "UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E"; // U+1F47E
printfn "---------------------";

printfn "Code Point / UTF-32 via \\U: \U00000F02"; // U+0F02
printfn "Code Point / UTF-32 via \\U: \U0001F47E";

See F# demo on “IDE One”
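
The “DDD % 256” rule for the \DDD escape can be checked with a short calculation. This is a sketch in Python that models only the behavior described in the list above (it is not taken from the F# compiler); the function name `trigraph_char` is made up for illustration:

```python
def trigraph_char(ddd: str) -> str:
    """Return the character that \\DDD would produce, per the DDD % 256 rule."""
    if len(ddd) != 3 or not (ddd.isascii() and ddd.isdigit()):
        raise ValueError("\\DDD requires exactly 3 decimal digits")
    return chr(int(ddd) % 256)


for ddd in ("009", "176", "365", "621", "999"):
    ch = trigraph_char(ddd)
    print(f"\\{ddd} -> U+{ord(ch):04X}")
```

Both \365 and \621 land on U+006D (“m”) because 365 and 621 leave the same remainder when divided by 256, and \999 lands on U+00E7.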

Documentation Improvements and/or Corrections:




 

Microsoft Visual C++ / C-Style

The “Escape Sequences” and “Universal character names” sections of the String and Character Literals (C++) documentation state:

  • \888 for an encoding-dependent character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 777 )
  • \xHHHH for an encoding-dependent character (“HHHH” is 1 – 4 hex digits: \xH, \xHH, \xHHH, or \xHHHH; value between 0 and FFFF )
    • WARNING: be careful when specifying fewer than 4 hex digits. If the characters that immediately follow the escape sequence are valid hex digits, they will be interpreted as being part of the escape sequence. Meaning, \xA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \xA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying all 4 hex digits (e.g. \x00A1 ) would solve the problem. See “Warning” example block below.
  • \uHHHH (“HHHH” is always 4 hex digits; value between 0000 and FFFF )
  • \U00HHHHHH for Code Point / UTF-32 (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
  • NOTE: Neither \xHHHH\xHHHH nor \uHHHH\uHHHH can be used to represent a Surrogate Pair (i.e. two UTF-16 code units)
#include "stdafx.h"
#include <iostream>

int main()
{
    // In Command Prompt, run the following first to get this console app to return values:
    // CHCP 65001

    std::wcout << u8"\\11 and \\011: tab\11tabby\011tab" << u8"\n";
    std::wcout << u8"\\7, \\07, and \\007: bell\7bell\07bell\007bell" << u8"\n";
    std::wcout << u8"\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237" << u8"\n";
    std::wcout << u8"\\242 = \242 ; \\377 = \377 ; \\777 = \777" << u8"\n"; // \777 == \u01FF
    std::wcout << u8"-------------------------------" << u8"\n";


    std::wcout << u8"\\x works with 1 or 2 hex digits: TAB\x9TAB\x09TAB" << u8"\n";
    std::wcout << u8"\\x works with 3 or 4 hex digits: Yadda\xA1Yadda\xA1AYadda\xA1AAYadda" << u8"\n";
    std::wcout << u8"\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F" << u8"\n";
    std::wcout << u8"\\x is _not_ creating UTF-8: \xE0\xBC\x82" << u8"\n"; // UTF-8 bytes for U+0F02
    std::wcout << u8"-------------------------------" << u8"\n";


    std::wcout << u8"BMP Code Point / UTF-16 via \\u: \u0F02" << u8"\n";
    //std::wcout << L"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << L"\n"; // U+1F47E // compile error
    //std::wcout << u"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << u"\n"; // U+1F47E // compile error
    //std::wcout << u8"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << u8"\n"; // U+1F47E // compile error
    std::wcout << u8"-------------------------------" << u8"\n";


    std::wcout << u8"Code Point / UTF-32 via \\U: \U00000F02" << u8"\n";
    std::wcout << u8"Code Point / UTF-32 via \\U: \U0001F47E" << u8"\n";
    std::wcout << u8"Code Point / UTF-32 via \\U: \U0010FFFF" << u8"\n";
    //std::wcout << u8"Code Point / UTF-32 via \\U: \U00110000" << u8"\n";  // compile error
    std::wcout << u8"-------------------------------" << u8"\n";

    return 0;
}

I could not get the example code shown above to run on “IDE One”, but it did work as expected when compiled in Visual Studio, as a console app, and run from a Command Prompt.

NOTE: Be sure to run the following in a Command Prompt first if you are going to run the example shown above (it sets the code page to UTF-8):

C:\>CHCP 65001




 

C

  • 1 to 4 \xHH for UTF-8 bytes (or whatever encoding the system is using; “HH” is 1 – 2 hex digits: \xH, \xHH; value between 1 and FF )
  • \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
printf("\\x can escape a single hex digit: TAB\x9TAB");
printf("\n\n");

printf("Three UTF-8 bytes via \\x: \xE0\xBC\x82");    // U+0F02
printf("\n");
printf("Four UTF-8 bytes via \\x: \xF0\x9F\x91\xBE"); // U+1F47E

printf("\n\n");

printf("The \\U syntax requires 8 hex digits (first two are always 0):\n");
printf("Code Point / UTF-32 via \\U: \U00000F02");
printf("\n");
printf("Code Point / UTF-32 via \\U: \U0001F47E");

See C demo on “IDE One”




 

PHP

The “Double quoted” section of the String documentation states that you can use the following sequences in double quoted, not single quoted, strings:

  • All versions of PHP
    • 1 to 4 \888 for single byte / UTF-8 code unit (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
      • Values 0 – 177 (and 400 – 577) map directly to standard ASCII characters of those same byte values.
      • Values 200 – 377 (and 600 – 777) are only valid for constructing proper UTF-8 encodings of characters.
      • Technically, values 400 – 777 are accepted, but they merely equate to that value minus 400 (e.g. 400 == 0, 567 == 167, and 777 == 377). Using these values might result in a warning being thrown (e.g. “PHP Warning: Octal escape sequence overflow \476 is greater than \377”).
    • 1 to 4 \xHH for single byte / UTF-8 code unit (“HH” is 1 – 2 hex digits; value between 0 and FF )
      • Values 0 – 7F map directly to standard ASCII characters of those same byte values.
      • Values 80 – FF are only valid for constructing proper UTF-8 encodings of characters.
  • Starting in PHP 7.0.0
    • \u{HHHHHH} for the Code Point / UTF-32 bytes (“HHHHHH” is 1 – 6 hex digits; value between 0 and 10FFFF )

All versions of PHP:

echo "PHP version: ".phpversion()."\n\n";

echo "The following should work in all PHP versions:\n";
echo "\\x can escape a single hex digit: TAB\x9TAB";
echo "\n\n";

echo "Three UTF-8 bytes via \\x: \xE0\xBC\x82";    # U+0F02
echo "\n";
echo "Four UTF-8 bytes via \\x: \xF0\x9F\x91\xBE"; # U+1F47E
echo "\n\n";

echo "Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):";
echo "\n";
echo "\\11 and \\011: tab\11tabby\011tab";
echo "\n";
echo "\\7, \\07, and \\007: bell\7bell\07bell\007bell";
echo "\n";
echo "\\076 = \076 ; \\176 = \176 ; \\476 = \476 ; \\576 = \576";
echo "\n";
echo "UTF-8 bytes for U+0F02: \\340\\274\\202: \340\274\202";
echo "\n";
echo "UTF-8 bytes for U+1F47E: \\360\\237\\221\\276: \360\237\221\276";
echo "\n";
echo "UTF-8 bytes for U+1F47E: \\760\\637\\621\\676: \760\637\621\676";
echo "\n\n";

Starting in PHP 7.0.0:

echo "The following should work starting in PHP version 7.0.0:\n";
echo "Code Point / UTF-32 via \\u{}: \u{0F02}";
echo "\n";
echo "Code Point / UTF-32 via \\u{}: \u{1F47E}";

See PHP demo on “IDE One”
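
The octal values used above are simply the UTF-8 bytes written in base 8, and the “minus 400” behavior is the value being truncated to a single byte. Here is a minimal sketch in Python that checks both points; the function names are hypothetical:

```python
def octal_escapes(text: str) -> str:
    """Write each UTF-8 byte of the text as a backslash plus its octal value."""
    return "".join(f"\\{b:o}" for b in text.encode("utf-8"))


def bytes_from_octal(values) -> bytes:
    """Turn octal strings into bytes, wrapping 400 - 777 down to 0 - 377."""
    return bytes(int(v, 8) & 0xFF for v in values)


print(octal_escapes("\u0F02"))
print(octal_escapes("\U0001F47E"))

# The in-range and the out-of-range octal values produce the same four bytes:
print(bytes_from_octal(["360", "237", "221", "276"]).hex().upper())
print(bytes_from_octal(["760", "637", "621", "676"]).hex().upper())
```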

More info on the “\u{}” syntax




 

JavaScript

The “Escape notation” section of the String global object documentation states that you can use the following sequences in both double quoted and single quoted strings:

  • All versions of JavaScript
    • \888 for ISO-8859-1 character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
    • \xHH for ISO-8859-1 character (“HH” is always 2 hex digits; value between 00 and FF )
    • \uHHHH (“HHHH” is always 4 hex digits; value between 0000 and FFFF )
    • fromCharCode(NNN [, NNN, ...]) function for UTF-16 (“NNN” is 1 or more integer and/or hex UTF-16 values; value between 0 and 65535 / 0xFFFF ). Specify Surrogate Pairs to create Supplementary Characters. See documentation.
      • The documentation incorrectly states that this function cannot create Supplementary Characters. I will submit a correction for that. (please see “Documentation Improvements and/or Corrections” just below example code)
  • Newer versions?
    • fromCodePoint(NNN [, NNN, ...]) function for Code Point / UTF-32 (“NNN” is 1 or more integer and/or hex Code Points / UTF-32 values; value between 0 and 1114111 / 0x10FFFF ). See documentation
      • While updating the documentation for this function, I discovered a bug in the documentation editor in that it crashes with a “500 Internal Server Error” when saving if there are any supplementary characters. Oops. (please see “Documentation Improvements and/or Corrections” just below example code)
      • I cannot get this function to work on either of the JavaScript versions on IDEOne.com, but…
      • The following does work in my browser (Chrome): alert("String.fromCodePoint(0x1F47E) = " + String.fromCodePoint(0x1F47E));
  • Coming Soon
    • \u{HHHHHH} for the Code Point / UTF-32 code unit (“HHHHHH” is 1 – 6 hex digits; value between 0 and 10FFFF )
      • Documentation states: “This is an experimental API that should not be used in production code”
      • I cannot get this syntax to work on either of the JavaScript versions on IDEOne.com, but…
      • The following does work in my browser (Chrome): alert("\\u{1F47E} = \u{1F47E}");
// \x9 throws an error when using JavaScript (SMonkey 24.2.0).
print("\\x only works with two hex digits: TAB\x9TAB\x090TAB");
print("\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F");
print("\\x is _not_ creating UTF-8: \xE0\xBC\x82"); // UTF-8 bytes for U+0F02
print("");

print("BMP Code Point / UTF-16 via \\u: \u0F02");
print("UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E"); // U+1F47E
print("");

// \u{} throws an error when using JavaScript (SMonkey 24.2.0).
print("\\u{} is noted as being \"experimental, should not be used in production code\":");
print("Code Point / UTF-32 via \\u{}: \u{0F02}");  // NO EFFECT (YET!!!)
print("Code Point / UTF-32 via \\u{}: \u{1F47E}"); // NO EFFECT (YET!!!)
print("-------------------------------");

print("UTF-16 via String.fromCharCode(decimal): " + String.fromCharCode(3842));
print("UTF-16 via String.fromCharCode(hex): " + String.fromCharCode(0x0F02));
print("");

print("UTF-16 Surrogate Pair via String.fromCharCode(decimal): " + String.fromCharCode(55357, 56446));
print("UTF-16 Surrogate Pair via String.fromCharCode(hex): " + String.fromCharCode(0xD83D, 0xDC7E));
print("");

print("Multiple UTF-16 via String.fromCharCode(decimal): " + String.fromCharCode(3842, 32, 55357, 56446));
print("Multiple UTF-16 via String.fromCharCode(hex): " + String.fromCharCode(0x0F02, 0x20, 0xD83D, 0xDC7E));
print("-------------------------------");

// Like \x, the octal escape sequence uses the ISO-8859-1 character set
print("Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):");
print("\\11 and \\011: tab\11tabby\011tab");
print("\\7, \\07, and \\007: bell\7bell\07bell\007bell");
print("\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237");
print("\\242 = \242 ; \\377 = \377 ; \\504 = \504");
print("-------------------------------");

// String.fromCodePoint() raises an error in both (rhino 1.7.7) and (SMonkey 24.2.0).
//print("Code Point / UTF-32 via String.fromCodePoint(decimal): " + String.fromCodePoint(3842));
//print("Code Point / UTF-32 via String.fromCodePoint(hex): " + String.fromCodePoint(0x0F02));

See JavaScript demo on “IDE One”
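
The reason String.fromCharCode() can create Supplementary Characters is that it accepts UTF-16 code units, and two code units forming a Surrogate Pair are one character. The sketch below is a rough Python analogue, not JavaScript: the function name `from_char_code` is made up, and unlike JavaScript it raises an error for an unpaired surrogate:

```python
import struct


def from_char_code(*units: int) -> str:
    """Join UTF-16 code units (0 - 65535 each) into a string."""
    # Pack each unit as an unsigned 16-bit little-endian value, then decode as UTF-16 LE.
    raw = struct.pack(f"<{len(units)}H", *units)
    return raw.decode("utf-16-le")


pair = from_char_code(0xD83D, 0xDC7E)
print(len(pair), f"U+{ord(pair):04X}")   # one character, not two

mixed = from_char_code(3842, 32, 55357, 56446)
print([f"U+{ord(c):04X}" for c in mixed])
```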

Documentation Improvements and/or Corrections:




 

Julia

The “Characters” and “Byte Array Literals” sections of the main “Strings” documentation state that you can use the following sequences in both double quoted strings and single quoted character literals:

  • WARNING: All escape sequences in Julia are variable length. Be careful when specifying fewer than the maximum number of digits. If the characters that immediately follow the escape sequence are valid hex or octal digits (depending on the type of escape sequence being used), they will be interpreted as being part of the escape sequence. Meaning, \uA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \uA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying the maximum number of digits for that type of escape sequence (e.g. \u00A1 ) would solve the problem. See “Warning” example block below.
  • 1 to 4 \888 for single byte / UTF-8 code unit (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
    • Technically, values 400 – 777 are accepted, but they merely equate to that value minus 400 (e.g. 400 == 0, 567 == 167, and 777 == 377)
    • Values 0 – 177 (and 400 – 577) map directly to standard ASCII characters of those same byte values.
    • Values 200 – 377 (and 600 – 777) are only valid for constructing proper UTF-8 encodings of characters.
  • 1 to 4 \xHH for single byte / UTF-8 code unit (“HH” is 1 – 2 hex digits; value between 0 and FF )
    • Values 0 – 7F map directly to standard ASCII characters of those same byte values.
    • Values 80 – FF are only valid for constructing proper UTF-8 encodings of characters.
  • \uHHHH for BMP Code Point (“HHHH” is 1 – 4 hex digits; value between 0 and FFFF )
    • \u cannot be used to specify pairs of Surrogate Code Points (i.e. Surrogate Pairs) to create Supplementary Characters.
  • \UHHHHHHHH for any Code Point / UTF-32 code unit (“HHHHHHHH” is 1 – 8 hex digits; value between 0 and 0010FFFF )

Testing done with command-line julia.exe Version 1.2.0 (2019-08-20).

julia> # \x works with one or two hex digits:
julia> print("TAB\x9TAB\x09TAB")
TAB     TAB     TAB


julia> # \x is directly encoding UTF-8; it is not ISO-8859-1:
julia> print("\\xC1 should be Á, but here it's: \xC1")
\xC1 should be Á, but here it's: �


julia> # UTF-8 bytes for U+0F02:
julia> codepoint('\xE0\xBC\x82')
0x00000f02


julia> # UTF-8 bytes for U+1F47E:
julia> codepoint('\xF0\x9F\x91\xBE')
0x0001f47e

julia> ##################################################

julia> # Like \x, the octal escape sequence injects single bytes into a UTF-8 encoding.
julia> # Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):

julia> print("\\11 and \\011: tab\11tabby\011tab")
\11 and \011: tab       tabby   tab

julia> print("\\7, \\07, and \\007: Bell\7Bell\07Bell\007Bell")
\7, \07, and \007: BellBellBellBell

julia> # UTF-8 bytes for U+0F02:
julia> codepoint('\340\274\202')
0x00000f02

julia> # UTF-8 bytes for U+1F47E:
julia> codepoint('\360\237\221\276')
0x0001f47e

julia> ##################################################

julia> # BMP Code Point (U+0000 - U+FFFF) via \u:
julia> codepoint('\uF02')
0x00000f02

julia> codepoint('\u0F02')
0x00000f02

julia> # \u produces code points, not bytes:
julia> print("\xE0\xBC\x82  as opposed to: \uE0\uBC\u82")
?  as opposed to: �

julia> ##################################################

julia> # BMP and Supplementary Character Code Points (U+0000 - U+10FFFF) via \U:
julia> codepoint('\UF02')
0x00000f02

julia> codepoint('\U0F02')
0x00000f02

julia> codepoint('\U000F02')
0x00000f02

julia> codepoint('\U00000F02')
0x00000f02


julia> codepoint('\U1F47E')
0x0001f47e

julia> codepoint('\U01F47E')
0x0001f47e

julia> codepoint('\U0001F47E')
0x0001f47e


julia> # \U produces code points, not bytes:
julia> print("\xE0\xBC\x82  as opposed to: \UE0\UBC\U82")
?  as opposed to: �

julia> print("\xF0\x9F\x91\xBE  as opposed to: \UF0\U9F\U91\UBE")
�  as opposed to: ð??¾
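
The difference between injecting bytes (\x) and specifying Code Points (\u and \U) can also be reproduced outside of Julia. The following is only a sketch in Python that treats the same three values first as UTF-8 bytes and then as three separate Code Points; the variable names are made up for illustration:

```python
utf8_bytes = b"\xE0\xBC\x82"

# Interpreted as UTF-8 bytes (what \x does in Julia): one character, U+0F02.
as_bytes = utf8_bytes.decode("utf-8")
print([f"U+{ord(c):04X}" for c in as_bytes])

# Interpreted as Code Points (what \u and \U do): three characters, U+00E0, U+00BC, U+0082.
as_code_points = "".join(chr(b) for b in utf8_bytes)
print([f"U+{ord(c):04X}" for c in as_code_points])

# A lone 0xC1 byte is "Á" in ISO-8859-1, but it is not valid UTF-8 on its own,
# so a UTF-8 decoder can only substitute U+FFFD (the replacement character).
lone = b"\xC1"
print(lone.decode("latin-1"), f"U+{ord(lone.decode('utf-8', errors='replace')):04X}")
```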

Documentation Improvements and/or Corrections:




 

Java

The “3.3. Unicode Escapes” section of the “Chapter 3. Lexical Structure” documentation, as well as the “3.10.6. Escape Sequences for Character and String Literals” section, state that you can use the following escape sequences in strings:

  • \888 for ISO-8859-1 character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
  • \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
  • \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create Supplementary Character
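  • String(int[] codePoints, int offset, int count) String constructor. Each int is a Unicode Code Point, so a Supplementary Character can be passed directly as a single value (e.g. 0x1F47E). Passing the two code units of a Surrogate Pair as separate int values also works, and is what the example below does. See documentation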
  • Character.toChars(int codePoint) static method. This will return a char[] containing one element if codePoint represents a BMP character, else two elements (the Surrogate Pair) if it represents a Supplementary Character. See documentation
  • Character.toChars(int codePoint, char[] dst, int dstIndex) static method. This will replace one element of the char[] if codePoint represents a BMP character, else two elements (the Surrogate Pair) if it represents a Supplementary Character. See documentation and example below.
import java.util.*;
import java.lang.*;
import java.io.*;

class SqlQuantumLeap
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // The octal escape sequence uses the ISO-8859-1 character set
        System.out.println("Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):");
        System.out.println("\\11 and \\011: tab\11tabby\011tab");
        System.out.println("\\7, \\07, and \\007: bell\7bell\07bell\007bell");
        System.out.println("\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237");
        System.out.println("\\242 = \242 ; \\377 = \377 ; \\504 = \504");
        System.out.println("-------------------------------");

        System.out.println("BMP Code Point / UTF-16 via \\u: \u0F02");
        System.out.println("UTF-16 Surrogate Pair via \\u\\u: \uD83D\uDC7E"); // U+1F47E
        System.out.println("-------------------------------");

        //  ---------------------------------------------------------------

        System.out.println("String constructor: " + new String(
            new int[]{ 0x0F02, 32, 65, 32, 0xD83D, 0xDC7E }, 0, 6 )); // U+1F47E

        char[] tc1 = Character.toChars(0x0F02);
        System.out.println("Character.toChars(int) static method (codePoint = U+0F02):");
        System.out.println("   Size of array returned for BMP Character: " + tc1.length);
        System.out.println("   String created from char[]: " + new String(tc1));

        char[] tc2 = Character.toChars(0x1F47E);
        System.out.println("Character.toChars(int) static method (codePoint = U+1F47E):");
        System.out.println("   Size of array returned for Supplementary Character: " + tc2.length);
        System.out.println("   String created from char[]: " + new String(tc2));

        char[] tc3 = new char[] { 65, 66, 67, 68, 69, 70 };
        System.out.println("Character.toChars(int, char[], int) static method (codePoint = U+1F47E):");
        System.out.println("   Initial String created from char[]: " + new String(tc3));
        Character.toChars(0x1F47E, tc3, 2); // overwrites elements 2 and 3 ('C' and 'D') with the Surrogate Pair
        System.out.println("   String created from char[] after Character.toChars(): " + new String(tc3));
    }
}

See Java demo on “IDE One”
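
The String(int[], int, int) constructor works on Code Points rather than UTF-16 code units, so the Surrogate Pair does not have to be worked out by hand. The following is only a minimal sketch of that idea, not part of the demo above: the class name CodePointDemo and the helper fromCodePoint are made up for illustration.

```java
public class CodePointDemo {
    // Hypothetical helper: builds a String from one Code Point (BMP or Supplementary).
    public static String fromCodePoint(int codePoint) {
        return new String(new int[] { codePoint }, 0, 1);
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x0F02);
        String supplementary = fromCodePoint(0x1F47E);

        // length() counts UTF-16 code units, not Code Points.
        System.out.println(bmp.length());           // 1
        System.out.println(supplementary.length()); // 2

        // Same string as the one produced by the \u\u Surrogate Pair escape.
        System.out.println(supplementary.equals("\uD83D\uDC7E")); // true

        // codePointAt() reassembles the Surrogate Pair into the original value.
        System.out.println(Integer.toHexString(supplementary.codePointAt(0))); // 1f47e
    }
}
```

Passing the Code Point directly is less error-prone than typing out a Surrogate Pair, but the \uHHHH\uHHHH form is still the only option inside a string literal.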




 

Excel / VBA

  • Pre-Excel 2013
    • Create the following VBA function. You might need to “Show Developer tab in the Ribbon”, and the steps to do that differ between versions of Excel:
      1. Office button
      2. “Excel Options” button
      3. Go to tab:
        • Older Excel: “Customize” tab
        • Newer Excel: “Quick Access Toolbar” tab
      4. Select “Visual Basic” from list of commands on left side. If this command is not in the list, you might need to select “All Commands” or “Developer Tab” from the drop-down above the list of commands.
      5. “Add >>” button
      6. “OK” button
    • Adapted from @stema’s answer on SuperUser.StackExchange
    • Click the “Visual Basic” button
    • Insert a new Module with the following contents:
      Function UnicodeFromInt(val As Long)
          If val < 0 Or val > 1114111 Then
              UnicodeFromInt = "ERROR: value must be between 0 and 1114111!!"
              GoTo GetOut
          End If
      
          If val >= 55296 And val <= 57343 Then
              UnicodeFromInt = "ERROR: surrogate code points are not displayable!!"
              GoTo GetOut
          End If
      
      
          If val < 65536 Then
              UnicodeFromInt = ChrW(val)
          Else
              UnicodeFromInt = ChrW(55232 + Int(val / 1024)) & ChrW(56320 + Int(val Mod 1024))
          End If
      
      GetOut:
      End Function
      
      Function UnicodeFromHex(val As String)
          UnicodeFromHex = UnicodeFromInt("&H" & val)
      End Function
      
  • Starting in Excel 2013
    • UNICHAR(DDDD) function (“DDDD” is an integer value from 1 to 1114111, representing a Unicode Code Point). See documentation

Pre-Excel 2013:

=UnicodeFromInt(3842)

=UnicodeFromHex("F02")
=UnicodeFromHex("0F02")



=UnicodeFromInt(128126)

=UnicodeFromHex("1F47E")
=UnicodeFromHex("01F47E")

Starting in Excel 2013:

=UNICHAR(3842)

=UNICHAR(128126)
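
The Surrogate Pair arithmetic used by UnicodeFromInt (55232 + value / 1024 for the high surrogate, 56320 + value Mod 1024 for the low surrogate) can be checked against a known-good implementation. The following is a rough sketch in Java, assuming a made-up class name SurrogateMath and helper toSurrogatePair; it compares the hand-computed pair with what Character.toChars returns.

```java
import java.util.Arrays;

public class SurrogateMath {
    // Hypothetical helper mirroring the VBA arithmetic; only valid for U+10000 - U+10FFFF.
    public static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Supplementary Character Code Point");
        }
        // 55232 = 0xD800 - (0x10000 / 1024), which folds the "subtract 0x10000" step into the base.
        char high = (char) (55232 + codePoint / 1024);
        // 56320 = 0xDC00; the low surrogate carries the lowest 10 bits.
        char low = (char) (56320 + codePoint % 1024);
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(128126); // U+1F47E
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DC7E
        System.out.println(Arrays.equals(pair, Character.toChars(128126))); // true
    }
}
```

The result matches the \uD83D\uDC7E pair used in the Java section, which is a quick way to confirm that the constants in the VBA function are correct.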
