(last updated: 2020-01-10 @ 14:15 EST / 2020-01-10 @ 19:15 UTC )
I often need to include Unicode-only characters in my scripts, posts, etc., and have found that including such characters directly can sometimes lead to problems when there are encoding “issues”. So, as much as possible I try to escape all Code Points above U+007F (value 127 in decimal), leaving me with a highly transportable / mostly risk-free document. But, this means that I need to know how to escape Unicode characters in various languages. After looking through the documentation for a number of languages and platforms, I have noticed that the descriptions can sometimes be misleading or at least unclear, and the examples, if any are provided, nearly always show standard ASCII characters such as an uppercase US English “A”. Very few show Unicode-only BMP Code Points, and even fewer show how to escape Supplementary Characters. Not showing examples of escaping Supplementary Characters is a problem because they can be trickier to escape, especially if the documentation is incomplete or misleading.
The purpose of this post is to correct the overall lack of examples. Everything shown below is an actual working example of creating both a Unicode-only BMP character (meaning a non-Supplementary Character that would require Unicode) and a Supplementary Character. Most examples include a link to an online demo, either on db<>fiddle (for database demos) or IDE One (for non-database demos), both very cool and handy sites.
I use the same two characters across all examples to hopefully make them all easier to understand. Those two characters are:
- Unicode-only BMP Character: “Tibetan Mark Gter Yig Mgo -Um Rnam Bcad Ma” ( U+0F02 ) ༂
- Supplementary Character: “Alien Monster” ( U+1F47E ) 👾
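The UTF-16 Surrogate Pair values ( D83D DC7E ) and UTF-8 byte sequences ( E0 BC 82 and F0 9F 91 BE ) that show up repeatedly in the examples below are all derived from the Code Points themselves. The following is a minimal sketch of that arithmetic (Java is used here only because it is one of the languages covered below; the class name is arbitrary and the math is the same in any language):

public class EscapeMath
{
    public static void main(String[] args)
    {
        int codePoint = 0x1F47E;  // "Alien Monster" (decimal 128126)

        // UTF-16 Surrogate Pair (only needed for Code Points above U+FFFF):
        int temp = codePoint - 0x10000;               // 0x0F47E
        int highSurrogate = 0xD800 + (temp >> 10);    // 0xD83D
        int lowSurrogate  = 0xDC00 + (temp & 0x3FF);  // 0xDC7E
        System.out.printf("UTF-16 BE code units: %04X %04X%n", highSurrogate, lowSurrogate);
        // UTF-16 Little Endian just swaps the bytes within each code unit: 3D D8 7E DC

        // UTF-8 bytes (4-byte form, for Code Points U+10000 - U+10FFFF):
        int b1 = 0xF0 | (codePoint >> 18);
        int b2 = 0x80 | ((codePoint >> 12) & 0x3F);
        int b3 = 0x80 | ((codePoint >> 6) & 0x3F);
        int b4 = 0x80 | (codePoint & 0x3F);
        System.out.printf("UTF-8 bytes: %02X %02X %02X %02X%n", b1, b2, b3, b4);  // F0 9F 91 BE

        // UTF-8 bytes (3-byte form, for BMP Code Points U+0800 - U+FFFF):
        int bmp = 0x0F02;  // "Tibetan Mark Gter Yig Mgo -Um Rnam Bcad Ma"
        System.out.printf("UTF-8 bytes: %02X %02X %02X%n",
            0xE0 | (bmp >> 12),
            0x80 | ((bmp >> 6) & 0x3F),
            0x80 | (bmp & 0x3F));  // E0 BC 82
    }
}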
This post will be updated in the near future to include additional platforms and languages, such as: Oracle, DB2, R, Python, and VB.NET.
HTML, XHTML, and XML
- &#DD; for Code Points in decimal notation (“DD” is a decimal value between 1 and 1114111 )
- &#xHHHHHH; for Code Points in hex notation (“HHHHHH” is a hex value between 1 and 10FFFF )
  - In XML, the “x” is required to be lower-case (e.g. &#X123; is invalid in XML, but valid in HTML).
Decimal: &#3842;    Hex: &#x0F02;    (both render as ༂)
Decimal: &#128126;  Hex: &#x1F47E;   (both render as 👾)
Microsoft SQL Server (T-SQL)
SQL Server technically does not have character escape sequences, but you can still create characters from either byte sequences or Code Points using the CHAR() and NCHAR() functions. We are only concerned with Unicode here, so we will only be using NCHAR().
- All versions:
  - NCHAR(0 - 65535) for BMP Code Points (using an int/decimal value)
  - NCHAR(0x0 - 0xFFFF) for BMP Code Points (using a binary/hex value)
  - NCHAR(0 - 65535) + NCHAR(0 - 65535) for a Surrogate Pair / Two UTF-16 Code Units
  - NCHAR(0x0 - 0xFFFF) + NCHAR(0x0 - 0xFFFF) for a Surrogate Pair / Two UTF-16 Code Units
  - CONVERT(NVARCHAR(size), 0xHHHH) for one or more characters in UTF-16 Little Endian (“HHHH” is 1 or more sets of 4 hex digits)
- Starting in SQL Server 2012:
  - If the database’s default collation supports Supplementary Characters (collation name ends in _SC, or starting in SQL Server 2017 name contains _140_ but does not end in _BIN*, or starting in SQL Server 2019 name ends in _UTF8 but does not contain _BIN2), then NCHAR() can be given Supplementary Character Code Points:
    - decimal value can go up to 1114111
    - hex value can go up to 0x10FFFF
- Starting in SQL Server 2019:
  - “_UTF8” collations enable CHAR and VARCHAR data to use the UTF-8 encoding:
    - CONVERT(VARCHAR(size), 0xHH) for one or more characters in UTF-8 (“HH” is 1 or more sets of 2 hex digits)
    - NOTE: The CHAR() function does not work for this purpose. It can only produce a single byte, and UTF-8 is only a single byte for values 0 – 127 / 0x00 – 0x7F.
All versions of SQL Server (at least since 2005, if not earlier):
SELECT N'T' + NCHAR(9) + N'A' + NCHAR(0x9) + N'B' AS [Single Decimal or Hex Digit],
       NCHAR(0xF02) AS [Code Point (from hex)],
       NCHAR(3842) AS [Code Point (from decimal)],
       -- We are passing in "values", _not_ "escape sequences"
       NCHAR(0x0000000000000000000000F02) AS [BINARY / hex "value"],
       NCHAR(0003842.999999999) AS [INT / decimal "value"];

-- The following syntaxes work regardless of the database's collation:
SELECT NCHAR(0xD83D) + NCHAR(0xDC7E) AS [UTF-16 Surrogate Pair (BINARY/hex)],
       NCHAR(55357) + NCHAR(56446) AS [UTF-16 Surrogate Pair (INT/decimal)],
       CONVERT(NVARCHAR(10), 0x3DD87EDC) AS [UTF-16LE bytes];
Starting with SQL Server 2012:
-- The following syntax only works if the database's default collation
-- supports Supplementary Characters (starting in SQL 2012), else the
-- NCHAR() function returns NULL:
SELECT NCHAR(0x1F47E) AS [UTF-32 (BINARY / hex)],
       NCHAR(128126) AS [UTF-32 (INT / decimal)];
Starting with SQL Server 2019:
-- Works if current database has a "_UTF8" default collation:
SELECT CONVERT(VARCHAR(10), 0xF09F91BE); -- UTF-8 bytes

-- Works regardless of database's default collation:
DECLARE @Temp TABLE
(
    [TheValue] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8 NOT NULL
);
INSERT INTO @Temp ([TheValue]) VALUES (0xF09F91BE); -- UTF-8 bytes
SELECT * FROM @Temp;
See SQL Server 2017 demo on db<>fiddle
See SQL Server 2019 / UTF-8 demo on db<>fiddle
Also see:
- Please vote for my suggestion to improve the NCHAR() function so that it always supports Supplementary Characters, regardless of the database’s default collation: NCHAR() function should always return Supplementary Character for values 0x10000 – 0x10FFFF regardless of active database’s default collation
- How Many Bytes Per Character in SQL Server: a Completely Complete Guide
MySQL
There is no Unicode character escape according to the “Special Character Escape Sequences” section of the String Literals documentation. And I did try the usual ones: \x, \X, \u, \U, and \U{}.
However, you could just use a hex literal. The Hexadecimal Literals documentation states:
- Values written using X'val' notation must contain an even number of digits or a syntax error occurs. To correct the problem, pad the value with a leading zero.
- Values written using 0xval notation that contain an odd number of digits are treated as having an extra leading 0. For example, 0xaaa is interpreted as 0x0aaa.
The other option is the CHAR() function, which has an optional USING clause for specifying the encoding.

- _utf8mb4 0xHH for UTF-8 bytes (“HH” is 1 or more hex digits)
- _utf8mb4 X'HH' (“HH” is an even number of hex digits)
- _utf32 0xHH for Code Point / UTF-32 (“HH” is 1 or more hex digits)
- _utf16 0xHH for UTF-16 (implied Big Endian ; “HH” is 1 or more hex digits)
- _utf16le 0xHH for UTF-16 Little Endian (“HH” is 1 or more hex digits)
- CHAR(0xHH USING encoding) (encoding name is not prefixed with an underscore “_” here!)
- The “utf8” encoding can only handle BMP characters (i.e. 1 – 3 bytes per character)
- The “utf8mb4” encoding can handle all Unicode characters, BMP and Supplementary (i.e. 1 – 4 bytes per character)
- The 0xHH notation seems more convenient since it assumes leading zeros, so you can specify 0x1F47E instead of 0x01F47E, and it’s more consistent with most other languages / platforms.
- The options shown here are not true escape sequences. They are series of bytes, allowing you to specify multiple characters in a single sequence. For example, the following all produce two characters, “AB”:
  - _utf8 0x4142
  - _utf16 0x00410042
  - CHAR(0x4142 USING utf8)
  - CHAR(0x00410042 USING utf16)
Two different HEX notations:
SELECT _utf8mb4 0xF09F91BE AS "UTF-8 bytes in 0x notation",
       _utf8mb4 X'F09F91BE' AS "UTF-8 bytes in X'' notation",
       _utf32 0x1F47E AS "Code Point in 0x notation",
       _utf32 X'01F47E' AS "Code Point in X'' notation";
Introducers:
# BMP Character ( U+0F02 ):
SELECT _utf8 0xE0BC82,    # 3-byte (BMP-only) UTF-8
       _utf8mb4 0xE0BC82, # Full UTF-8
       _utf16 0xF02,      # UTF-16 (implied Big Endian)
       _utf16le 0x020F,   # UTF-16 Little Endian
       _utf32 0xF02;      # Code Point / UTF-32

# Supplementary Character ( U+1F47E ):
SELECT _utf16 0xD83DDC7E,   # UTF-16 (implied Big Endian) Surrogate Pair
       _utf16le 0x3DD87EDC, # UTF-16 Little Endian Surrogate Pair
       _utf32 0x1F47E;      # Code Point / UTF-32
CHAR() function:
# CHAR(0xHEX USING encoding) function:
SELECT CHAR(0xF09F91BE USING utf8mb4), # UTF-8 bytes
       CHAR(0xD83DDC7E USING utf16),   # UTF-16 (Big Endian) Surrogate Pair
       CHAR(0x3DD87EDC USING utf16le), # UTF-16 Little Endian Surrogate Pair
       CHAR(0x0001F47E USING utf32),   # Code Point / UTF-32
       CHAR(0x1F47E USING utf32);      # Code Point (implied leading zeros)
See MySQL 8.0 demo on db<>fiddle
See request to add capability of using U&'' escape syntax (same as what PostgreSQL uses): WL#3529: Unicode Escape Sequences (original request linked at the bottom of the “High Level Architecture” tab, BUG 10199)
PostgreSQL
- \xHH (“HH” is 1 – 2 hex digits: \xH, \xHH; value between 1 and FF )
- \uHHHH for a BMP Code Point (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
- \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units
- \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
- U&'\HHHH' for a BMP Code Point (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
- U&'\+HHHHHH' for any Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
Also, the “String Constants With C-Style Escapes” and “String Constants With Unicode Escapes” sections of the Lexical Structure documentation state:

- The Unicode escape syntax works fully only when the server encoding is UTF8.
- When surrogate pairs are used when the server encoding is UTF8, they are first combined into a single code point that is then encoded in UTF-8.
- Also, the U&'\xxxx' Unicode escape syntax for string constants only works when the configuration parameter standard_conforming_strings is turned on… If the parameter is set to off, this syntax will be rejected with an error message.
SELECT E'TAB\x9TAB' AS "Single Byte",
       E'\xF0\x9F\x91\xBE' AS "UTF-8 bytes";

SELECT E'\u0F02' AS "Code Point",
       E'\uD83D\uDC7E' AS "UTF-16 Surrogate Pair",
       E'\U0000D83D\U0000DC7E' AS "UTF-16 Surrogate Pair via UTF-32",
       E'\U0001F47E' AS "UTF-32";

SELECT E'\U0010FFFF' AS "Highest UTF-32 Code Point";

SELECT U&'\0F02' AS "Code Point",
       U&'\D83D\DC7E' AS "UTF-16 Surrogate Pair",
       U&'\+00D83D\+00DC7E' AS "UTF-16 Surrogate Pair via UTF-32",
       U&'\+01F47E' AS "UTF-32";
See PostgreSQL 11 demo on db<>fiddle
C#
C# is a Microsoft .NET language.
The “String Escape Sequences” section of the Strings (C# Programming Guide) documentation states:
- \xHHHH (“HHHH” is 1 – 4 hex digits: \xH, \xHH, \xHHH, or \xHHHH; value between 1 and FFFF )
  - WARNING: be careful when specifying less than 4 hex digits. If the characters that immediately follow the escape sequence are valid hex digits, they will be interpreted as being part of the escape sequence. Meaning, \xA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \xA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying all 4 hex digits (e.g. \x00A1 ) would solve the problem. See “Warning” example block below.
- \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
- \uHHHH\uHHHH or \xHHHH\xHHHH or \uHHHH\xHHHH or \xHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create a Supplementary Character
- \U00HHHHHH for Code Point / UTF-32 (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
  - NOTE: The documentation originally stated (incorrectly) that this syntax is for Surrogate Pairs; a correction has been submitted (please see “Documentation Improvements and/or Corrections” just below the example code).
  - In creating a test case to prove that the \U escape does not handle surrogate pairs, I found a bug in Mono: if the first 4 hex digits are in the range of 0x8000 – 0xFFFF then they are completely ignored and the last 4 hex digits are processed as if the first four digits were specified as being 0x0000 (i.e. as a regular UTF-16 code unit). I submitted an issue for this: “\U” Unicode escape sequence for strings accepts invalid value instead of raising error #15456.
Console.WriteLine(
    "One to Four hex digits via \\x: W\x9W, X\x09X, Y\x009Y, Z\x0009Z");
Console.WriteLine("");
Console.WriteLine("Always four hex digits via \\u: TAB\u0009TAB");
Console.WriteLine("");
Console.WriteLine("Unicode-only BMP character: (\\x) \x0F02 (\\u) \u0F02");
Console.WriteLine("");
Console.WriteLine(
    "Two UTF-16 Code Units (i.e. Surrogate Pair) via \\x: \xD83D\xDC7E");
Console.WriteLine(
    "Two UTF-16 Code Units (i.e. Surrogate Pair) via \\u: \uD83D\uDC7E");
Console.WriteLine("");
Console.WriteLine("Code Point / UTF-32 via \\U: \U00000F02");
Console.WriteLine("Code Point / UTF-32 via \\U: \U0001F47E");
Console.WriteLine("");
Console.WriteLine("Highest Code Point / UTF-32 via \\U: \U0010FFFF");
WARNING: be careful when using \x with less than 4 hex digits:
Console.WriteLine("-------------------");
Console.WriteLine("\\xA1 followed by a ...");
Console.WriteLine("..non-alphanumeric character ([space]): \xA1 A");
Console.WriteLine("..non-hex digit (Z): \xA1Z");
Console.WriteLine(
    "..hex digit, but intended to be used as itself (A): \xA1Ay, caramba!");
// \xA1Ay returns "ਚy" instead of "¡Ay" because \xA1A produces U+0A1A
Console.WriteLine(
    "\\x00A1 followed by a hex digit (A): \x00A1Aye aye, Captain!");
Documentation Improvements and/or Corrections:
- Fix and improve Unicode escape sequence info (C#) #13162 submitted on 2019-06-28, merged on 2019-07-01.
- Finish improvements to “String Escape Sequences” section of “Strings (C# Programming Guide)” page submitted on 2019-07-09, merged on 2019-07-10.
- “Correctify Unicode-related errors and omissions (in the C# specification): escape sequences”: Issue #2672, Pull Request #2675 submitted on 2019-07-21
F#
F# is a Microsoft .NET language.
See the “Remarks” section of the Strings documentation.
- \DDD for decimal byte notation (“DDD” is always 3 decimal digits; value between 000 and 255 )
  - This escape is effectively ISO-8859-1 (first 256 characters are the same as Unicode)
  - Technically, value can go up to 999, but the resulting character is determined by DDD % 256 (where % is the modulus operator)
- \xHH for hex byte notation (“HH” is always 2 hex digits; value between 01 and FF )
  - NOTE: This escape was originally not documented; that has since been addressed (please see “Documentation Improvements and/or Corrections” just below the example code).
  - This escape is effectively ISO-8859-1 (first 256 characters are the same as Unicode)
  - Output is still UTF-16 (leading “00” is implied: \x41 is really \u0041 )
- \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
- \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create a Supplementary Character
- \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
  - NOTE: The documentation (for “Literals”) originally stated (incorrectly) that this syntax is for Surrogate Pairs; a correction has been submitted (please see “Documentation Improvements and/or Corrections” just below the example code).
printfn "UNDOCUMENTED Decimal (NOT Octal) \\DDD requires 3 digits: TAB\9TAB\09TAB\009TAB";
printfn "\\DDD notation is ISO-8859-1 (U+0000 - U+00FF): {\128-\129-\144-\152-\160-\161}";
printfn "CHAR for \\DDD = (DDD %% 256); Max = \\999 (U+00E7): {\365-\621-\6210-\176-\100-\999-\1000}";
printfn "---------------------";
printfn "UNDOCUMENTED \\x only works with two hex digits: TAB\x9TAB\x090TAB";
printfn "\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F";
printfn "\\x is _not_ creating UTF-8: \xE0\xBC\x82";     // UTF-8 bytes for U+0F02
printfn "---------------------";
printfn "UTF-16 via \\u: \u0F02";                        // U+0F02
printfn "UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E";   // U+1F47E
printfn "---------------------";
printfn "Code Point / UTF-32 via \\U: \U00000F02";       // U+0F02
printfn "Code Point / UTF-32 via \\U: \U0001F47E";
Documentation Improvements and/or Corrections:
- Fix and improve Unicode escape sequence info (F#) #13168 submitted on 2019-06-28, merged on 2019-07-01.
- More String escape sequence improvements (F#) submitted on 2019-07-05, merged on 2019-07-08.
- Source code comments: Fix Supplementary Character / Surrogate Pair info (no code changes) submitted on 2019-07-12, merged on 2019-07-13.
Microsoft Visual C++ / C-Style
The “Escape Sequences” and “Universal character names” sections of the String and Character Literals (C++) documentation state:
- \888 for an encoding-dependent character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 777 )
- \xHHHH for an encoding-dependent character (“HHHH” is 1 – 4 hex digits: \xH, \xHH, \xHHH, or \xHHHH; value between 0 and FFFF )
  - WARNING: be careful when specifying less than 4 hex digits. If the characters that immediately follow the escape sequence are valid hex digits, they will be interpreted as being part of the escape sequence. Meaning, \xA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \xA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying all 4 hex digits (e.g. \x00A1 ) would solve the problem. See “Warning” example block below.
- \uHHHH (“HHHH” is always 4 hex digits; value between 0000 and FFFF )
- \U00HHHHHH for Code Point / UTF-32 (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
  - NOTE: Neither \xHHHH\xHHHH nor \uHHHH\uHHHH can be used to represent a Surrogate Pair (i.e. two UTF-16 code units)
#include "stdafx.h"
#include <iostream>

int main()
{
    // In Command Prompt, run the following first to get this console app to return values:
    // CHCP 65001
    std::wcout << u8"\\11 and \\011: tab\11tabby\011tab" << u8"\n";
    std::wcout << u8"\\7, \\07, and \\007: bell\7bell\07bell\007bell" << u8"\n";
    std::wcout << u8"\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237" << u8"\n";
    std::wcout << u8"\\242 = \242 ; \\377 = \377 ; \\777 = \777" << u8"\n"; // \777 == \u01FF
    std::wcout << u8"-------------------------------" << u8"\n";
    std::wcout << u8"\\x works with 1 or 2 hex digits: TAB\x9TAB\x09TAB" << u8"\n";
    std::wcout << u8"\\x works with 3 or 4 hex digits: Yadda\xA1Yadda\xA1AYadda\xA1AAYadda" << u8"\n";
    std::wcout << u8"\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F" << u8"\n";
    std::wcout << u8"\\x is _not_ creating UTF-8: \xE0\xBC\x82" << u8"\n"; // UTF-8 bytes for U+0F02
    std::wcout << u8"-------------------------------" << u8"\n";
    std::wcout << u8"BMP Code Point / UTF-16 via \\u: \u0F02" << u8"\n";
    //std::wcout << L"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << L"\n";   // U+1F47E // compile error
    //std::wcout << u"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << u"\n";   // U+1F47E // compile error
    //std::wcout << u8"UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E" << u8"\n"; // U+1F47E // compile error
    std::wcout << u8"-------------------------------" << u8"\n";
    std::wcout << u8"Code Point / UTF-32 via \\U: \U00000F02" << u8"\n";
    std::wcout << u8"Code Point / UTF-32 via \\U: \U0001F47E" << u8"\n";
    std::wcout << u8"Code Point / UTF-32 via \\U: \U0010FFFF" << u8"\n";
    //std::wcout << u8"Code Point / UTF-32 via \\U: \U00110000" << u8"\n"; // compile error
    std::wcout << u8"-------------------------------" << u8"\n";

    return 0;
}
I could not get the example code shown above to run on “IDE One”, but it did work as expected when compiled in Visual Studio, as a console app, and run from a Command Prompt.
NOTE: Be sure to run the following in a Command Prompt first if you are going to run the example shown above (it sets the code page to UTF-8):
C:\>CHCP 65001
C
- 1 to 4 \xHH for UTF-8 bytes (or whatever encoding the system is using; “HH” is 1 – 2 hex digits: \xH, \xHH; value between 1 and FF )
- \U00HHHHHH for the Code Point / UTF-32 bytes (“HHHHHH” is always 6 hex digits; value between 000001 and 10FFFF )
printf("\\x can escape a single hex digit: TAB\x9TAB");
printf("\n\n");
printf("Three UTF-8 bytes via \\x: \xE0\xBC\x82");    // U+0F02
printf("\n");
printf("Four UTF-8 bytes via \\x: \xF0\x9F\x91\xBE"); // U+1F47E
printf("\n\n");
printf("The \\U syntax requires 8 hex digits (first two are always 0):\n");
printf("Code Point / UTF-32 via \\U: \U00000F02");
printf("\n");
printf("Code Point / UTF-32 via \\U: \U0001F47E");
PHP
The “Double quoted” section of the String documentation states that you can use the following sequences in double quoted, not single quoted, strings:
- All versions of PHP
  - 1 to 4 \888 for single byte / UTF-8 code unit (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
    - Values 0 – 177 (and 400 – 577) map directly to standard ASCII characters of those same byte values.
    - Values 200 – 377 (and 600 – 777) are only valid for constructing proper UTF-8 encodings of characters.
    - Technically, values 400 – 777 are accepted, but they merely equate to that value minus 400 (e.g. 400 == 0, 567 == 167, and 777 == 377). Using these values might result in a warning being thrown (e.g. “PHP Warning: Octal escape sequence overflow \476 is greater than \377”).
  - 1 to 4 \xHH for single byte / UTF-8 code unit (“HH” is 1 – 2 hex digits; value between 0 and FF )
    - Values 0 – 7F map directly to standard ASCII characters of those same byte values.
    - Values 80 – FF are only valid for constructing proper UTF-8 encodings of characters.
- Starting in PHP 7.0.0
  - \u{HHHHHH} for the Code Point / UTF-32 bytes (“HHHHHH” is 1 – 6 hex digits; value between 0 and 10FFFF )
All versions of PHP:
echo "PHP version: ".phpversion()."\n\n";
echo "The following should work in all PHP versions:\n";
echo "\\x can escape a single hex digit: TAB\x9TAB";
echo "\n\n";
echo "Three UTF-8 bytes via \\x: \xE0\xBC\x82";      # U+0F02
echo "\n";
echo "Four UTF-8 bytes via \\x: \xF0\x9F\x91\xBE";   # U+1F47E
echo "\n\n";
echo "Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):";
echo "\n";
echo "\\11 and \\011: tab\11tabby\011tab";
echo "\n";
echo "\\7, \\07, and \\007: bell\7bell\07bell\007bell";
echo "\n";
echo "\\076 = \076 ; \\176 = \176 ; \\476 = \476 ; \\576 = \576";
echo "\n";
echo "UTF-8 bytes for U+0F02: \\340\\274\\202: \340\274\202";
echo "\n";
echo "UTF-8 bytes for U+1F47E: \\360\\237\\221\\276: \360\237\221\276";
echo "\n";
echo "UTF-8 bytes for U+1F47E: \\760\\637\\621\\676: \760\637\621\676";
echo "\n\n";
Starting in PHP 7.0.0:
echo "The following should work starting in PHP version 7.0.0:\n";
echo "Code Point / UTF-32 via \\u{}: \u{0F02}";
echo "\n";
echo "Code Point / UTF-32 via \\u{}: \u{1F47E}";
More info on the “\u{}” syntax
JavaScript
The “Escape notation” section of the String global object documentation states that you can use the following sequences in both double quoted and single quoted strings:
- All versions of JavaScript
  - \888 for ISO-8859-1 character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
  - \xHH for ISO-8859-1 character (“HH” is always 2 hex digits; value between 00 and FF )
  - \uHHHH (“HHHH” is always 4 hex digits; value between 0000 and FFFF )
  - fromCharCode(NNN [, NNN, ...]) function for UTF-16 (“NNN” is 1 or more integer and/or hex UTF-16 values; value between 0 and 65535 / 0xFFFF ). Specify Surrogate Pairs to create Supplementary Characters. See documentation.
    - NOTE: The documentation originally stated (incorrectly) that this function cannot create Supplementary Characters; a correction has been submitted (please see “Documentation Improvements and/or Corrections” just below the example code).
- Newer versions?
  - fromCodePoint(NNN [, NNN, ...]) function for Code Point / UTF-32 (“NNN” is 1 or more integer and/or hex Code Point / UTF-32 values; value between 0 and 1114111 / 0x10FFFF ). See documentation.
    - While updating the documentation for this function, I discovered a bug in the documentation editor in that it crashes with a “500 Internal Server Error” when saving if there are any supplementary characters. Oops. (please see “Documentation Improvements and/or Corrections” just below the example code)
    - I cannot get this function to work on either of the JavaScript versions on IDEOne.com, but the following does work in my browser (Chrome):
      alert("String.fromCodePoint(0x1F47E) = " + String.fromCodePoint(0x1F47E));
- Coming Soon
  - \u{HHHHHH} for the Code Point / UTF-32 code unit (“HHHHHH” is 1 – 6 hex digits; value between 0 and 10FFFF )
    - Documentation states: “This is an experimental API that should not be used in production code”
    - I cannot get this syntax to work on either of the JavaScript versions on IDEOne.com, but the following does work in my browser (Chrome):
      alert("\\u{1F47E} = \u{1F47E}");
// \x9 throws an error when using JavaScript (SMonkey 24.2.0).
print("\\x only works with two hex digits: TAB\x9TAB\x090TAB");
print("\\x is ISO-8859-1: 0x80 = \x80, 0x81 = \x81, 0x90 = \x90, 0x9A = \x9A, 0x9F = \x9F");
print("\\x is _not_ creating UTF-8: \xE0\xBC\x82"); // UTF-8 bytes for U+0F02
print("");
print("BMP Code Point / UTF-16 via \\u: \u0F02");
print("UTF-16 Surrogate Pair via \\u: \uD83D\uDC7E"); // U+1F47E
print("");
// \u{} throws an error when using JavaScript (SMonkey 24.2.0).
print("\\u{} is noted as being \"experimental, should not be used in production code\":");
print("Code Point / UTF-32 via \\u{}: \u{0F02}");  // NO EFFECT (YET!!!)
print("Code Point / UTF-32 via \\u{}: \u{1F47E}"); // NO EFFECT (YET!!!)
print("-------------------------------");
print("UTF-16 via String.fromCharCode(decimal): " + String.fromCharCode(3842));
print("UTF-16 via String.fromCharCode(hex): " + String.fromCharCode(0x0F02));
print("");
print("UTF-16 Surrogate Pair via String.fromCharCode(decimal): " + String.fromCharCode(55357, 56446));
print("UTF-16 Surrogate Pair via String.fromCharCode(hex): " + String.fromCharCode(0xD83D, 0xDC7E));
print("");
print("Multiple UTF-16 via String.fromCharCode(decimal): " + String.fromCharCode(3842, 32, 55357, 56446));
print("Multiple UTF-16 via String.fromCharCode(hex): " + String.fromCharCode(0x0F02, 0x20, 0xD83D, 0xDC7E));
print("-------------------------------");
// Like \x, the octal escape sequence uses the ISO-8859-1 character set
print("Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):");
print("\\11 and \\011: tab\11tabby\011tab");
print("\\7, \\07, and \\007: bell\7bell\07bell\007bell");
print("\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237");
print("\\242 = \242 ; \\377 = \377 ; \\504 = \504");
print("-------------------------------");
// String.fromCodePoint() raises an error in both (rhino 1.7.7) and (SMonkey 24.2.0).
//print("Code Point / UTF-32 via String.fromCodePoint(decimal): " + String.fromCodePoint(3842));
//print("Code Point / UTF-32 via String.fromCodePoint(hex): " + String.fromCodePoint(0x0F02));
See JavaScript demo on “IDE One”
Documentation Improvements and/or Corrections:
- fromCodePoint() – July 3rd and 4th, 2019
- fromCharCode() – July 4th, 2019
- “Escape notation” section of the String global object – July 4th, 2019
- Bug found: Documentation editor does not allow saving Unicode Supplementary Characters – reported July 3rd, 2019
Julia
The “Characters” and “Byte Array Literals” sections of the main “Strings” documentation state that you can use the following sequences in both double quoted strings and single quoted character literals:
- WARNING: All escape sequences in Julia are variable length. Be careful when specifying less than the maximum number of digits. If the characters that immediately follow the escape sequence are valid hex or octal digits (depending on the type of escape sequence being used), they will be interpreted as being part of the escape sequence. Meaning, \uA1 produces “¡”, but if the next character is “A” (or “a”), then it will instead be interpreted as being \uA1a and produce “ਚ”, which is Code Point U+0A1A. In such cases, specifying the maximum number of digits for that type of escape sequence (e.g. \u00A1 ) would solve the problem. See “Warning” example block below.
- 1 to 4 \888 for single byte / UTF-8 code unit (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
  - Technically, values 400 – 777 are accepted, but they merely equate to that value minus 400 (e.g. 400 == 0, 567 == 167, and 777 == 377)
  - Values 0 – 177 (and 400 – 577) map directly to standard ASCII characters of those same byte values.
  - Values 200 – 377 (and 600 – 777) are only valid for constructing proper UTF-8 encodings of characters.
- 1 to 4 \xHH for single byte / UTF-8 code unit (“HH” is 1 – 2 hex digits; value between 0 and FF )
  - Values 0 – 7F map directly to standard ASCII characters of those same byte values.
  - Values 80 – FF are only valid for constructing proper UTF-8 encodings of characters.
- \uHHHH for BMP Code Point (“HHHH” is 1 – 4 hex digits; value between 0 and FFFF )
  - \u cannot be used to specify pairs of Surrogate Code Points (i.e. Surrogate Pairs) to create Supplementary Characters.
- \UHHHHHHHH for any Code Point / UTF-32 code unit (“HHHHHHHH” is 1 – 8 hex digits; value between 0 and 0010FFFF )
Testing done with command-line julia.exe Version 1.2.0 (2019-08-20).
julia> # \x works with one or two hex digits:

julia> print("TAB\x9TAB\x09TAB")
TAB     TAB     TAB
julia> # \x is directly encoding UTF-8; it is not ISO-8859-1:

julia> print("\\xC1 should be Á, but here it's: \xC1")
\xC1 should be Á, but here it's: �
julia> # UTF-8 bytes for U+0F02:

julia> codepoint('\xE0\xBC\x82')
0x00000f02

julia> # UTF-8 bytes for U+1F47E:

julia> codepoint('\xF0\x9F\x91\xBE')
0x0001f47e

julia> ##################################################

julia> # Like \x, the octal escape sequence injects single bytes into a UTF-8 encoding.

julia> # Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):

julia> print("\\11 and \\011: tab\11tabby\011tab")
\11 and \011: tab       tabby   tab
julia> print("\\7, \\07, and \\007: Bell\7Bell\07Bell\007Bell")
\7, \07, and \007: BellBellBellBell
julia> # UTF-8 bytes for U+0F02:

julia> codepoint('\340\274\202')
0x00000f02

julia> # UTF-8 bytes for U+1F47E:

julia> codepoint('\360\237\221\276')
0x0001f47e

julia> ##################################################

julia> # BMP Code Point (U+0000 - U+FFFF) via \u:

julia> codepoint('\uF02')
0x00000f02

julia> codepoint('\u0F02')
0x00000f02

julia> # \u produces code points, not bytes:

julia> print("\xE0\xBC\x82 as opposed to: \uE0\uBC\u82")
? as opposed to: à¼?
julia> ##################################################

julia> # BMP and Supplementary Character Code Points (U+0000 - U+10FFFF) via \U:

julia> codepoint('\UF02')
0x00000f02

julia> codepoint('\U0F02')
0x00000f02

julia> codepoint('\U000F02')
0x00000f02

julia> codepoint('\U00000F02')
0x00000f02

julia> codepoint('\U1F47E')
0x0001f47e

julia> codepoint('\U1F47E')
0x0001f47e

julia> codepoint('\U01F47E')
0x0001f47e

julia> codepoint('\U0001F47E')
0x0001f47e

julia> # \U produces code points, not bytes:

julia> print("\xE0\xBC\x82 as opposed to: \UE0\UBC\U82")
? as opposed to: à¼?
julia> print("\xF0\x9F\x91\xBE as opposed to: \UF0\U9F\U91\UBE")
� as opposed to: ð??¾
Documentation Improvements and/or Corrections:
- Correct doc for \U escape sequence under Base/Strings/unescape_string #33285 – September 16th, 2019
- Correct code point format in Base/Char/show function #33291 – September 16th, 2019
Java
The “3.3. Unicode Escapes” section of the “Chapter 3. Lexical Structure” documentation, as well as the “3.10.6. Escape Sequences for Character and String Literals” section, state that you can use the following escape sequences in strings:
- \888 for ISO-8859-1 character (“888” is 1 – 3 octal digits [0 – 7]; value between 0 and 377 )
- \uHHHH for BMP character (“HHHH” is always 4 hex digits; value between 0001 and FFFF )
- \uHHHH\uHHHH for a Surrogate Pair / Two UTF-16 Code Units to create a Supplementary Character
- String(int[] codePoints, int offset, int count) String constructor. Specify pairs of int code units (i.e. Surrogate Pairs) to create Supplementary Characters. See documentation.
- Character.toChars(int codePoint) static method. This will return a char[] containing one element if codePoint represents a BMP character, else two elements (the Surrogate Pair) if it represents a Supplementary Character. See documentation.
- Character.toChars(int codePoint, char[] dst, int dstIndex) static method. This will replace one element of the char[] if codePoint represents a BMP character, else two elements (the Surrogate Pair) if it represents a Supplementary Character. See documentation and example below.
import java.util.*;
import java.lang.*;
import java.io.*;

class SqlQuantumLeap
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // The octal escape sequence uses the ISO-8859-1 character set
        System.out.println("Octal notation is \\888 where '888' = 1 - 3 octal digits (values 0 - 7; range 0 - 377):");
        System.out.println("\\11 and \\011: tab\11tabby\011tab");
        System.out.println("\\7, \\07, and \\007: bell\7bell\07bell\007bell");
        System.out.println("\\176 = \176 ; \\177 = \177 ; \\200 = \200 ; \\237 = \237");
        System.out.println("\\242 = \242 ; \\377 = \377 ; \\504 = \504");
        System.out.println("-------------------------------");

        System.out.println("BMP Code Point / UTF-16 via \\u: \u0F02");
        System.out.println("UTF-16 Surrogate Pair via \\u\\u: \uD83D\uDC7E"); // U+1F47E
        System.out.println("-------------------------------");

        System.out.println("String constructor: " +
            new String(new int[]{ 0x0F02, 32, 65, 32, 0xD83D, 0xDC7E }, 0, 6)); // U+1F47E

        char[] tc1 = Character.toChars(0x0F02);
        System.out.println("Character.toChars(int) static method (codePoint = U+0F02):");
        System.out.println("   Size of array returned for BMP Character: " + tc1.length);
        System.out.println("   String created from char[]: " + new String(tc1));

        char[] tc2 = Character.toChars(0x1F47E);
        System.out.println("Character.toChars(int) static method (codePoint = U+1F47E):");
        System.out.println("   Size of array returned for Supplementary Character: " + tc2.length);
        System.out.println("   String created from char[]: " + new String(tc2));

        char[] tc3 = new char[] { 65, 66, 67, 68, 69, 70 };
        System.out.println("Character.toChars(int, char[], int) static method (codePoint = U+1F47E):");
        System.out.println("   Initial String created from char[]: " + new String(tc3));
        Character.toChars(0x1F47E, tc3, 2); // insert into middle, between spaces
        System.out.println("   String created from char[] after Character.toChars(): " + new String(tc3));
    }
}
Excel / VBA
- Pre-Excel 2013
  - Create the following VBA function. You might need to “Show Developer tab in the Ribbon”, and the steps to do that differ between versions of Excel:
    - Office button
    - “Excel Options” button
    - Go to tab:
      - Older Excel: “Customize” tab
      - Newer Excel: “Quick Access Toolbar” tab
    - Select “Visual Basic” from list of commands on left side. If this command is not in the list, you might need to select “All Commands” or “Developer Tab” from the drop-down above the list of commands.
    - “Add >>” button
    - “OK” button
  - Adapted from @stema’s answer on SuperUser.StackExchange
  - Click the “Visual Basic” button
  - Insert a new Module with the following contents:
Function UnicodeFromInt(val As Long)
    If val < 0 Or val > 1114111 Then
        UnicodeFromInt = "ERROR: value must be between 0 and 1114111!!"
        GoTo GetOut
    End If

    If val >= 55296 And val <= 57343 Then
        UnicodeFromInt = "ERROR: surrogate code points are not displayable!!"
        GoTo GetOut
    End If

    If val < 65536 Then
        ' BMP Code Point: a single UTF-16 code unit
        UnicodeFromInt = ChrW(val)
    Else
        ' Supplementary Character: build the UTF-16 Surrogate Pair.
        ' 55232 = 0xD800 - (0x10000 \ 1024) and 56320 = 0xDC00, so this is the
        ' standard high / low surrogate calculation.
        UnicodeFromInt = ChrW(55232 + Int(val / 1024)) & ChrW(56320 + Int(val Mod 1024))
    End If

GetOut:
End Function

Function UnicodeFromHex(val As String)
    UnicodeFromHex = UnicodeFromInt("&H" & val)
End Function
- Starting in Excel 2013
  - UNICHAR(DDDD) function (“DDDD” is an integer value from 1 to 1114111, representing a Unicode Code Point). See documentation.
Pre-Office 2013:
=UnicodeFromInt(3842)
=UnicodeFromHex("F02")
=UnicodeFromHex("0F02")
=UnicodeFromInt(128126)
=UnicodeFromHex("1F47E")
=UnicodeFromHex("01F47E")
Starting in Office 2013:
=UNICHAR(3842)
=UNICHAR(128126)