The new UNISTR function is well intentioned, but more problematic than beneficial

(last updated: 2024-11-01 @ 01:15 ET / 2024-11-01 @ 05:15 UTC )

On June 4th, 2024, Abhiman Tiwari published a blog post titled, “Announcing UNISTR and || operator in Azure SQL Database – preview“. The first half of that post describes a new function, UNISTR (currently only available in Azure SQL Database; I haven’t tried Azure SQL Managed Instance). The purpose of this function is to make it easier to inject any Unicode character, which can be done by specifying either the character’s Unicode code point or its UTF-16 surrogate pair. Unfortunately, the syntax for specifying code points is obscure when it could (and should) be more standard.
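For reference, here is a minimal sketch of the two escape forms as described in that announcement (it should run only in Azure SQL Database while the function remains in preview); the character chosen is just an example:

-- U+1F603 "SMILING FACE WITH OPEN MOUTH", expressed both ways:
SELECT UNISTR(N'\D83D\DE03') AS [Via UTF-16 surrogate pair], -- two 4-hex-digit code units
       UNISTR(N'\+01F603')   AS [Via Unicode code point];    -- one 6-hex-digit code point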

Since the new UNISTR function doesn’t provide new functionality, only convenience (“syntactic sugar” as some would say; see comment below), I would argue that it should not only use a more standard syntax, but also not waste the opportunity to provide more substantive convenience by handling several commonly used escape sequences. I suspect that the number of times people would use “\n” is several orders of magnitude greater than the number of times people would inject emojis or other non-keyboard characters. Even better would be to incorporate common escape sequences into standard string parsing.
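To illustrate what I mean by commonly used escape sequences: today, embedding a tab or a line break in a string literal requires concatenating CHAR() / NCHAR() calls, something along these lines:

-- Current approach: build control characters from their code points and concatenate
SELECT N'Line 1' + NCHAR(13) + NCHAR(10) + N'Line 2' AS [CRLF via concatenation],
       N'Col 1'  + NCHAR(9)  + N'Col 2'              AS [Tab via concatenation];
-- Native handling of "\r\n" and "\t" in string literals (or even in UNISTR) would
-- remove the need for this concatenation.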

I submitted the following comment on July 30, 2024. As of October 31, 2024 it is still “awaiting moderation” and new comments have been disabled.

I appreciate the effort the team has put into making it easier to work with Unicode, but please, please, please, please, please, please, do not release the UNISTR() built-in function. I understand that time has already been spent in planning, developing, and testing, but the end result is not only not in the best interest of SQL Server or the end users, it’s actually counter-productive.

The first problem is: the function itself provides almost no benefit to users. The ability to inject any Unicode character has been available since SQL Server 7.0 (1998; approx. 26 years) via NCHAR(0 / 0x0000 – 65535 / 0xFFFF) for BMP characters and NCHAR(High Surrogate) + NCHAR(Low Surrogate) for Supplementary characters. The ability to inject Supplementary Characters directly from their code point (rather than their surrogate pair) has been available since SQL Server 2012 (12 years) via NCHAR(>= 65536 / 0x010000) in databases with an “_SC” or “_140_” default collation. The only real benefit of UNISTR() is for creating Supplementary Characters using their Unicode Code point (i.e. >= 0x10000) when the current DB is not using either an “_SC” or “_140_” default collation. A minor benefit of UNISTR() is slightly less concatenation when injecting multiple Unicode characters.

I have confirmed this in both SQL Server 7.0 and 2012:

-- I may have needed to add Unicode fonts and even update the registry to
-- support supplementary characters in order for them to display correctly in
-- Windows XP / Query Analyzer
----------------------
GO
USE [master];
SELECT SUBSTRING(@@VERSION, 1, 25) AS [SQL Server Version],
     '--' AS [--],
     NCHAR(0xD83D) + NCHAR(0xDE03) AS [UTF-16]; -- same as: UNISTR(N'\D83D\DE03')

----------------------
/*
Microsoft SQL Server  7.0   --  😃
*/

SQL Server 2012 and newer:

GO
USE [master];
SELECT SUBSTRING(@@VERSION, 1, 25) AS [SQL Server Version],
     '--' AS [--],
     DB_NAME() AS [DB Name],
     '--' AS [--],
     DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DB Default Collation],
     '--' AS [--],
     NCHAR(0xD83D) + NCHAR(0xDE03) AS [UTF-16], -- same as: UNISTR(N'\D83D\DE03')
     '--' AS [--],
     NCHAR(0x1F603) AS [CodePoint]; -- same as: UNISTR(N'\+01F603')

CREATE DATABASE [_UnicodeTest] COLLATE Latin1_General_100_CI_AS_SC;
GO
USE [_UnicodeTest];
GO
SELECT SUBSTRING(@@VERSION, 1, 25) AS [SQL Server Version],
     '--' AS [--],
     DB_NAME() AS [DB Name],
     '--' AS [--],
     DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DB Default Collation],
     '--' AS [--],
     NCHAR(0xD83D) + NCHAR(0xDE03) AS [UTF-16], -- same as: UNISTR(N'\D83D\DE03')
     '--' AS [--],
     NCHAR(0x1F603) AS [CodePoint]; -- same as: UNISTR(N'\+01F603')

USE [master];
GO
DROP DATABASE [_UnicodeTest];
GO
----------------------
/*
SQL Server Version         --  DB Name       --  DB Default Collation          --  UTF-16  --  CodePoint
Microsoft SQL Server 2012  --  master        --  SQL_Latin1_General_CP1_CI_AS  --  😃      --  NULL
Microsoft SQL Server 2012  --  _UnicodeTest  --  Latin1_General_100_CI_AS_SC   --  😃      --  😃
*/

---------------------------------------------------------------------------------------------------

The second problem is: the syntax is obscure and even potentially problematic, and thus will be harder to use and maintain than a more commonly used escape sequence would be. The \ and \+ syntax seems to be PostgreSQL-only. CSS uses a similar \{CodePoint} syntax, ranging from \0 to \10FFFF. Is there a specific business-related desire to have parity with what is essentially a proprietary syntax that is rarely used?

In addition to the obscurity issue, the proposed syntax can be problematic because a leading \ followed by digits is often the escape sequence for octal notation (e.g. \010 = CHAR(8) in C / C++, Java, JavaScript, Julia, PHP, and others). F# uses \ for creating ISO-8859-1 characters.
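To make the comparison concrete, here is a quick sketch of how the same character reads across the syntaxes mentioned above (the proposed T-SQL forms are taken from the preview announcement; the CSS, C-style, and .NET forms are shown in comments purely for comparison):

-- Proposed T-SQL (PostgreSQL-style):  \D83D\DE03  or  \+01F603
-- CSS:                                \1F603
-- .NET / C#:                          \uD83D\uDE03  or  \U0001F603
-- C / C++ / Java / PHP (octal):       "\010" already means CHAR(8), so the bare
--                                     backslash-plus-digits pattern is overloaded
SELECT NCHAR(0xD83D) + NCHAR(0xDE03) AS [U+1F603 via NCHAR (works today)];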

Overall, the UNISTR function is side-stepping the real issue and is likely to divert attention and resources from providing a real, substantive solution that will provide real benefit to nearly all customers. The real issue is that standard escape sequences should be natively supported as part of normal string literal parsing. I have created a feedback suggestion for this here:

Support .NET string escape sequences (e.g. unescape ‘\n’ into CHAR(10), etc)

I added the following note to that suggestion:

Since it’s possible (even if unlikely) that current code has such sequences and does not expect any translation, this behavior should be controlled by a config option at the database level, the instance level, or both (preferred, and similar to Cross-Database Ownership Chaining). This approach would also allow current app code to take advantage of this new ability without the need for any (or much) modification (similar to how adding UTF-8 collations in SQL Server 2019 allowed existing app code to handle Unicode without needing to modify the code to prefix strings with “N”).
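For context, this is the existing two-level pattern I’m referring to with Cross-Database Ownership Chaining, shown here only as an analogy for how a new escape-sequence option could be scoped ([SomeDB] is a placeholder name):

-- Instance-level default (applies to all databases unless overridden):
EXEC sys.sp_configure N'cross db ownership chaining', 1;
RECONFIGURE;
-- Database-level override:
ALTER DATABASE [SomeDB] SET DB_CHAINING ON;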

If there’s some reason that it’s technically problematic to handle escape sequences in string literals such that handling them requires a function, then at the very least please use the more standard syntax (that mostly mirrors what .NET uses) and not the PostgreSQL-only \ and \+ syntax.
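To be clear about what I’m asking for, a hypothetical UNISTR call using the more standard .NET-style escapes might look like the commented lines below (that syntax is not supported today; only the last line reflects the actual preview syntax, and it runs only in Azure SQL Database):

-- HYPOTHETICAL syntax (not valid today), mirroring .NET escape sequences:
--   UNISTR(N'\uD83D\uDE03')   -- UTF-16 surrogate pair
--   UNISTR(N'\U0001F603')     -- full Unicode code point (8 hex digits)
-- Actual preview syntax for the same character:
SELECT UNISTR(N'\+01F603') AS [Preview syntax];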

 

I really hope someone over at Microsoft gives strong consideration to the points I’ve made above. It’s clear that PostgreSQL is getting more popular, given how often it’s mentioned in SQL Server Central articles and in SQL PASS Summit sessions. However, I see no benefit in copying a mistake that PostgreSQL (and, to a degree, CSS) made years ago by adopting an obscure syntax that virtually nobody else uses (I haven’t checked every language and environment, but in several of them that same pattern is the octal escape sequence). I fear that by moving forward with this obscure syntax, Microsoft will paint itself into a corner with a function that works, but not in the way that is most natural and beneficial for users, and that will simply become technical debt once a function that works more ideally is provided. So, even if it’s out of scope to implement the standard (and widely used) “\n” and “\t” escape sequences, and/or to incorporate escape sequences into standard string parsing, at least go with the more widely used “\u” and “\U” syntax, which is also what .NET uses. Thanks 😺.


Post Update History
  • 2024-10-31 @ 17:50 ET / 2024-10-31 @ 21:50 UTC — Initial posting
  • 2024-11-01 @ 01:15 ET / 2024-11-01 @ 05:15 UTC — Minor edits
