The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)

DaVinciCode-PuzzleBox

(last updated: 2018-04-26)

This is Part 3 of 2. Yes, you read that correctly. You see, way back at the very beginning (i.e. in Part 1), I mentioned:

…based on my experiences, it seems that more often than not, doing an exhaustive test results in a slightly different answer that invalidates the previous conclusion (which was based on the limited test). So, while it does take more time to do more extensive testing, it seems like we have little choice if we truly want to know how these things actually work.

As much as I appreciate irony, I was about to move on, having done minimal testing on delimited identifiers and assuming that there wasn’t much else of interest there. After all, the original question which sent me looking into identifiers was specific to regular (i.e. non-delimited) identifiers, and with delimited identifiers, everything just works, right? Right??

Well…

More Testing

Same as before: the only way to truly know is to check everything.

Variables

Now let’s try to declare a variable with a delimited name. We can first try to use delimiters with a name that would work even without them, just to make it easy: if this doesn’t work, then it can only be due to the delimiters and not the other characters being used.

DECLARE [@x] INT;
/*
Msg 155, Level 15, State 2, Line XXXXX
'INT' is not a recognized CURSOR option.
*/

Nope, that did not work. And, the @ is being ignored as we can see here (since we get the exact same error, even without the @):

DECLARE [x] INT;
/*
Msg 155, Level 15, State 2, Line XXXXX
'INT' is not a recognized CURSOR option.
*/

Valid Characters

Testing for valid / invalid characters will be slightly tricky for a couple of reasons:

  1. We need to make sure that we execute within a Database that has a default Collation that is Supplementary Character-aware. This did not matter when testing for valid characters for regular identifiers since those cannot use Supplementary Characters. But now that we will be testing all Unicode characters, we need to make sure that the NCHAR() function can generate Supplementary Characters, and that requires being in a Database where the default Collation is Supplementary Character-aware (meaning: either name ends with _SC, or had _140_ in the name).
    Technically speaking, we don’t need to use the NCHAR() function, but it just makes things so much easier.
  2. When testing regular identifiers we had 65,536 characters to test, and there was definitely a noticable performance difference between testing “first” characters (which required creating an object) and testing “subsequent” characters (which could be done by declaring a variable). Declaring the variable was much faster, which makes sense as there is very little overhead: it does not take and release any locks, etc. But now we have 230,000 Code Points to test 1, which is almost 4 times as many Code Points as before, so testing by creating an object will take that much longer.
    There aren’t many options for identifiers that are not objects. Testing by declaring a variable is prefered, but as we just saw in the previous test, variables cannot be delimited. A table variable would allow for specifying a column name as a delimited identifier, but it still requires allocation / deallocation in tempdb. I did try it and it was taking a while to complete. There is a column alias in a SELECT statement, but we absolutely do not want 230,000 result sets returned to SSMS!! However, we can use a table alias. That requires a query, so we just need a query that won’t return a result set. And for that we can simply set a variable.

SETUP

CREATE DATABASE [DelimitedIdentifiers] COLLATE Latin1_General_100_CI_AS_SC;
GO
USE [DelimitedIdentifiers];

CREATE TABLE dbo.InvalidCharacter
(
  [CodePoint]    BINARY(2) NOT NULL
                   CONSTRAINT [PK_InvalidCharacter] PRIMARY KEY CLUSTERED
                   WITH (FILLFACTOR = 100)
);

TEST

SET NOCOUNT ON;
DECLARE @Index     INT = 0,
        @SQL       NVARCHAR(MAX),
        @Counter   INT = 0,
        @TheChar   NVARCHAR(2),
        @CodePoint NVARCHAR(10);

WHILE (@Index < 230000)
BEGIN
  SET @TheChar = NCHAR(@Index);
  SET @SQL = N'DECLARE @Dummy INT; SELECT @Dummy = [' + @TheChar +
               N'].col FROM (VALUES (2)) [' + @TheChar + N'](col);';
  --SET @SQL = N'DECLARE @Dummy TABLE ([' + @TheChar + N'] INT);'; --SLOWER
  SET @CodePoint = CONVERT(NVARCHAR(10), CONVERT(BINARY(4), @Index), 1);

  BEGIN TRY

    RAISERROR(N'%6d: %s', 10, 1, @Index, @SQL) WITH NOWAIT;
    EXEC (@SQL);

  END TRY
  BEGIN CATCH
    SET @Counter += 1;

    INSERT INTO dbo.InvalidCharacter ([CodePoint])
      VALUES (CONVERT(VARBINARY(2), @Index));

    RAISERROR(N'** %6d: Code Point %s ( %s )', 10, 1,
              @Counter, @CodePoint, @TheChar) WITH NOWAIT;
  END CATCH;

  SET @Index += 1;
END;

When that completes, you can check the “Messages” tab to see everything that was attempted.

REVIEW RESULTS

But even after that output goes away upon executing the next query, we can check the invalid characters that we saved to the table:

SELECT [CodePoint],
       CONVERT(INT, [CodePoint]) AS [CodePointInt],
       NCHAR([CodePoint]) AS [TheChar]
FROM   dbo.InvalidCharacter;

/*
CodePoint    CodePointInt   TheChar
0x0000                  0
0x005D                 93     ]
0xFFFF              65535      
*/

As you can see from the results (at the bottom), there are only 3 characters that are invalid. And looking more closely, one of them — ]is valid, it just needs to be escaped. So truly there are only two characters that are invalid.

Escaping

The QUOTENAME function is quite handy as it not only applies the delimiters to both sides of the input value, but it also escapes embedded delimiters:

SELECT QUOTENAME(N'a[b]c''d');
-- [a[b]]c'd]

As you can see from the result, it only escaped the right square bracket ( ] ), and did so by duplicating it. Copy and paste that result from the grid into the following statement and execute it:

SELECT 1 AS [a[b]]c'd];

And it works, showing ( a[b]c'd ) as the column name. Next let’s try double-quotes. This time, we need to pass in the second parameter to QUOTENAME since we are not using the default character this time:

SELECT QUOTENAME(N'a"b"c''d', N'"');
-- "a""b""c'd"

As you can see from the result, it escaped the double-quote ( " ) by duplicating it. Copy and paste that result from the grid into the following statement and execute it:

SELECT 2 AS "a""b""c'd";

And it works, showing ( a"b"c'd ) as the column name. Next let’s try the two invalid characters:

SELECT QUOTENAME(N'a' + NCHAR(0x0000) + N'd');
-- NULL

SELECT QUOTENAME(N'a' + NCHAR(0xFFFF) + N'd');
-- [a d]
SELECT 3 AS [a d];
/*
Msg 1055, Level 16, State 1, Line XXXXX
'..' is an invalid name because it contains a NULL character or an
    invalid unicode character.
*/

So, the first character, U+0000, NULLs out the string. Interesting. The second character, U+FFFF, causes an error. Good to know.

Supplementary Characters

In the documentation for “Collation and Unicode Support: Supplementary Characters” it states:

If you use supplementary characters:

  • Supplementary characters can be used in ordering and comparison operations in collation versions 90 or greater.
  • Supplementary characters are not supported for use in metadata, such as in names of database objects.2

We can validate both of those statements.

First we change the Database’s default Collation to one that does not assign any weights to Supplementary Characters:

ALTER DATABASE [DelimitedIdentifiers] COLLATE Latin1_General_CI_AS;

Next we create a table with a Supplementary Character and it works. But, attempting to add a second one fails due to the comparison reporting that they are the same (since both appear as an empty string in terms of sort weights):

CREATE TABLE dbo.[🀀] ([col1] INT); -- Mahjong Tile East Wind ( U+1F000 )
-- Success

CREATE TABLE dbo.[🀗] ([col1] INT); -- Mahjong: Eight of Bamboos ( U+1F017 )
/*
Msg 2714, Level 16, State 6, Line XXXXX
There is already an object named '🀗' in the database.
*/

Now we change the Database’s default Collation to a version 90 Collation (i.e. one that was introduced in SQL Server 2005, which is version 90 of SQL Server, which is really “9.0” without the decimal). Once the Database is operating with the new Collation we can create the Table that failed in the previous test since version 90 Collations can see Supplementary Characters as distinct. However, once we have added that new Table, we can no longer change the Database’s Collation to one that didn’t apply any sort weights to Supplementary Characters since they would again compare as being the same thing and object names must be unique (within a Schema, of course).

ALTER DATABASE [DelimitedIdentifiers] COLLATE Macedonian_FYROM_90_CI_AS;

CREATE TABLE dbo.[🀗] ([col1] INT); -- Mahjong: Eight of Bamboos (U+1F017)
-- Success!


ALTER DATABASE [DelimitedIdentifiers] COLLATE SQL_Latin1_General_CP1_CI_AS;
/*
Msg 1505, Level 16, State 1, Line XXXXX
The CREATE UNIQUE INDEX statement terminated because a duplicate key was
    found for the object name 'dbo.sysschobjs' and the index name 'nc1'.
    The duplicate key value is (0, 1, 🀗).
Msg 5072, Level 16, State 1, Line XXXXX
ALTER DATABASE failed. The default collation of database
    'DelimitedIdentifiers' cannot be set to SQL_Latin1_General_CP1_CI_AS.
*/

Binary Collations, even pre-version 90 Collations, were not mentioned in the documentation as ones that allow for ordering and comparison operations with Supplementary Characters, but it would make sense for them to work since they don’t use sort weights, they just use the byte / Code Point values. As you can see below, a binary Collation (not to mention a SQL Server Collation) works just fine:

ALTER DATABASE [DelimitedIdentifiers] COLLATE SQL_Latin1_General_CP850_BIN;
-- Success!

Next we drop one of the two tables, and we add another table having a two-character name. If Supplementary Characters have a sort weight but it just happens to be the same for all of them, then two tables, each with names containing a different number of Supplementary Characters will work. But, if there is no sort weight added to any of them, then this will not work in a Database using a Collation that is neither binary nor version 90 or newer.

-- Remove one of the two single-character tables:
DROP TABLE dbo.[🀗];

CREATE TABLE dbo.[🀗🀗] ([col1] INT); -- Eight of Bamboos ( U+1F017 )

ALTER DATABASE [DelimitedIdentifiers] COLLATE SQL_Latin1_General_CP1_CI_AS;
/*
Msg 1505, Level 16, State 1, Line XXXXX
The CREATE UNIQUE INDEX statement terminated because a duplicate key
   was found for the object name 'dbo.sysschobjs' and the index name
   'nc1'. The duplicate key value is (0, 1, 🀗🀗).
*/


ALTER DATABASE [DelimitedIdentifiers] COLLATE Latin1_General_CI_AS;
-- same error

That did not work, so we know that Supplementary Characters have no sort weights at all (i.e. they are all effectively “0” / zero). Next we drop the Table with the two-character name and create a new Table with two regular US English letter “o”s. The regular letters make the name unique so that we can finally change the Collation to one that does not specify any sort weights for Supplementary Characters:

DROP TABLE dbo.[🀗🀗];

CREATE TABLE dbo.[🀗o🀗o🀗] ([Col_1234] INT); -- Eight of Bamboos (U+1F017)

ALTER DATABASE [DelimitedIdentifiers] COLLATE Latin1_General_CI_AS;
-- Success!

Next we verify that the number of Supplementary Characters makes no difference when using a non-binary, pre-version 90 Collation:

CREATE TABLE dbo.[🀟🀟] ([col1] INT); -- Mahjong Tile Seven of Circles
-- error 2714
CREATE TABLE dbo.[🀟🀟🀟🀟🀟🀟] ([col1] INT); -- Seven of Circles (U+1F01F)
-- error 2714


CREATE TABLE dbo.[🀟🀟o🀟🀟o🀟🀟] ([col1] INT); -- Tile: Seven of Circles
-- error 2714

CREATE TABLE dbo.[🀟🀟🀟o🀟🀟🀟🀟🀟🀟o🀟🀟🀟🀟] ([col1] INT); -- (U+1F01F)
/*
Msg 2714, Level 16, State 6, Line XXXXX
There is already an object named '🀟🀟🀟o🀟🀟🀟🀟🀟🀟o🀟🀟🀟🀟'
    in the database.
*/

Finally we see that the problem that Supplementary Characters are having is not a problem unique to Supplementary Characters. It seems that many characters (almost a full third, actually) have no sort weight assigned to them. So it doesn’t matter what combination they’re used in, nor how many of them:

CREATE TABLE dbo.[🀟ѐ🀟o🀟ѐ🀟🀟ѐ🀟o🀟🀟ѐ🀟] ([col1] INT); -- ( U+1F01F )
/*
Msg 2714, Level 16, State 6, Line XXXXX
There is already an object named '🀟ѐ🀟o🀟ѐ🀟🀟ѐ🀟o🀟🀟ѐ🀟'
    in the database.
*/


-- All 3 of the following statements return the same row:
SELECT [name], [object_id] FROM sys.tables WHERE [name] = N'oo';
SELECT [name], [object_id] FROM sys.tables WHERE [name] = N'🀀o🀗o🀟';
SELECT [name], [object_id] FROM sys.tables WHERE [name] = N'ǶǹȡoɄʩoϿӜ';


-- All 3 of the following statements SELECT from the same Table:
SELECT * FROM dbo.[oo];
SELECT * FROM dbo.[🀀o🀗o🀟];
SELECT * FROM dbo.[ǶǹȡoɄʩoϿӜ];

To be clearer about what is going on with the Collations, the following chart shows how many BMP characters match each of the two characters shown across the top when using the Collations listed on the left:

Collation Name Characters Matching N’ⴥ’
Georgian Small Letter Hoe
( U+2D25 )
Characters Matching N’🀗’
Mahjong Tile Eight of Bamboos
( U+1F017 )
SQL_Latin1_General_CP1_CI_AS
and Latin1_General_CI_AS
21,230 21,230
Macedonian_FYROM_90_CI_AS 13,415 0
Latin1_General_100_CI_AS 2 0

As you can see, sort weights for supplementary characters were added in the version 90 Collations that came with SQL Server 2005.

Temporary Objects

For this test I create a local temporary table using the “Playing Cards 1F0A0–1F0FF” characters (which are Supplementary Characters and hence need to be delimited). Please note that, due to how the Supplementary Characters are created, the following test does not require being executed in a Database using a Supplementary Character-aware Collation.

DECLARE @RoyalFlushSpades NVARCHAR(50);
SET @RoyalFlushSpades = N'#' +
  NCHAR(0xD83C) + NCHAR(0xDCA1) +
  NCHAR(0xD83C) + NCHAR(0xDCAE) +
  NCHAR(0xD83C) + NCHAR(0xDCAD) +
  NCHAR(0xD83C) + NCHAR(0xDCAB) +
  NCHAR(0xD83C) + NCHAR(0xDCAA)

SELECT @RoyalFlushSpades;
-- #🂡🂮🂭🂫🂪

SELECT [name], [object_id]
FROM   tempdb.sys.tables
WHERE  [name] LIKE @RoyalFlushSpades + N'%';

DECLARE @SQL NVARCHAR(MAX);
SET @SQL = N'
  CREATE TABLE [' + @RoyalFlushSpades + N'] (Col1 INT);

  SELECT [name], [object_id]
  FROM   tempdb.sys.tables
  WHERE  [name] LIKE N''' + @RoyalFlushSpades + N'%'';
';

PRINT @SQL;

EXEC (@SQL);

/*
name                         object_id
#🂡🂮🂭🂫🂪___...__000000000005  -1546592471
*/

Transactions

The following works (as long as the character — Editorial Coronis U+2E0E — is delimited):

BEGIN TRAN [⸎];

COMMIT TRAN [⸎];

Conclusion

F-I-N-A-L-L-Y, we have tested all that there is to test. Hopefully. If you know of something that has been overlooked, please leave a comment and I will look into it.

As for the information discovered via the tests show above, that had already been incorporated into the main post, “What’s in a Name?: Inside the Wacky World of T-SQL Identifiers“, as of 2018-04-11. It just took me several days to write it up as a post.


  1. Not all Code Points are mapped to characters (yet). The total possible number of Code Points that can ever be mapped is just over 1.1 million. But so far, the highest Code Point I have found that is mapped is CJK COMPATIBILITY IDEOGRAPH-2F9F4 (Decimal / INT value = 195060). Either way, we will still test that full 230k just to be safe. 
  2. “Not supported” does not necessarily mean “does not work”. Undocumented functions and stored procedures work, but Microsoft does not support their usage and will not help you with them if they do not work as expected. It could also be that this usage did not work when the documentation was first written (for SQL Server 2005) but does work in newer versions, making this statement obsolete and needing to be removed (or at least clarified).
     
    (2018-04-25) Since the initial writing of this post, I have been able to get supplementary characters working in SQL Server 2005 running on Windows XP. The main problem was that initially they did not display as anything but two squares. This was not an issue with SSMS for SQL Server 2005, but with Windows XP itself, and the fonts that come with Windows XP. I will add my notes to the “Supplementary Characters” section above in the next few days. 

1 thought on “The Uni-Code: The Search for the True List of Valid Characters for T-SQL Identifiers, Part 3 of 2 (Delimited Identifiers)”

Thoughts? Questions? Comments? Suggestions? Words of praise?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s