The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers, Part 1

DaVinciCode-PuzzleBox

(last updated: 2018-04-09)

In answering a recent question on DBA.StackExchange related to why some characters work for parameter names and others do not ( How to create Unicode stored procedure parameter names ), I pointed out that the documentation for Database Identifiers states (slightly edited for readability):

Rules for Regular Identifiers

  1. The first character must be one of the following:

    • A letter as defined by the Unicode Standard 3.2. The Unicode definition of letters includes Latin characters from a through z, from A through Z, and also letter characters from other languages.
    • The underscore (_), at sign (@), or number sign (#).

      Certain symbols at the beginning of an identifier have special meaning in SQL Server:

      • A regular identifier that starts with the at sign (@) always denotes a local variable or parameter and cannot be used as the name of any other type of object.
      • An identifier that starts with a number sign (#) denotes a temporary table or procedure.
      • An identifier that starts with double number signs (##) denotes a global temporary object.

      Although the number sign or double number sign characters can be used to begin the names of other types of objects, we do not recommend this practice.

  2. Subsequent characters can include the following:

    • Letters as defined in the Unicode Standard 3.2.
    • Decimal numbers from either Basic Latin or other national scripts.
    • The at sign (@), dollar sign ($), number sign (#), or underscore (_).

  3. Supplementary characters are not allowed.

Ok. So what are the valid “letters” and “decimal numbers” from other national scripts? There are 65,536 Unicode code points (often refered to as “characters”) that are not supplementary characters. While clearly characters in blocks such as Arrows, Mathematical Operators, Box Drawing, Dingbats, etc do not qualify, that still leaves a lot of characters that probably are valid. How can we find the exact list of valid characters?

Step Numero ௧ (that’s a “TAMIL DIGIT ONE” U+0BE7)

The Unicode website has an online research tool, so I started there. Looking through the list of available character properties to filter on, I didn’t see anything for “letter”, but I did find two that looked promising: “Alphabetic” (a boolean) and “Number_Type” (an enumeration which has “Decimal” as an option). Both of those searches are the following two links, each one indicating the number of total code points / characters returned:

[:Age=3.2:] & [:BMP=Yes:] & [:Alphabetic=Yes:] (46,153 Code Points)

[:Age=3.2:] & [:BMP=Yes:] & [:Numeric_Type=Decimal:] (189 Code Points)

Please note:

  • “Age=3.2” includes all characters that were defined as of Unicode 3.2 (meaning it also includes characters added prior to version 3.2)
  • “BMP=Yes” filters out supplementary characters

I created several variables using a variety of letters from the first search and a variety of numbers from the second search, and everything worked. I then tried a few letters from a search that was mostly the same as the first search shown above, but did not have the “& [:BMP=Yes:]” specified so it returned supplementary characters that existed in Unicode 3.2 and are classified as “alphabetic”. The supplementary characters, even though they were alphabetic, did not work.

Everything appears to be working as described in the documentation.

So are we done? Can we call it a day and go home?

Well, are these things ever that easy and/or straight forward?

Step Numero ๒ (that’s a “THAI DIGIT TWO” U+0E52)

Quite often these types of things are not that easy. Yes, it is very tempting to assume that the limited test is good enough and that we did indeed find the exact list of characters (plus we would need to add in the four extra characters: at sign (@), dollar sign ($), number sign (#), and underscore (_)). However, based on my experiences, it seems that more often than not, doing an exhaustive test results in a slightly different answer that invalidates the previous conclusion (which was based on the limited test). So, while it does take more time to do more extensive testing, it seems like we have little choice if we truly want to know how these things actually work.

What that means is, at the very least, we need to get the complete list of characters accepted by SQL Server for non-delimited identifiers to make sure that the totals match the number of code points returned by the searches done in Step 1.

Of course, getting this list of accepted characters is a bit tricky because:

  1. The only way to determine if a character is valid is to not get an error when attempting to use it. So we need to test all 65,535 characters using Dynamic SQL within a TRY...CATCH construct, and record that ones that work in the TRY block.
  2. Testing by declaring a variable would work for “subsequent” characters (of an identifier), but it won’t work for determining “first” characters since the first character of a variable name is always going to be the @ sign.
  3. When testing for “subsequent” characters, there needs to be a valid character that follows the test character since whitespace characters won’t cause an error when placed at the end of an identifier (since they won’t be seen as being part of the identifier)

With all of that in mind, I tested for “first” characters by declaring a table variable and using the test character for a column name, as shown below:

USE [UnicodeTesting];
SET NOCOUNT ON;

-- DROP TABLE dbo.FirstCharacter;
CREATE TABLE dbo.FirstCharacter
(
  [CodePoint]    BINARY(2) NOT NULL
                   CONSTRAINT [PK_FirstChar] PRIMARY KEY CLUSTERED
                     WITH (FILLFACTOR = 100),
  [CodePointInt] AS (ISNULL(CONVERT(INT, [CodePoint]), -1)),
  [TheChar]      AS (ISNULL(NCHAR([CodePoint]), N''))
);


DECLARE @Index     INT = 1,
        @SQL       NVARCHAR(MAX),
        @Counter   INT = 0,
        @TheChar   NCHAR(1),
        @CodePoint NVARCHAR(10);

WHILE (@Index < 65536)
BEGIN
  SET @TheChar = NCHAR(@Index);
  SET @SQL = N'DECLARE @CharTest TABLE (' + @TheChar + N' INT);';
  SET @CodePoint = CONVERT(NVARCHAR(10), CONVERT(BINARY(2), @Index), 1);

  BEGIN TRY

    -- PRINT @SQL;
    EXEC (@SQL);
    SET @Counter += 1;
    INSERT INTO dbo.FirstCharacter ([CodePoint])
      VALUES (CONVERT(VARBINARY(2), @Index));
    RAISERROR(N'%5d: Code Point %s ( %s )', 10, 1,
              @Counter, @CodePoint, @TheChar) WITH NOWAIT;

  END TRY
  BEGIN CATCH
    SET @TheChar = N''; -- swallow (i.e. ignore) the error
  END CATCH;

  SET @Index += 1;
END;
GO

That ran for almost 4 minutes and captured 45,695 characters. Now, the search for “alphabetic” + “BMP” (i.e. non-supplementary) characters as of Unicode 3.2 returned 46,153 code points. And, if we add in the extra two characters noted in the documentation as valid for the first character — # and _ (@ wouldn’t be returned by the T-SQL test due to it not being a valid first character for anything but a variable name) — that gives us a target value of 46,155 characters that the T-SQL test should have returned. Yet, the T-SQL test found 460 characters less than what I was expecting. That’s not good, but we need to test the other half before drawing any conclusions.

Then I tested for valid “subsequent” characters by declaring a variable, placing the test character between two US English letters, as shown below:

-- DROP TABLE dbo.SubsequentCharacter;
SET NOCOUNT ON;
CREATE TABLE dbo.SubsequentCharacter
(
  [CodePoint]    BINARY(2) NOT NULL
                   CONSTRAINT [PK_SubsequentChar] PRIMARY KEY CLUSTERED
                     WITH (FILLFACTOR = 100),
  [CodePointInt] AS (ISNULL(CONVERT(INT, [CodePoint]), -1)),
  [TheChar]      AS (ISNULL(NCHAR([CodePoint]), N''))
);


DECLARE @Index     INT = 1,
        @SQL       NVARCHAR(MAX),
        @Counter   INT = 0,
        @TheChar   NCHAR(1),
        @CodePoint NVARCHAR(10);

WHILE (@Index < 65536)
BEGIN
  SET @TheChar = NCHAR(@Index);
  SET @SQL = N'DECLARE @Char' + @TheChar + N'Test INT;';
  SET @CodePoint = CONVERT(NVARCHAR(10), CONVERT(BINARY(2), @Index), 1);

  BEGIN TRY

    --PRINT @SQL;
    EXEC (@SQL);
    SET @Counter += 1;
    INSERT INTO dbo.SubsequentCharacter ([CodePoint])
      VALUES (CONVERT(VARBINARY(2), @Index));
    RAISERROR(N'%5d: Code Point %s ( %s )', 10, 1,
              @Counter, @CodePoint, @TheChar) WITH NOWAIT;

  END TRY
  BEGIN CATCH
    SET @TheChar = N''; -- swallow (i.e. ignore) the error
  END CATCH;

  SET @Index += 1;
END;
GO

That ran for 40 – 50 seconds and captured 46,543 characters. The search for “numeric_type = decimal” + “BMP” (i.e. non-supplementary) characters as of Unicode 3.2 returned 189 code points. Combining that with the total for valid “first” characters (i.e. 46,155), plus the two extra characters — @ and $ — gives us a target value of 46,346 characters that the T-SQL test should have returned. Yet, the T-SQL test found 197 characters more than what I was expecting. That’s also not good.

Clearly something is amiss. It could be the T-SQL tests, but I have gone over that process several times and it does indeed capture the correct characters. I have tested it on a SQL Server 2017 Developer instance with a Hebrew_100_BIN2 Collation, both SQL Server 2017 and 2012 Express LocalDB instances with a SQL_Latin1_General_CP1_CI_AS Collation, and a SQL Server 2017 Express instance with a Korean_100_CS_AS_KS_WS_SC Collation. The T-SQL results were identical in all cases.

So the problem is more likely to be either that:

  • I used the incorrect categorization(s) in my searches on the Unicode website, or
  • the categorization of certain characters has changed between Unicode versions 3.2 and 10.0, or
  • both of the above

Step Numero ໓ (that’s a “LAO DIGIT THREE” U+0ED3)

In order to check the categorizations of the characters, we need to get the exact list of Unicode 3.2 characters. To get the original Unicode 3.2 data files, I went to https://www.unicode.org/Public/3.2-Update/ and grabbed the following two files:

  • UnicodeData-3.2.0.txt (the main data)
  • UnicodeData-3.2.0.html (describes the format of the UnicodeData.txt file)

I used Excel to import the data (SSMS import wizard isn’t very flexible). Using Excel, I was able to first import the data file and split on the ; delimiters on the way in. Then I created a formula that uses CONCATENATE to format an INSERT INTO statement along with some other manipulation, such as prefixing the hex values with “Ox” and specifying NULL for empty fields that should be INT.

  • I specified “delimited”, to start on line 1, and that the file was encoded as “65001: Unicode (UTF-8)” because the newer documentation states that UTF-8 is the proper encoding of the data files. It mentions that “non-ASCII characters only appear in comments” though I don’t see any non-ASCII characters. So, it might be fine to accept the default encoding of “437 : OEM United States”.
  • On the next step I specified that the delimiter is ;
  • On the next / final step I specified that all fields are of type “text”.

Once it imported, I entered the following formula in cell Q2. I started with row 2 because I had added a header row. I used column Q because the data is 15 columns, A through O, and I left column P empty to make it easier to see where the actual data ended:

=(CONCATENATE("INSERT INTO [v3-2].UnicodeCharacterDatabase VALUES (0x",A2,", '",B2,"', '",C2,"', ",D2,", '",E2,"', '",F2,"', ",IF(G2="","NULL",G2),", ",IF(H2="","NULL",H2),", ",IF(I2="","NULL",CONCATENATE("'",I2,"'")),", ", IF(M2="","NULL",CONCATENATE("0x",M2)), ", ",IF(N2="","NULL",CONCATENATE("0x",N2)),", ",IF(O2="","NULL",CONCATENATE("0x",O2)), ");"))

Then I dragged the highlight around cell Q2, by the bottom-right corner, down to cell Q11636 so that it would apply that formula to all of the rows. Then I simply copied and pasted column Q into a SQL script that already included statements to create the Schema and Table. That script (on Pastebin.com) is:

Unicode 3.2, Part 1: Unicode Character Database (UCD)

The script is 1.68 MB and 11,784 lines, so give it a moment to load.

With that data alone we can do some pretty interesting queries, such as:

SELECT *
FROM   [v3-2].UnicodeCharacterDatabase ucd
WHERE  ucd.[TheChar] = N'R' COLLATE Latin1_General_100_CI_AI;
-- 43 rows

That’s right folks, there are 43 variations of the letter “R” (various accents, cases, directions, fonts, etc).

However, that data alone is not enough for the properties that we are looking for.

Step Numero ൪ (that’s a “MALAYALAM DIGIT FOUR” U+0D6A)

I went back to https://www.unicode.org/Public/3.2-Update/ and grabbed the following two files:

  • DerivedProperties-3.2.0.html (describes where to find, or how to determine, various properties)
  • DerivedCoreProperties-3.2.0.txt (lists of code points per various properties)

The “DerivedProperties” HTML file tells us two important things:

  1. we can find the “alphabetic” property in the “DerivedCoreProperties.txt” file
  2. we can determine “numeric type = decimal” from the data that we already imported: just find rows where “DecimalDigitValue”, “DigitValue”, and “NumericValue” all have a value.

Just like with the main UnicodeData file in Step 3, I imported the “DerivedCoreProperties.txt” file into Excel. There are only two columns, so I placed the following formula in cell D1 and applied it to all 4216 lines.

=(IF(OR(A1="",LEFT(A1,1)="#"),"", IF(LEN(TRIM(A1)) = 10, CONCATENATE("OR ucd.[CodePoint] BETWEEN 0x", REPLACE(TRIM(A1), 5, 2, " AND 0x")), CONCATENATE("OR ucd.[CodePoint] = 0x", TRIM(A1))) ))

This formula ignores empty lines and comment lines (i.e. lines starting with #), and then formats the remaining lines to match the specified code points. I then copied and pasted column D (just rows 165 – 573 as they are for the “Alphabetic” property and are not supplementary characters) into a script that added a BIT column for IsAlphabetic to the main UnicodeCharacterDatabase table that was created in Step 3, and did a simple UPDATE to set the matching code points to 1.

The script then adds a BIT column for IsDecimal (we don’t need to worry about the other numeric types, so a boolean will suffice) and does a simple update to set rows where the “DecimalDigitValue”, “DigitValue”, and “NumericValue” columns are all NOT NULL.

That script (on Pastebin.com, and only 22 KB) is:

Unicode 3.2, Part 2: Derived Props (Alpha & Dec)

We now have all of the Unicode 3.2 characters loaded, and have the two additional properties that we were looking for. Woo hoo! We have finally gathered enough data to do some meaningful tests and comparisons.

Step Numero ౫ (that’s a “TELUGU DIGIT FIVE” U+0C6B)

The following two queries compare the T-SQL “first” character list to the “alphabetic” characters, and the T-SQL “subsequent” character list to both the “alphabetic” and “numeric_type = decimal” characters, to find the ones that are on only one side. Hopefully, seeing which characters aren’t matching up will point us in the right direction.

SELECT CASE WHEN fc.[CodePoint] IS NULL THEN 'Missing'
            ELSE 'Extra' END AS [␦],
       COALESCE(fc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(fc.[TheChar], ucd.[TheChar]) AS [TheChar],
       fc.[CodePoint] AS [fc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM dbo.FirstCharacter fc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = fc.[CodePoint]
WHERE (  -- find extra
         fc.[CodePoint] IS NOT NULL
     AND ucd.[IsAlphabetic] IS NULL
       )
OR    (  -- find missing
         fc.[CodePoint] IS NULL
     AND ucd.[IsAlphabetic] = 1
       )
ORDER BY [␦], COALESCE(fc.[CodePoint], ucd.[CodePoint]);
-- Off by 361: 3 extra (expected 2), and 358 missing (unexpected)
-- Extra:      0x0023 (#), 0x005F (_),
--             0xFF3F (_ ; FULLWIDTH LOW LINE ; unexpected)


SELECT CASE WHEN sc.[CodePoint] IS NULL THEN 'Missing'
            ELSE 'Extra' END AS [␦],
       COALESCE(sc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(sc.[TheChar], ucd.[TheChar]) AS [TheChar],
       sc.[CodePoint] AS [sc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM dbo.SubsequentCharacter sc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = sc.[CodePoint]
WHERE (  -- find extra
         sc.[CodePoint] IS NOT NULL
     AND ucd.[IsAlphabetic] IS NULL
     AND ucd.[IsDecimal] IS NULL
       )
OR    (  -- find missing
         sc.[CodePoint] IS NULL
     AND (
          ucd.[IsAlphabetic] = 1
      OR  ucd.[IsDecimal] = 1
         )
       )
ORDER BY [␦], COALESCE(sc.[CodePoint], ucd.[CodePoint]);
-- Off by 315: 295 extra (expected 4), and 20 missing (unexpected)
-- Expected extra: 0x0023 (#), 0x0024 ($), 0x0040 (@), 0x005F (_)

Um, ok. So now we have two problems: not only are the different sources (T-SQL vs Unicode 3.2 data) still not matching up, but now they are off by different amounts than they were based on the online searches. We should probably verify the number of characters in both the “Alphabetic” and “Decimal” categories to see how they match up to what we found online.

SELECT SUM(CONVERT(INT, ucd.[IsAlphabetic])) AS [Alphabetic],
       SUM(CONVERT(INT, ucd.[IsDecimal])) AS [Numeric_Type=Decimal]
FROM   [v3-2].UnicodeCharacterDatabase ucd;
-- Alphabetic    Numeric_Type=Decimal
-- 46050         218

Well, this confirms one of the possible explanations noted at the end of Step 2: properties can change across versions of Unicode. When Unicode version 3.2 was published, there were 46,050 characters classified as “Alphabetic”. As of version 10.0 of Unicode (the version used by the online tool), out of that same pool of characters that were available as of version 3.2, now 46,153 of them are classified as “Alphabetic”. That is a net increase of 103 characters in that classification (I say “net” because some characters may have been removed from the classification).

Similarly, when Unicode version 3.2 was published, there were 218 characters classified as “Numeric_Type=Decimal”. In Unicode version 10.0, out of that same pool of characters that were available as of version 3.2, now only 198 of them are classified as “Numeric_Type=Decimal”. That is a decrease of 20 characters in that classification.

We will take a closer look at various differences in classifications of the same character across versions of Unicode later. For now, it is clear that specifying [:Age=3.2:] in the online search is merely a filter to get the characters available as of that version. It does not imply that any other property being filtered on will use values as they were back in version 3.2; the property values being used are the current 10.0 values (or whatever version is indicated at the bottom of the “UnicodeSet” page). Meaning, the online search cannot be used for any historical research outside of simply seeing which characters were available as of a particular version of Unicode. All historical research related to behavior can only be done via the original data files.

Intermission

Please join us next time for the exciting conclusion (including a link to the complete list of valid characters)…

4 thoughts on “The Uni-Code: The Search for the True List of Valid Characters for T-SQL Regular Identifiers, Part 1”

Thoughts? Questions? Comments? Suggestions? Words of praise?

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s