(last updated: 2018-04-06)
Rules for Regular Identifiers
The question that I’m trying to answer is: what are the valid “letters” and “decimal numbers” from other national scripts?
I tried using the online research tool “UnicodeSet”, but the results it gave (using the “alphabetic” and “numeric_type = decimal” properties) were slightly different from what I discovered SQL Server actually accepts.
I then loaded the actual Unicode 3.2 data files only to find that the number of characters having either the “alphabetic” or “numeric_type = decimal” properties was different from both the online search results and what SQL Server actually accepts.
Step Numero ৬ (that’s a “BENGALI DIGIT SIX” U+09EC)
The good news is that we have confirmed that we are on the correct path regarding using the original Unicode 3.2 data for the research. But, we still need to find the correct classification (because the list of valid characters returned by the T-SQL tests is certainly not arbitrary).
In attempting to import the initial data file in Step 3, I got a closer look at the data and the documentation for it. I found that filtering on the “GeneralCategory” column could be quite useful (and in fact, I have used those general categories before in RegEx patterns looking for “upper-case letters”, or “punctuation”, etc).
The “GeneralCategory” includes various types of letters (upper-case, lower-case, etc) and various types of numbers (decimal, letterlike, etc). This looks like it will make things easier. We just want all letter types (seems to fit better with the wording of the Microsoft documentation for Identifiers anyway) and decimal numbers. The following two queries are similar to the previous two queries except they are filtering the Unicode data on the “GeneralCategory” column, using:
LIKE 'L%' (to get all letter types) instead of [IsAlphabetic] = 1
[GeneralCategory] = 'Nd' instead of [IsDecimal] = 1
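As a cross-check outside of SQL Server, the same two filters can be expressed with Python’s unicodedata module (keep in mind that Python ships a much newer version of the UCD than 3.2, so the exact character counts will differ):

```python
import unicodedata

def is_letter(ch):
    # All letter categories start with "L": Lu, Ll, Lt, Lm, Lo
    return unicodedata.category(ch).startswith("L")

def is_decimal_digit(ch):
    # "Nd" is the Decimal_Number category
    return unicodedata.category(ch) == "Nd"

print(is_letter("A"))              # True  (Lu)
print(is_decimal_digit("\u09EC"))  # True  (BENGALI DIGIT SIX)
print(is_letter("\u216B"))         # False (ROMAN NUMERAL TWELVE is "Nl", not "L*")
```

Note the last line: a letterlike numeral such as ROMAN NUMERAL TWELVE is in category “Nl”, so a LIKE 'L%' filter misses it, which foreshadows the discrepancy below.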
SELECT CASE WHEN fc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(fc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(fc.[TheChar], ucd.[TheChar]) AS [TheChar],
       fc.[CodePoint] AS [fc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.FirstCharacter fc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = fc.[CodePoint]
WHERE  ( -- find extra
         fc.[CodePoint] IS NOT NULL
         AND ucd.[GeneralCategory] NOT LIKE 'L%'
       )
OR     ( -- find missing
         fc.[CodePoint] IS NULL
         AND ucd.[GeneralCategory] LIKE 'L%'
       )
ORDER BY [␦], COALESCE(fc.[CodePoint], ucd.[CodePoint]);
-- Off by 55: all extra (expected 2)
-- Expected extra: 0x0023 (#), 0x005F (_)

SELECT CASE WHEN sc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(sc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(sc.[TheChar], ucd.[TheChar]) AS [TheChar],
       sc.[CodePoint] AS [sc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.SubsequentCharacter sc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = sc.[CodePoint]
WHERE  ( -- find extra
         sc.[CodePoint] IS NOT NULL
         AND ucd.[GeneralCategory] NOT LIKE 'L%'
         AND ucd.[GeneralCategory] <> 'Nd'
       )
OR     ( -- find missing
         sc.[CodePoint] IS NULL
         AND (   ucd.[GeneralCategory] LIKE 'L%'
              OR ucd.[GeneralCategory] = 'Nd' )
       )
ORDER BY [␦], COALESCE(sc.[CodePoint], ucd.[CodePoint]);
-- Off by 705: all extra (only expected 4)
-- Expected extra: 0x0023 (#), 0x0024 ($), 0x0040 (@), 0x005F (_)
The query for “first” characters got much closer, but the query for “subsequent” characters is now much farther off, so we are still missing something. However, with the “first” characters being so close, I figured I might be able to take one of the “extra” characters and find out whether any other derived properties include it. There are several derived properties, and so far I have only looked at “Alphabetic”. If that character does show up in a few other derived property lists, then I can take another one of the extra characters, see which derived property list(s) it is in, and eliminate any derived property that does not include all of them, which should (hopefully) leave me with the correct property (or at worst just 2 or 3).
Step Numero ૭ (that’s a “GUJARATI DIGIT SEVEN” U+0AED)
I went back to the “DerivedCoreProperties.txt” file, and the first property I came across (that contained one of the “extra” characters) was “ID_Start”. I had seen this before in the property list for the online search, but never paid much attention to it. This time I paid attention. The brief description of this property in the text file is:
# Characters that can start an identifier.
# Generated from Lu+Ll+Lt+Lm+Lo+Nl
Seriously? They even use the word “identifier”!?!? And look, the property is derived from all letters (as I had tried) plus letterlike numeric characters, which I had not tried and could account for the 53 extra characters in the T-SQL test.
And it gets better: the next derived property is “ID_Continue”, and its brief description is:
# Characters that can continue an identifier.
# Generated from: ID_Start + Mn+Mc+Nd+Pc
# NOTE: Cf characters should be filtered out.
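Those two 3.2-era derivations are simple enough to sketch directly from the General Categories (again using Python’s unicodedata, so the underlying data is newer than 3.2; current Unicode versions also add the Other_ID_Start / Other_ID_Continue sets and subtract Pattern_Syntax / Pattern_White_Space):

```python
import unicodedata

# Unicode 3.2 derivations, per the comments in DerivedCoreProperties.txt
ID_START_CATEGORIES    = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def id_start(ch):
    return unicodedata.category(ch) in ID_START_CATEGORIES

def id_continue(ch):
    return unicodedata.category(ch) in ID_CONTINUE_CATEGORIES

print(id_start("\u216B"))     # True:  ROMAN NUMERAL TWELVE is "Nl"
print(id_start("#"))          # False: "#" is punctuation ("Po")
print(id_continue("_"))       # True:  LOW LINE is a connector ("Pc")
print(id_continue("\u09EC"))  # True:  BENGALI DIGIT SIX is "Nd"
```

The inclusion of “Nl” in the start set is exactly the piece missing from the LIKE 'L%' attempt above.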
Very interesting indeed. Fortunately, it should be fairly easy to add these two properties to the UnicodeCharacterDatabase table since it doesn’t require getting anything new. I went back to the Excel file I created in Step 4, which had column D formatted for all properties. This time I grabbed rows 1485 – 1776 (BMP code points for “ID_Start”). I have a script that first adds a BIT column for IDStart to the UnicodeCharacterDatabase table, and then does an UPDATE using the rows I just grabbed from Excel as the WHERE clause. I repeated that pattern for a new IDContinue column, using rows 1825 – 2293 (BMP code points for “ID_Continue”) from column D of the Excel file.
That script (on Pastebin.com, and only 36 KB) is:
Ok. So now, (again!) we should have all of the pieces of the puzzle collected to finally solve this mystery.
Step Numero ፰ (that’s an “ETHIOPIC DIGIT EIGHT” U+1370)
We will use the same basic queries as the previous two attempts, except this time we will use the new IDStart column instead of either [IsAlphabetic] = 1 or [GeneralCategory] LIKE 'L%', and the new IDContinue column instead of either [IsDecimal] = 1 or [GeneralCategory] = 'Nd'.
Don’t want to get too excited….
So, executing the following two queries:
SELECT CASE WHEN fc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(fc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(fc.[TheChar], ucd.[TheChar]) AS [TheChar],
       fc.[CodePoint] AS [fc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.FirstCharacter fc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = fc.[CodePoint]
WHERE  ( -- find extra
         fc.[CodePoint] IS NOT NULL
         AND ucd.[IDStart] IS NULL
       )
OR     ( -- find missing
         fc.[CodePoint] IS NULL
         AND ucd.[IDStart] = 1
       )
ORDER BY [␦], COALESCE(fc.[CodePoint], ucd.[CodePoint]);
-- Off by 3: all extra (only expected 2)
-- Extra: 0x0023 (#), 0x005F (_), 0xFF3F (＿ ; unexpected)

SELECT CASE WHEN sc.[CodePoint] IS NULL THEN 'Missing' ELSE 'Extra' END AS [␦],
       COALESCE(sc.[CodePoint], ucd.[CodePoint]) AS [CodePoint],
       COALESCE(sc.[TheChar], ucd.[TheChar]) AS [TheChar],
       sc.[CodePoint] AS [sc],
       ucd.[CodePoint] AS [ucd],
       '---' AS [---],
       ucd.*
FROM   dbo.SubsequentCharacter sc
FULL JOIN [v3-2].[UnicodeCharacterDatabase] ucd
       ON ucd.[CodePoint] = sc.[CodePoint]
WHERE  ( -- find extra
         sc.[CodePoint] IS NOT NULL
         AND ucd.[IDContinue] IS NULL
       )
OR     ( -- find missing
         sc.[CodePoint] IS NULL
         AND ucd.[IDContinue] = 1
       )
ORDER BY [␦], COALESCE(sc.[CodePoint], ucd.[CodePoint]);
-- Off by 29: all extra (only expected 3)
-- Expected extra: 0x0023 (#), 0x0024 ($), 0x0040 (@)
results in a nearly perfect match in both cases!! The first query, for “first” characters, shows nothing missing, and only 3 characters found in T-SQL that do not have the “ID_Start” property (remember, we aren’t concerned with the at sign (@) since it can only be used as a “first” character for variables / parameters). Two of those characters are noted in the Microsoft documentation: the number sign (#) and the underscore (_ ; officially named “LOW LINE”). That leaves only ONE character unaccounted for: 0xFF3F (＿ ; “FULLWIDTH LOW LINE”). Well, that makes enough sense, I suppose: we already knew that the underscore / low line character was valid, and this just happens to be the fullwidth form of that character. We can even verify that with the following query:
SELECT * FROM [v3-2].UnicodeCharacterDatabase ucd WHERE ucd.[TheChar] = N'_' COLLATE Latin1_General_100_CI_AI;
That returns two rows: one for 0x005F “LOW LINE”, and one for 0xFF3F “FULLWIDTH LOW LINE”. Even better: the CharacterDecompositionMapping column of the “FULLWIDTH LOW LINE” row contains “<wide> 005F”, which points back to the code point for “LOW LINE”.
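Python’s unicodedata module exposes that same decomposition mapping (from its current copy of the UCD), so the relationship can be double-checked outside of SQL Server:

```python
import unicodedata

wide_low_line = "\uFF3F"
print(unicodedata.name(wide_low_line))           # FULLWIDTH LOW LINE
print(unicodedata.decomposition(wide_low_line))  # <wide> 005F

# NFKC normalization applies that compatibility decomposition,
# folding the fullwidth form back into a plain underscore:
print(unicodedata.normalize("NFKC", wide_low_line) == "_")  # True
```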
“First” characters: SOLVED!
Now, the second query, for “subsequent” characters, is a little trickier. Those results show that while, again, nothing is missing, this time there are 29 extra characters that SQL Server accepts that do not have the “ID_Continue” property. Only three of those characters are noted in the Microsoft documentation: the number sign (#), the dollar sign ($), and the at sign (@). Neither the underscore (a.k.a. “low line”, which is the fourth special character for “subsequent” characters mentioned in the documentation) nor the fullwidth underscore is in the results, but that just means that they have the “ID_Continue” property and don’t need to be mentioned separately. But what about the remaining 26 extra characters?
Those remaining 26 characters all have a common property: a “GeneralCategory” value of “Cf”. A value of “Cf” means “format control character”, and since there are only 26 characters in this category, this is all of them. Interestingly enough, the “DerivedProperties.html” file that we looked at in Step 4 has a note in the “Definition” column for the “ID_Continue” property indicating that there is some “Cf”-specific information. That “Cf Note” states:
|The general category Cf characters are not included in ID_Continue nor in XID_Continue; they should continue identifiers, but be filtered out of the result.|
Meaning, these 26 characters can affect the visible display of the string, so they should be allowed in such a way as to not cause an error, but they should also not be used to provide meaningful distinction between identifiers. As far as my limited testing of them indicates, this is mostly how they are being handled. I did find one case where they do allow for distinguishing between two identifiers that are otherwise the same: when using a binary Collation. Either way, the Unicode Standard is a recommendation, and Microsoft implemented it in the way that they did.
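The filtering behavior that note describes can be sketched in a few lines of Python; the helper name strip_format_chars is mine, not anything from SQL Server or the Standard:

```python
import unicodedata

def strip_format_chars(identifier):
    # Remove General_Category "Cf" (format control) characters, per the
    # UCD note that they "should continue identifiers, but be filtered
    # out of the result"
    return "".join(c for c in identifier if unicodedata.category(c) != "Cf")

# U+200D ZERO WIDTH JOINER is one of the "Cf" characters
a = "col\u200Dumn"
b = "column"
print(a == b)                                          # False: raw strings differ
print(strip_format_chars(a) == strip_format_chars(b))  # True: same identifier
```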
“Subsequent” characters: SOLVED!
And with that, I present to you, the grand, all-encompassing, list of all valid characters for T-SQL regular Identifiers:
Completely Complete List of Valid T-SQL Identifier Characters (please give the page a moment to load; it’s 3.5 MB and almost 47k lines)
Well, it looks like that’s it folks. Time to pat ourselves on the back and call it a day (or several). Everything wrapped up nice and neatly, right? Right. Ok, goodb…wait, wasn’t there a question from Part 1? Something about how / why the Unicode 3.2 classifications didn’t match the Unicode 10.0 classifications, even though we were only looking at the characters that were available as of 3.2 (so they should be the same characters)?
Come on. By now everyone should be conditioned by TV and movies to expect that the story isn’t over when it first seems to reach its conclusion. 😼 🙃
Step Numero ੯ (that’s a “GURMUKHI DIGIT NINE” U+0A6F)
At this point we should take a quick peek behind the green curtain, so-to-speak, and see how some of this stuff works (or should work).
Online Search Glitch
First, when using the online search, please be aware that when filtering on Age=x.y to only look at characters available in Unicode version x.y, it seems that there is a bug and the results will still include version 10.0 characters. For example, if you search on BMP characters as of Unicode 3.2, you get back a list of 58,782 code points. If you also group by “age”, you can scroll to the end (or just search on the page for “Age=10”) to see that there are 57 code points for Unicode 10.0, even though there is nothing else after the 3.2 characters outside of that small group. The following link will take you to those results:
Going back to the [v3-2].UnicodeCharacterDatabase table that we created in Step 3, we should be able to see how many characters were truly available as of Unicode 3.2. That table currently does not have all of the characters in it: there are four ranges of characters (surrogates and private use) that we did not expand earlier, as they wouldn’t match anything anyway. But we can do that now to get an accurate count:
;WITH cte AS
(
    SELECT TOP (70000) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS [num]
    FROM   master.sys.objects c1
    CROSS JOIN master.sys.columns c2
    ORDER BY [num]
)
INSERT INTO [v3-2].UnicodeCharacterDatabase
    ([CodePoint], [CharacterName], [GeneralCategory], [CanonicalCombiningClasses],
     [BidirectionalCategory], [CharacterDecompositionMapping])
SELECT cte.[num], ranges.[name], ranges.[category], 0, 'L', ''
FROM   cte
INNER JOIN (VALUES
              ('<Non Private Use High Surrogate>', 0xD801, 0xDB7E, 'Cs'),
              ('<Private Use High Surrogate>',     0xDB81, 0xDBFE, 'Cs'),
              ('<Low Surrogate>',                  0xDC01, 0xDFFE, 'Cs'),
              ('<Private Use>',                    0xE001, 0xF8FE, 'Co')
           ) ranges([name], [start], [end], [category])
       ON cte.[num] BETWEEN ranges.[start] AND ranges.[end];

SELECT COUNT(*) AS [TotalCodePoints]
FROM   [v3-2].UnicodeCharacterDatabase db;
That returns a value of 58,725. That value just happens to be 57 less than the number of code points returned by the online search, which in turn just happens to be the number of code points in the “Age=10.0” group at the bottom of the online results.
The core documentation for the base data, Unicode® Standard Annex #44: Unicode Character Database (Revision 20 for Unicode 10.0), provides some useful insights (as one would hope / expect) into what is, and is not, causing the behavior that we are experiencing.
In section 2.1.2 Derived Properties it states:
|Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.
Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time.|
For example, the Unicode 3.2 “DerivedCoreProperties.txt” file contains, for the “ID_Start” and “ID_Continue” properties, respectively, the following:
# Generated from Lu+Ll+Lt+Lm+Lo+Nl
# Generated from: ID_Start + Mn+Mc+Nd+Pc
However, the Unicode 10.0 file contains, for the “ID_Start” and “ID_Continue” properties, respectively, the following:
# Generated from:
#   Lu + Ll + Lt + Lm + Lo + Nl
#   + Other_ID_Start
#   - Pattern_Syntax
#   - Pattern_White_Space

# Generated from:
#   ID_Start
#   + Mn + Mc + Nd + Pc
#   + Other_ID_Continue
#   - Pattern_Syntax
#   - Pattern_White_Space
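For comparison, Python’s str.isidentifier() applies the closely related XID_Start / XID_Continue properties (per PEP 3131, which also adds “_” as an allowed start character), so a current copy of Python demonstrates the same shape of rule:

```python
# str.isidentifier() is built on XID_Start / XID_Continue (PEP 3131)
print("x\u09EC".isidentifier())  # True:  an "Nd" digit can continue an identifier
print("\u09ECx".isidentifier())  # False: but it cannot start one
print("#x".isidentifier())       # False: "#" has neither property
```

(SQL Server, of course, layers its own exceptions, such as # and $, on top of the base Unicode properties.)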
This is why I followed their recommendation and imported the list of code points for those two properties, rather than doing a simple UPDATE based on the stated derivations.
In section 2.3.1 Changes to Properties Between Releases it states:
|While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them.|
In section 5.10 Invariants it states:
|Property values in the UCD may be subject to correction in subsequent versions of the standard, as errors are found…However, some property values and some aspects of the file formats are considered invariant.|
In section 5.10.1 Character Property Invariants it states:
|Some character properties are simply considered immutable: once assigned, they are never changed.|
The “Age” property is listed as immutable, but the “General_Category” property is not.
In that same section it goes on to state:
|In Unicode 4.0 and thereafter, the General_Category value Decimal_Number (Nd), and the Numeric_Type value Decimal (de) are defined to be co-extensive; that is, the set of characters having General_Category=Nd will always be the same as the set of characters having Numeric_Type=de.|
This explains why the online search returns the same set for both of those properties, yet the data we imported for Unicode 3.2 returns 20 more characters for IsDecimal = 1 than it does for [GeneralCategory] = 'Nd'.
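As a quick sanity check of that “co-extensive” guarantee, a hedged sketch in Python (whose bundled unicodedata module carries a modern copy of the UCD, not the 3.2 data we imported) can compare the two definitions directly across the BMP; treating a ValueError from unicodedata.decimal() as “no Numeric_Type=Decimal value” is my assumption here:

```python
import unicodedata

def has_decimal_numeric_type(ch):
    # unicodedata.decimal() returns a value only for characters with a
    # decimal digit value (i.e. Numeric_Type=Decimal); otherwise it raises
    try:
        unicodedata.decimal(ch)
        return True
    except ValueError:
        return False

# Compare the two sets across the BMP using Python's (modern) UCD data
nd_set      = {cp for cp in range(0x10000)
               if unicodedata.category(chr(cp)) == "Nd"}
decimal_set = {cp for cp in range(0x10000)
               if has_decimal_numeric_type(chr(cp))}
print(nd_set == decimal_set)
```

With Unicode 4.0-or-later data the two sets should be identical, which is exactly why the same comparison against 3.2 data came out 20 characters apart.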
There is an entire technical document on “Identifiers”, Unicode® Standard Annex #31: UNICODE IDENTIFIER AND PATTERN SYNTAX, which states that across versions of the Unicode Standard, characters can indeed be reclassified, within certain parameters. Here is a summary of section 1.1 Stability:
- ID_Start can gain characters over time from:
- re-classification of ID_Nonstart characters (ID_Continue – ID_Start)
- re-classification of existing characters currently classified as “Other”
- characters added in the future
- ID_Continue can gain characters over time from:
- existing characters currently classified as “Other”
- characters added in the future
- Once a character is classified as “ID_Start”, it won’t go back
Given that the “ID_Start” property is derived from which characters are in several of the “General Categories”, and categorization might change over time, the mechanism that allows for that last point regarding “once ID_Start, always ID_Start” is described in section 2.5 Backward Compatibility, which states:
|…the Other_ID_Start property provides a small list of characters that qualified as ID_Start characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version. These are called grandfathered characters.
Similarly, the Other_ID_Continue property adds a small list of characters that qualified as ID_Continue characters in some previous version of Unicode solely on the basis of their General_Category properties, but that no longer qualify in the current version.|
You can see the “Other_ID_Start” and “Other_ID_Continue” properties in the Unicode 10.0 definitions of the “ID_Start” and “ID_Continue” properties in the UCD Documentation section directly above.
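Python makes it easy to spot one of these grandfathered characters, since str.isidentifier() is built on XID_Start (which inherits Other_ID_Start):

```python
import unicodedata

# U+2118 SCRIPT CAPITAL P (℘) is in Other_ID_Start: its General_Category
# is "Sm" (math symbol), so it no longer qualifies via categories alone,
# yet it remains a valid identifier start character
ch = "\u2118"
print(unicodedata.category(ch))   # Sm
print((ch + "x").isidentifier())  # True, thanks to Other_ID_Start
```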
And while it does not impact changes to classifications over time, since we are looking at the documentation for Identifiers, the next section, 1.2 Customization, states:
|…different programming languages have different conventions for the use of certain characters such as $, @, #, and _ in identifiers.|
Interesting four characters used in that example, eh?