SQL Server Collations: What does “CP1” mean in “SQL_Latin1_General_CP1_CI_AS”?


(last updated: 2021-06-28 @ 02:30 EST / 2021-06-28 @ 06:30 UTC )

Many of us that work with Microsoft SQL Server, especially those of us working on systems with US English as the OS language, or anyone working with SQL Server Express LocalDB (commonly referred to as just “LocalDB”), have had to deal with the SQL_Latin1_General_CP1_CI_AS collation.

Collation names, at least in SQL Server, are structured in that they contain various pieces of information detailing some of the properties of the collation. The collation name of SQL_Latin1_General_CP1_CI_AS can be broken down as follows:

  • SQL_ This indicates that the collation is a SQL Server collation, while names without this prefix indicate Windows collations
  • Latin1_General This is the culture (a.k.a. locale) used for the sorting and comparison rules
  • CP1 This is the 8-bit [C]ode [P]age
  • CI This indicates [C]ase- [S]ensitive or [I]nsensitive
  • AS This indicates [A]ccent- [S]ensitive or [I]nsensitive

Ok, but there’s no code page "1". So, what’s going on there?

TL; DR: Skip to “Summary”

Look a Little Deeper

Let’s take a closer look at the actual code page values for each name:

SELECT col.[name],
       SUBSTRING(col.[name],
                 5,
                 ISNULL(NULLIF(CHARINDEX(N'_Pref', col.[name]), 0),
                        CHARINDEX(N'_CP', col.[name]))
                            - 5) AS [Locale / Culture],
       SUBSTRING(col.[name],
                 CHARINDEX(N'_CP', col.[name]),
                 PATINDEX(N'%[_][BC][IS]%', col.[name]) -
                     CHARINDEX(N'_CP',
                 col.[name]) + 1) AS [CP from name],
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage]
FROM   sys.fn_helpcollations() col
WHERE  col.[name] LIKE N'SQL[_]%'
ORDER BY [CodePage], col.[name];
-- 77 rows

The query shown above returns the result set shown below (I removed approximately 37 rows that didn’t show any meaningful variation):

                                  Locale /         CP          Code
name                              Culture          from name   Page
-------------------------------   --------------   ---------   ----
SQL_Latin1_General_CP437_CI_AS    Latin1_General   _CP437_      437
SQL_1xCompat_CP850_CI_AS          1xCompat         _CP850_      850
SQL_AltDiction_CP850_CI_AI        AltDiction       _CP850_      850
SQL_Latin1_General_CP850_BIN      Latin1_General   _CP850_      850
SQL_Scandinavian_CP850_CS_AS      Scandinavian     _CP850_      850
SQL_Croatian_CP1250_CS_AS         Croatian         _CP1250_    1250
SQL_Czech_CP1250_CS_AS            Czech            _CP1250_    1250
SQL_Hungarian_CP1250_CS_AS        Hungarian        _CP1250_    1250
SQL_Latin1_General_CP1250_CS_AS   Latin1_General   _CP1250_    1250
SQL_Polish_CP1250_CI_AS           Polish           _CP1250_    1250
SQL_Romanian_CP1250_CI_AS         Romanian         _CP1250_    1250
SQL_Slovak_CP1250_CS_AS           Slovak           _CP1250_    1250
SQL_Slovenian_CP1250_CS_AS        Slovenian        _CP1250_    1250
SQL_Latin1_General_CP1251_CS_AS   Latin1_General   _CP1251_    1251
SQL_Ukrainian_CP1251_CI_AS        Ukrainian        _CP1251_    1251
SQL_Danish_Pref_CP1_CI_AS         Danish           _CP1_       1252
SQL_EBCDIC037_CP1_CS_AS           EBCDIC037        _CP1_       1252
SQL_EBCDIC1141_CP1_CS_AS          EBCDIC1141       _CP1_       1252
SQL_EBCDIC273_CP1_CS_AS           EBCDIC273        _CP1_       1252
SQL_EBCDIC277_2_CP1_CS_AS         EBCDIC277_2      _CP1_       1252
SQL_EBCDIC277_CP1_CS_AS           EBCDIC277        _CP1_       1252
SQL_EBCDIC278_CP1_CS_AS           EBCDIC278        _CP1_       1252
SQL_EBCDIC280_CP1_CS_AS           EBCDIC280        _CP1_       1252
SQL_EBCDIC284_CP1_CS_AS           EBCDIC284        _CP1_       1252
SQL_EBCDIC285_CP1_CS_AS           EBCDIC285        _CP1_       1252
SQL_EBCDIC297_CP1_CS_AS           EBCDIC297        _CP1_       1252
SQL_Icelandic_Pref_CP1_CI_AS      Icelandic        _CP1_       1252
SQL_Latin1_General_CP1_CI_AI      Latin1_General   _CP1_       1252
SQL_SwedishPhone_Pref_CP1_CI_AS   SwedishPhone     _CP1_       1252
SQL_SwedishStd_Pref_CP1_CI_AS     SwedishStd       _CP1_       1252
SQL_AltDiction2_CP1253_CS_AS      AltDiction2      _CP1253_    1253
SQL_Latin1_General_CP1253_CI_AI   Latin1_General   _CP1253_    1253
SQL_MixDiction_CP1253_CS_AS       MixDiction       _CP1253_    1253
SQL_Latin1_General_CP1254_CS_AS   Latin1_General   _CP1254_    1254
SQL_Latin1_General_CP1255_CI_AS   Latin1_General   _CP1255_    1255
SQL_Latin1_General_CP1256_CS_AS   Latin1_General   _CP1256_    1256
SQL_Estonian_CP1257_CI_AS         Estonian         _CP1257_    1257
SQL_Latin1_General_CP1257_CS_AS   Latin1_General   _CP1257_    1257
SQL_Latvian_CP1257_CI_AS          Latvian          _CP1257_    1257
SQL_Lithuanian_CP1257_CS_AS       Lithuanian       _CP1257_    1257

From those results we can clearly see that CP1 is code page 1252 (officially known as “Windows-1252”).

Um, ok, but why is it "CP1" in the collation name instead of "CP1252"? It’s the only one that doesn’t fit the pattern. Why is it special?

Clue # אֶחָד (Echad) ( 1 )

By querying INFORMATION_SCHEMA.COLUMNS we can see the character set name (this is the name associated with the code page):

DECLARE @CP1 TABLE ([_CollationCodePageTest_] VARCHAR(50)
                        COLLATE SQL_Latin1_General_CP1_CI_AS);

SELECT col.[TABLE_NAME], col.[CHARACTER_SET_NAME], col.[COLLATION_NAME]
FROM   tempdb.INFORMATION_SCHEMA.COLUMNS col
WHERE  col.[COLUMN_NAME] = N'_CollationCodePageTest_';
/*
TABLE_NAME     CHARACTER_SET_NAME     COLLATION_NAME
#B03667F2      iso_1                  SQL_Latin1_General_CP1_CI_AS
*/

And, if your instance is using a CP1 collation as the instance (i.e. server) -level collation (and sadly that is the only option for SQL Server Express LocalDB), then you can use the following query to also see the sort order name:

SELECT SERVERPROPERTY('Collation') AS [Collation],
       SERVERPROPERTY('SqlCharSetName') AS [SqlCharSetName],
       SERVERPROPERTY('SqlSortOrderName') AS [SqlSortOrderName];

/*
Collation                       SqlCharSetName    SqlSortOrderName
SQL_Latin1_General_CP1_CI_AS    iso_1             nocase_iso
*/

Well, that’s not very helpful. Or, maybe it is? I mean, the "1" matches, but the "iso" part doesn’t really narrow it down. We can safely assume that this does not mean the actual ISO 1 standard: “Geometrical product specifications”. But, the International Organization for Standardization (i.e. ISO) currently maintains 50 active standards within the "Coding of Character Sets" category.

Also, all of the other "SqlCharSetName" values match the following pattern:

  1. Non-Unicode code pages: "cp{code_page_number}" (i.e. same as in the collation name, just with lower-case "cp")
  2. UTF-8 code pages (starting in SQL Server 2019): "utf8"
  3. Unicode-only collations: "<NULL>"

It is possible, however, that "iso" refers to the ISO/IEC 8859 series of encodings, given that “Part 1” of that series is “Latin-1”. The Wikipedia article for the “Windows-1252” encoding states:

This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters include curly quotation marks and all the printable characters that are in ISO 8859-15 (at different places than ISO 8859-15). It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".

It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in word-processing software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.

For anyone who’s curious, here’s an example of the confusion surrounding these two character sets / encodings / code pages. It’s a W3C discussion concerning their HTML validator from May of 2001: RE: iso-8859-1-Windows-3.1-Latin-1.

Could it be that Microsoft mislabeled “Windows-1252” (their own encoding) with “ISO-8859-1”?

Clue Numero Dos ( 2 )

Part A

Recently, while researching another topic, I came across the following SQL Server 7.0 documentation (published in 1998):

Code Page 1252 (ISO Character Set)

Code page 1252 (ISO character set) is the default character set. It is also known as the ISO 8859-1, Latin 1, or ANSI character set. It is compatible with the ANSI characters used by the Microsoft® Windows NT® and Microsoft Windows® operating systems.

( Download self-extracting zip file containing the SQL Server 7.0 Books Online documentation )

Part B

I also found the following in the SQL Server 7.0 documentation:

Sort Order IDs

1252 ISO character set

ID   Name             Description
--   ---------------  ----------------------------------
51   dictionary_iso   Dictionary order, case-sensitive
52   nocase_iso       Dictionary order, case-insensitive
...
71   dictionary_1252  Latin-1 case-sensitive
72   nocase_1252      Latin-1 case-insensitive

The confusion is more appearant here than anywhere else: the group label equates "1252" and "iso", so why would there be separate Sort Order IDs for them?

But wait, there’s more! I found the following in the SQL Server 2000 documentation (please note that the right column in this first listing is mislabeled as it’s really the description, not the name):

Selecting a SQL Collation

Sort
order ID   SQL collation name
--------   --------------------------------------------------------------------
...
51         Dictionary order, case-sensitive, for use with 1252 character set.
52         Dictionary order, case-insensitive, for use with 1252 character set.
...
71         Latin-1 case-sensitive, for use with 1252 character set.
72         Latin-1 case-insensitive, for use with 1252 character set.
...

Notice how Sort Order IDs 51 and 52 are now "1252" instead of "iso". It also appears that Sort Order IDs 71 and 72 are duplicates of 51 and 52, respectively.

Part C

In Parts “A” and “B” we looked at documentation. But what does SQL Server have to say? Let’s ask:

-- SQL Server 2000:
SELECT *,
       CONVERT(TINYINT,
               RTRIM(SUBSTRING(col.[description],
                               CHARINDEX(N'Order', col.[description]) + 6,
                               3))) AS [SortID]
FROM   ::fn_helpcollations() col
WHERE  col.[name] LIKE N'SQL%'
ORDER BY [SortID];


-- -- SQL Server 2005 and newer:
-- SELECT *, COLLATIONPROPERTY(col.[name], 'sortid') AS [SortID]
-- FROM   sys.fn_helpcollations() col
-- WHERE  COLLATIONPROPERTY(col.[name], 'sortid') <> 0
-- ORDER BY [SortID];

returns:

...
SQL_Latin1_General_CP1_CS_AS	..., SQL Server Sort Order 51 on Code Page 1252 for non-Unicode Data
SQL_Latin1_General_CP1_CI_AS	..., SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data
...

As we suspected, the "CP1" collations are associated with Sort Order IDs that are now labeled as "1252", but were previously labeled as "iso".

ALSO:

  1. This (i.e. SQL Server 2000 and its documentation) is the introduction of "CP1", which almost appears to be a compromise between switching fully over to what should have been "CP1252" (to match all of the others) and keeping some part of "iso_1" (because that’s still how it’s labeled internally?).
  2. There are no collations that use Sort Order IDs 71 and 72. Those only exist in the documentation, not within SQL Server itself!

And finally:

SELECT [description]
FROM   dbo.syscharsets
WHERE  [name] = N'iso_1';

returns:

ISO 8859-1 (Latin-1) - Western European 8-bit character set.

A + B + C =

So, there we have it. Microsoft did, in fact, mislabel “Windows-1252” as being “ISO-8859-1”. And also as being “ANSI” (more on that in a moment). But, why, why, why?

Confirmation

Looking around the interwebs, I found a few sources of explanation. Or, more likely, a few variations of one source.

Oldest

In checking one of the references on that “Windows-1252” page on Wikipedia, I found the following quote which explains how this mislabeling came to be:

The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see “ANSI” or “A” used to signify Windows code page support.
(Cathy Wissink, “Unicode and Windows XPPDF via archive.org, May 2002, Page 1)

Recent

A Microsoft documentation page that’s no longer on the live site, Glossary of Terms Used on this Site (archived on 2018-12-08), has similar wording:

ANSI: Acronym for the American National Standards Institute. The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. “ANSI applications” are usually a reference to non-Unicode or code page–based applications.

The italicized portion above is identical to what Cathy Wissink said back in 2002. The earliest capture for that page is from March of 2009:
Glossary of Terms Used on this Site (archived on 2009-03-09)

While I’m sure that the page existed in some form prior to 2009, it certainly seems like Cathy is the source of that statement.

Current

The active Microsoft documentation page, Internationalization: Code Pages (current as of 2021-05-31), contains the following note:

Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows code page 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.

The earliest capture for that page is from May of 2008:
International Features: Code Pages (archived on 2008-05-05)

This appears to just be a rewording of what Cathy wrote (back in 2002).

Bonus Round: ANSI

Both the SQL Server 7.0 documentation (two sections above) and Cathy Wissink (section directly above) mention “ANSI”, the American National Standards Institute. This is another term that is misused within Microsoft / Windows / SQL Server, and in two different ways (how fun is that!?).

First, “ANSI” is sometimes used to mean specifically “Windows-1252”. This can be seen in the SQL Server 7.0 documentation quoted previously, as well as in the documentation for some SQL Server features / utilities such as the ACP ([A]NSI [C]ode [P]age) code page value for:

(but not SQLCMD.exe Utility).

For all three of those, the documentation describes ACP as:

ANSI/Microsoft Windows (ISO 1252)

Of course, there is no ISO 1252 standard. Which brings us to the second way in which the term “ANSI” is (mis)used. In a more general sense, it refers to the following list of code pages:

  1. 874 ( Thai )
  2. 932 ( Japanese )
  3. 936 ( Chinese {simplified} )
  4. 949 ( Korean )
  5. 950 ( Chinese {traditional} )
  6. 951 ( Chinese {traditional} ) Note: only found this listed in one place
  7. 1250 ( Latin 2 / Central European )
  8. 1251 ( Cyrillic )
  9. 1252 ( Latin 1 / Western European )
  10. 1253 ( Greek )
  11. 1254 ( Turkish )
  12. 1255 ( Hebrew )
  13. 1256 ( Arabic )
  14. 1257 ( Baltic )
  15. 1258 ( Vietnamese )
  16. 1361 ( Korean {Johab} ) Note: only found this listed on “Appendix E: Code Page Support

Again, none of these are ANSI or ISO standards; they are Microsoft-specific.

You can find more info on the following Wikipedia page: Windows code page. That page, in the “ANSI code page” section, also mentions how this mislabeling of Microsoft code pages as being “ISO” came to be. The source for that page is the “Glossary of Terms…” page quoted in the previous section.

 

Summary

CP1 / iso_1

  • means “ISO-8859-1”,
    but….
  • is “Windows-1252”

While those two code pages are very similar, they are not the same. They differ only in the 0x80 – 0x9F range:

  • “ISO-8859-1”: “C1” control code
  • “Windows-1252”: 27 usable / printable characters

Essentially, "CP1" in the collation names and "iso_1" as the associated character set name are mistakes, but they have propagated too far, and for too long, to be corrected (similar to the "referer" HTTP header which should have been "referrer").

ANSI Code Page(s) / ACP

Refers to one or more code pages within the following contexts:

  • In various places within SQL Server (and possibly other Microsoft products): “Windows-1252”
  • Within Microsoft in general: 874, 932, 936, 949, 950, 951, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1361
  • In reality: none of them

Why Windows code pages (especially 1252) are mislabeled as ISO / ANSI

“Windows code page 1252 was based on an ANSI draft which eventually became ISO 8859-1.”

Well, that’s what we’ve been told, and it certainly does help explain how it all started. But, I don’t think it explains the scope of the mislabeling, especially how “ISO” and “ANSI” are sometimes used to refer to a set of code pages, sometimes a single code page, and sometimes even one specific code page: 1252. I believe a combination of factors have contributed to this inconsistent, and often erroneous, usage:

  1. poor communication: Microsoft is a large company that produces a good number of software products, and that means a large number of teams. Coordinating changes across teams gets considerably harder as the number of teams increases.
  2. poor documentation: When the Windows and ISO 8859 series of code pages were being created, it seems that documentation was not revised to reflect the current status of the standards as they were being approved and published.
  3. poor overall understanding: The documentation team can’t provide accurate documentation if they are given incomplete or erroneous information from the product teams, and if they aren’t told about changes, and if they themselves don’t understand the topic they are documenting well enough to identify areas that require additional testing or push-back to a product team (nothing against the documentation team: encodings are complicated, and it’s often quite difficult to identify what you don’t know). I think the ACP code page option for BCP.exe, BULK INSERT, and OPENROWSET(BULK…) might end up being a good example of this. The documentation for all three currently states that this value means code page 1252, but I believe it’s more likely that the ACP option pulls the ANSI Code Page value from the OS (similar to the OEM option pulling the OEM Code Page value from the OS). I will investigate that later.

Post Update History
  • 2021-06-07 @ 12:30 EST / 2021-06-07 @ 16:30 UTC — Added note in the Summary: CP1 section about "CP1" being a mistake like the "referer" HTTP header.
  • 2021-06-28 @ 02:30 EST / 2021-06-28 @ 06:30 UTC — Added “Part B” and “Part C” in the Clue Numero Dos section.

2 thoughts on “SQL Server Collations: What does “CP1” mean in “SQL_Latin1_General_CP1_CI_AS”?”

Leave a Reply