(last updated: 2021-05-31 @ 19:00 EST / 2021-05-31 @ 23:00 UTC )
Many of us that work with Microsoft SQL Server, especially those of us working on systems with US English as the OS language, or anyone working with SQL Server Express LocalDB (commonly referred to as just “LocalDB”), have had to deal with the SQL_Latin1_General_CP1_CI_AS
collation.
Collation names, at least in SQL Server, are structured in that they contain various pieces of information detailing some of the properties of the collation. The collation name of SQL_Latin1_General_CP1_CI_AS
can be broken down as follows:
SQL_
This indicates that the collation is a SQL Server collation, while names without this prefix indicate Windows collationsLatin1_General
This is the culture (a.k.a. locale) used for the sorting and comparison rulesCP1
This is the 8-bit [C]ode [P]ageCI
This indicates [C]ase- ensitive or nsensitiveAS
This indicates [A]ccent- ensitive or nsensitive
Ok, but there’s no code page "1". So, what’s going on there?
TL; DR: Skip to “Summary”
Look a Little Deeper
Let’s take a closer look at the actual code page values for each name:
SELECT col.[name], SUBSTRING(col.[name], 5, ISNULL(NULLIF(CHARINDEX(N'_Pref', col.[name]), 0), CHARINDEX(N'_CP', col.[name])) - 5) AS [Locale / Culture], SUBSTRING(col.[name], CHARINDEX(N'_CP', col.[name]), PATINDEX(N'%[_][BC][IS]%', col.[name]) - CHARINDEX(N'_CP', col.[name]) + 1) AS [CP from name], COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage] FROM sys.fn_helpcollations() col WHERE col.[name] LIKE N'SQL[_]%' ORDER BY [CodePage], col.[name]; -- 77 rows
The query shown above returns the result set shown below (I removed approximately 37 rows that didn’t show any meaningful variation):
Locale / CP Code name Culture from name Page SQL_Latin1_General_CP437_CI_AS Latin1_General _CP437_ 437 SQL_1xCompat_CP850_CI_AS 1xCompat _CP850_ 850 SQL_AltDiction_CP850_CI_AI AltDiction _CP850_ 850 SQL_Latin1_General_CP850_BIN Latin1_General _CP850_ 850 SQL_Scandinavian_CP850_CS_AS Scandinavian _CP850_ 850 SQL_Croatian_CP1250_CS_AS Croatian _CP1250_ 1250 SQL_Czech_CP1250_CS_AS Czech _CP1250_ 1250 SQL_Hungarian_CP1250_CS_AS Hungarian _CP1250_ 1250 SQL_Latin1_General_CP1250_CS_AS Latin1_General _CP1250_ 1250 SQL_Polish_CP1250_CI_AS Polish _CP1250_ 1250 SQL_Romanian_CP1250_CI_AS Romanian _CP1250_ 1250 SQL_Slovak_CP1250_CS_AS Slovak _CP1250_ 1250 SQL_Slovenian_CP1250_CS_AS Slovenian _CP1250_ 1250 SQL_Latin1_General_CP1251_CS_AS Latin1_General _CP1251_ 1251 SQL_Ukrainian_CP1251_CI_AS Ukrainian _CP1251_ 1251 SQL_Danish_Pref_CP1_CI_AS Danish _CP1_ 1252 SQL_EBCDIC037_CP1_CS_AS EBCDIC037 _CP1_ 1252 SQL_EBCDIC1141_CP1_CS_AS EBCDIC1141 _CP1_ 1252 SQL_EBCDIC273_CP1_CS_AS EBCDIC273 _CP1_ 1252 SQL_EBCDIC277_2_CP1_CS_AS EBCDIC277_2 _CP1_ 1252 SQL_EBCDIC277_CP1_CS_AS EBCDIC277 _CP1_ 1252 SQL_EBCDIC278_CP1_CS_AS EBCDIC278 _CP1_ 1252 SQL_EBCDIC280_CP1_CS_AS EBCDIC280 _CP1_ 1252 SQL_EBCDIC284_CP1_CS_AS EBCDIC284 _CP1_ 1252 SQL_EBCDIC285_CP1_CS_AS EBCDIC285 _CP1_ 1252 SQL_EBCDIC297_CP1_CS_AS EBCDIC297 _CP1_ 1252 SQL_Icelandic_Pref_CP1_CI_AS Icelandic _CP1_ 1252 SQL_Latin1_General_CP1_CI_AI Latin1_General _CP1_ 1252 SQL_SwedishPhone_Pref_CP1_CI_AS SwedishPhone _CP1_ 1252 SQL_SwedishStd_Pref_CP1_CI_AS SwedishStd _CP1_ 1252 SQL_AltDiction2_CP1253_CS_AS AltDiction2 _CP1253_ 1253 SQL_Latin1_General_CP1253_CI_AI Latin1_General _CP1253_ 1253 SQL_MixDiction_CP1253_CS_AS MixDiction _CP1253_ 1253 SQL_Latin1_General_CP1254_CS_AS Latin1_General _CP1254_ 1254 SQL_Latin1_General_CP1255_CI_AS Latin1_General _CP1255_ 1255 SQL_Latin1_General_CP1256_CS_AS Latin1_General _CP1256_ 1256 SQL_Estonian_CP1257_CI_AS Estonian _CP1257_ 1257 SQL_Latin1_General_CP1257_CS_AS Latin1_General _CP1257_ 1257 SQL_Latvian_CP1257_CI_AS Latvian _CP1257_ 1257 SQL_Lithuanian_CP1257_CS_AS Lithuanian _CP1257_ 1257
From those results we can clearly see that CP1
is code page 1252 (officially known as “Windows-1252”).
Um, ok, but why is it "CP1" in the collation name instead of "CP1252"? It’s the only one that doesn’t fit the pattern. Why is it special?
Clue # ????? (Echad) ( 1 )
By querying INFORMATION_SCHEMA.COLUMNS
we can see the character set name (this is the name associated with the code page):
DECLARE @CP1 TABLE ([_CollationCodePageTest_] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS); SELECT col.[TABLE_NAME], col.[CHARACTER_SET_NAME], col.[COLLATION_NAME] FROM tempdb.INFORMATION_SCHEMA.COLUMNS col WHERE col.[COLUMN_NAME] = N'_CollationCodePageTest_'; /* TABLE_NAME CHARACTER_SET_NAME COLLATION_NAME #B03667F2 iso_1 SQL_Latin1_General_CP1_CI_AS */
And, if your instance is using a CP1
collation as the instance (i.e. server) -level collation (and sadly that is the only option for SQL Server Express LocalDB), then you can use the following query to also see the sort order name:
SELECT SERVERPROPERTY('Collation') AS [Collation], SERVERPROPERTY('SqlCharSetName') AS [SqlCharSetName], SERVERPROPERTY('SqlSortOrderName') AS [SqlSortOrderName]; /* Collation SqlCharSetName SqlSortOrderName SQL_Latin1_General_CP1_CI_AS iso_1 nocase_iso */
Well, that’s not very helpful. Or, maybe it is? I mean, the "1" matches, but the "iso" part doesn’t really narrow it down. We can safely assume that this does not mean the actual ISO 1 standard: “Geometrical product specifications”. But, the International Organization for Standardization (i.e. ISO) currently maintains 50 active standards within the "Coding of Character Sets" category.
Also, all of the other "SqlCharSetName" values match the following pattern:
- Non-Unicode code pages: "cp{code_page_number}" (i.e. same as in the collation name, just with lower-case "cp")
- UTF-8 code pages (starting in SQL Server 2019): "utf8"
- Unicode-only collations: "<NULL>"
It is possible, however, that "iso" refers to the ISO/IEC 8859 series of encodings, given that “Part 1” of that series is “Latin-1”. The Wikipedia article for the “Windows-1252” encoding states:
This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range. Notable additional characters include curly quotation marks and all the printable characters that are in ISO 8859-15 (at different places than ISO 8859-15). It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".
It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1. A common result was that all the quotes and apostrophes (produced by "smart quotes" in word-processing software) were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding.
For anyone who’s curious, here’s an example of the confusion surrounding these two character sets / encodings / code pages. It’s a W3C discussion concerning their HTML validator from May of 2001: RE: iso-8859-1-Windows-3.1-Latin-1.
Could it be that Microsoft mislabeled “Windows-1252” (their own encoding) with “ISO-8859-1”?
Clue Numero Dos ( 2 )
Recently, while researching another topic, I came across the following SQL Server 7.0 documentation (published in 1998):
Code Page 1252 (ISO Character Set)
Code page 1252 (ISO character set) is the default character set. It is also known as the ISO 8859-1, Latin 1, or ANSI character set. It is compatible with the ANSI characters used by the Microsoft® Windows NT® and Microsoft Windows® operating systems.
( Download self-extracting zip file containing the SQL Server 7.0 Books Online documentation )
So, there we have it. Microsoft did, in fact, mislabel “Windows-1252” as being “ISO-8859-1”. And also as being “ANSI” (more on that in a moment). But, why, why, why?
Confirmation
Looking around the interwebs, I found a few sources of explanation. Or, more likely, a few variations of one source.
Oldest
In checking one of the references on that “Windows-1252” page on Wikipedia, I found the following quote which explains how this mislabeling came to be:
The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see “ANSI” or “A” used to signify Windows code page support.
(Cathy Wissink, “Unicode and Windows XP” PDF via archive.org, May 2002, Page 1)
Recent
A Microsoft documentation page that’s no longer on the live site, Glossary of Terms Used on this Site (archived on 2018-12-08), has similar wording:
ANSI: Acronym for the American National Standards Institute. The term “ANSI” as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community. The source of this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft—which became International Organization for Standardization (ISO) Standard 8859-1. “ANSI applications” are usually a reference to non-Unicode or code page–based applications.
The italicized portion above is identical to what Cathy Wissink said back in 2002. The earliest capture for that page is from March of 2009:
Glossary of Terms Used on this Site (archived on 2009-03-09)
While I’m sure that the page existed in some form prior to 2009, it certainly seems like Cathy is the source of that statement.
Current
The active Microsoft documentation page, Internationalization: Code Pages (current as of 2021-05-31), contains the following note:
Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows code page 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.
The earliest capture for that page is from May of 2008:
International Features: Code Pages (archived on 2008-05-05)
This appears to just be a rewording of what Cathy wrote (back in 2002).
Bonus Round: ANSI
Both the SQL Server 7.0 documentation (two sections above) and Cathy Wissink (section directly above) mention “ANSI”, the American National Standards Institute. This is another term that is misused within Microsoft / Windows / SQL Server, and in two different ways (how fun is that!?).
First, “ANSI” is sometimes used to mean specifically “Windows-1252”. This can be seen in the SQL Server 7.0 documentation quoted previously, as well as in the documentation for some SQL Server features / utilities such as the ACP
([A]NSI [C]ode [P]age) code page value for:
- BCP.exe Utility (e.g.
-C ACP
) - BULK INSERT (e.g.
WITH (CODEPAGE = 'ACP')
) - OPENROWSET(BULK…) (e.g.
CODEPAGE = 'ACP'
)
(but not SQLCMD.exe Utility).
For all three of those, the documentation describes ACP
as:
ANSI/Microsoft Windows (ISO 1252)
Of course, there is no ISO 1252 standard. Which brings us to the second way in which the term “ANSI” is (mis)used. In a more general sense, it refers to the following list of code pages:
- 874 ( Thai )
- 932 ( Japanese )
- 936 ( Chinese {simplified} )
- 949 ( Korean )
- 950 ( Chinese {traditional} )
- 951 ( Chinese {traditional} ) Note: only found this listed in one place
- 1250 ( Latin 2 / Central European )
- 1251 ( Cyrillic )
- 1252 ( Latin 1 / Western European )
- 1253 ( Greek )
- 1254 ( Turkish )
- 1255 ( Hebrew )
- 1256 ( Arabic )
- 1257 ( Baltic )
- 1258 ( Vietnamese )
- 1361 ( Korean {Johab} ) Note: only found this listed on “Appendix E: Code Page Support”
Again, none of these are ANSI or ISO standards; they are Microsoft-specific.
You can find more info on the following Wikipedia page: Windows code page. That page, in the “ANSI code page” section, also mentions how this mislabeling of Microsoft code pages as being “ISO” came to be. The source for that page is the “Glossary of Terms…” page quoted in the previous section.
Summary
CP1
- means “ISO-8859-1”,
but….
- is “Windows-1252”
While those two code pages are very similar, they are not the same. They differ only in the 0x80 – 0x9F range:
- “ISO-8859-1”: “C1” control code
- “Windows-1252”: 27 usable / printable characters
ANSI Code Page(s) / ACP
Refers to one or more code pages within the following contexts:
- In various places within SQL Server (and possibly other Microsoft products): “Windows-1252”
- Within Microsoft in general: 874, 932, 936, 949, 950, 951, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258, 1361
- In reality: none of them
Why Windows code pages (especially 1252) are mislabeled as ISO / ANSI
“Windows code page 1252 was based on an ANSI draft which eventually became ISO 8859-1.”
Well, that’s what we’ve been told, and it certainly does help explain how it all started. But, I don’t think it explains the scope of the mislabeling, especially how “ISO” and “ANSI” are sometimes used to refer to a set of code pages, sometimes a single code page, and sometimes even one specific code page: 1252. I believe a combination of factors have contributed to this inconsistent, and often erroneous, usage:
- poor communication: Microsoft is a large company that produces a good number of software products, and that means a large number of teams. Coordinating changes across teams gets considerably harder as the number of teams increases.
- poor documentation: When the Windows and ISO 8859 series of code pages were being created, it seems that documentation was not revised to reflect the current status of the standards as they were being approved and published.
- poor overall understanding: The documentation team can’t provide accurate documentation if they are given incomplete or erroneous information from the product teams, and if they aren’t told about changes, and if they themselves don’t understand the topic they are documenting well enough to identify areas that require additional testing or push-back to a product team (nothing against the documentation team: encodings are complicated, and it’s often quite difficult to identify what you don’t know). I think the
ACP
code page option for BCP.exe, BULK INSERT, and OPENROWSET(BULK…) might end up being a good example of this. The documentation for all three currently states that this value means code page 1252, but I believe it’s more likely that theACP
option pulls the ANSI Code Page value from the OS (similar to theOEM
option pulling the OEM Code Page value from the OS). I will investigate that later.