This repo is a sample for the dotnet/coreclr issue here: https://github.com/dotnet/coreclr/issues/20388
- Issue Description
- Issue Cause Summarized
- Issue Symptoms
- Apparent Root Cause
- Possible Failure of GetCultureInfo
- Running the Test Application
Certain valid locales cannot be used for localization in .NET Core on Unix-based environments, because they are not recognized by CultureInfo and its surrounding classes. Although not the only affected locale, this is most easily reproduced with zh-TW (Chinese, Taiwan).
This affects any locale that is an "aliased" locale in ICU (International Components for Unicode).
To summarize what appears to be the cause of the issue:
- Microsoft's mscorlib in dotnet/coreclr uses the third-party library ICU (International Components for Unicode) to support non-Windows, platform-agnostic localization features, such as knowing whether a given locale is valid and details about the locale (collation, numeric formatting, etc.).
- ICU is only used in non-Windows runtimes of .NET Core
- To save on data space while providing a robust parenting algorithm, ICU defines certain locales as "aliases" of others.
- ICU aliases are intentionally not returned by ICU when requesting a list of locales.
- Some Culture concerns in mscorlib currently depend on getting an up-front list of all available cultures.
- CultureData.EnumCultures, which in turn calls the ICU C API uloc_getAvailable via native interop, therefore fails to obtain any cultures that are defined as aliases in ICU, and these cultures become invalid or are populated with incorrect data in various contexts.
Further details can be found in the root cause analysis section below.
There are a number of symptoms that led to the discovery, including but not limited to the following - all of which are demonstrated in the test application further below:
The issue was first noticed when zh-TW resource files were missing from our published applications.
This involves a .NET Core 2.1 project which uses localized resource (.resx) files to localize strings. As per the usual resx pattern, a default document exists (e.g., MyStrings.resx), and locale-specific files are named with the locale name before the extension; one of these is zh-TW (e.g., MyStrings.zh-TW.resx). During the build & publish process, these files are compiled into MyApp.resources.dll, and the localized versions are copied into sub-folders named after the locale.
When building on Windows, this functions correctly. However, when running dotnet publish on a Linux environment, the zh-TW folder will be missing.
The affected cultures are completely missing from the list of available cultures, obtained when calling:
var allCultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
Our first attempt at a quick fix was a workaround in our .csproj file that forces the zh-TW resource file to be copied into the appropriate folder during a publish. This ultimately didn't solve anything: while it worked when tested on Windows, when the app ran under Linux and a string was requested with culture zh-TW, the English strings were returned.
This workaround can be seen in action in the test app for zh-CN specifically:
<Target Name="AssignCustomCultures" AfterTargets="SplitResourcesByCulture" Condition="'$(EnableAssignCustomCultures)' != ''">
  <ItemGroup>
    <EmbeddedResource Condition="$([System.IO.Path]::HasExtension(%(Filename))) AND $([System.IO.Path]::GetExtension(%(Filename)).TrimStart('.')) == '$(EnableAssignCustomCultures)'">
      <Culture>$([System.IO.Path]::GetExtension(%(Filename)).TrimStart('.'))</Culture>
      <WithCulture>true</WithCulture>
    </EmbeddedResource>
  </ItemGroup>
</Target>
When trying to get a CultureInfo object for zh-TW, the object returned has some values that differ between Linux and Windows. Most notably, the parent locale is expected to be zh-Hant, and is defined as such in ICU (zh-TW is an alias of zh-Hant-TW, which is a child of zh-Hant), but that is not the parent culture returned when running under Linux.
After a great deal of investigation, I have narrowed down what I believe to be the root cause.
My starting point for uncovering a problem here began with the failure to publish zh-TW resource files. A look at a verbose build log during a dotnet publish showed the following:
Added Item(s):
ResxWithCulture=
Resources/MyNetCoreProject.MyResources.de.resx
Culture=de
OriginalItemSpec=Resources/MyNetCoreProject.MyResources.de.resx
TargetPath=Resources/MyNetCoreProject.MyResources.de.resx
WithCulture=true
Resources/MyNetCoreProject.MyResources.ja-JP.resx
Culture=ja-JP
OriginalItemSpec=Resources/MyNetCoreProject.MyResources.ja-JP.resx
TargetPath=Resources/MyNetCoreProject.MyResources.ja-JP.resx
WithCulture=true
Removed Item(s):
_MixedResourceWithNoCulture=
Resources/MyNetCoreProject.MyResources.zh-TW.resx
OriginalItemSpec=Resources/MyNetCoreProject.MyResources.zh-TW.resx
TargetPath=Resources/MyNetCoreProject.MyResources.zh-TW.resx
WithCulture=false
As you can see, the zh-TW resource file is removed because it is placed into the "WithNoCulture" bucket. To back-track how this happens, we start from the SplitResourcesByCulture target:
Microsoft/msbuild/src/Tasks/Microsoft.Common.CurrentVersion.targets - SplitResourcesByCulture -> Microsoft.Build.Tasks.AssignCulture.Execute
Inside AssignCulture.Execute(), we can see that Culture.GetItemCultureInfo is used to get culture info:
Culture.ItemCultureInfo info = Culture.GetItemCultureInfo
Microsoft.Build.Tasks.Culture.GetItemCultureInfo
validCulture = CultureInfoCache.IsValidCultureString(cultureName);
Microsoft.Build.Tasks.CultureInfoCache.IsValidCultureString
The CultureInfoCache is populated as such:
foreach (CultureInfo cultureName in AssemblyUtilities.GetAllCultures())
{
ValidCultureNames.Add(cultureName.Name);
}
Microsoft.Build.Shared.AssemblyUtilities.GetAllCultures
At this point, we finally get out of MSBuild code and into coreclr:
return CultureInfo.GetCultures(CultureTypes.AllCultures);
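The chain above can be modeled in a few lines. This is a minimal sketch in plain C++ with no MSBuild or .NET dependency, so it imitates the behavior rather than calling the real code. It assumes the culture candidate is the text between the last two dots of the file name and that validity is a plain, case-sensitive set lookup; `CultureSet`, `CandidateCulture` and `HasRecognizedCulture` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the set of names that the cache collects from
// the enumerated culture list.
using CultureSet = std::set<std::string>;

// Returns the candidate culture segment of a resource file name, i.e. the
// text between the last two dots ("MyResources.zh-TW.resx" -> "zh-TW").
// Returns an empty string when the name has fewer than two dots.
std::string CandidateCulture(const std::string& fileName) {
    const std::size_t extDot = fileName.rfind('.');
    if (extDot == std::string::npos || extDot == 0) return "";
    const std::size_t cultureDot = fileName.rfind('.', extDot - 1);
    if (cultureDot == std::string::npos) return "";
    return fileName.substr(cultureDot + 1, extDot - cultureDot - 1);
}

// Models the WithCulture=true/false outcome from the build log: a file only
// counts as localized if its candidate segment is in the enumerated set.
bool HasRecognizedCulture(const std::string& fileName, const CultureSet& valid) {
    const std::string candidate = CandidateCulture(fileName);
    return !candidate.empty() && valid.count(candidate) > 0;
}
```

With a set that contains de and ja-JP but not zh-TW (as happens when the enumeration omits aliases), `HasRecognizedCulture` returns true for the de and ja-JP files from the log and false for the zh-TW file, which matches the buckets shown in the build output.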
All roads lead to the CultureData class in dotnet/coreclr.
When calling CultureInfo.GetCultures:
CultureInfo.GetCultures -> CultureData.GetCultures -> CultureData(Unix).EnumCultures -> System.Globalization.Native/locale.cpp:GlobalizationNative_GetLocales
Ultimately ending here in System.Globalization.Native/locale.cpp:
int32_t localeCount = uloc_countAvailable();
if (localeCount <= 0)
return -1; // failed
for (int32_t i = 0; i < localeCount; i++)
{
const char *pLocaleName = uloc_getAvailable(i);
if (pLocaleName[0] == 0) // unexpected empty name
return -2;
This is where the call stack enters the C API of ICU, the International Components for Unicode library.
The uloc_countAvailable and uloc_getAvailable calls are part of the ICU C API. (Mostly functionally equivalent calls exist in the C++ API in the Locale class; however, in my testing they exhibit the same behavior as the C API calls.)
We can directly test that ICU isn't returning zh-TW in both the C API method used above and the C++ API, as I have done in this repo: CodingDinosaur/CultureIcuTest
To better understand why ICU isn't returning zh_TW and some other locales, we need to better understand how ICU's data works.
ICU's data comes from the CLDR - Unicode Common Locale Data Repository. This is part of what makes ICU attractive -- it is intentionally completely platform agnostic and does not depend on the hosting environment for locale information in any way. From the ICU FAQ:
What is the relationship between ICU locale data and system locale data?
There is no relationship. ICU is not dependent on the operating system for the locale data.
This also means that uloc_setDefault() does not affect the operating system. The function uloc_setDefault() only sets ICU's default locale. Normally the default locale for ICU is whatever the operating system says is the default locale.
For that reason, ICU includes all of its own data from CLDR.
CLDR database size has been an ongoing concern for ICU, and some locales which share the same data but don't have a parent / child relationship would require data duplication (or otherwise require logic more complex than was desired). To de-duplicate this data, ICU added the concept of locale "aliases". From ICU's documentation on ICU resource bundles:
A value can also be an "alias", which is simply a reference to another bundle's item. This is to save space by storing large data pieces only once when they cannot be inherited along the locale ID hierarchy (e.g., data in ICU shared among zh_HK and zh_TW).
When a locale name is defined as an alias in ICU, then from the standpoint of ICU it isn't a first-class locale -- merely a pointer to a "real" locale when requested. As a result, ICU does not return aliases when getting a list of locales, whether with uloc_getAvailable or Locale::getAvailableLocales (and uloc_countAvailable does not include them in its count).
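The difference between enumerating locales and asking for one by name can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not call ICU, the store layout (a map from name to alias target, where an empty target marks a real locale) is invented for illustration, and `LocaleStore`, `ListAvailable` and `Resolve` are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy locale store: key is the locale name, value is the alias target.
// An empty value marks a real locale; a non-empty value marks an alias.
using LocaleStore = std::map<std::string, std::string>;

// Enumeration in the style described above: alias entries are skipped,
// so they never show up in the returned list.
std::vector<std::string> ListAvailable(const LocaleStore& store) {
    std::vector<std::string> names;
    for (const auto& entry : store) {
        if (entry.second.empty()) names.push_back(entry.first);
    }
    return names;
}

// Direct lookup by name: follows a single alias hop and returns the name of
// the real locale, or an empty string if the name (or its target) is unknown.
std::string Resolve(const LocaleStore& store, const std::string& name) {
    const auto it = store.find(name);
    if (it == store.end()) return "";
    if (it->second.empty()) return name;
    return store.count(it->second) > 0 ? it->second : "";
}
```

For a store holding de, zh_Hant_TW and the alias zh_TW pointing at zh_Hant_TW, `ListAvailable` yields only de and zh_Hant_TW, while `Resolve` still maps zh_TW to zh_Hant_TW. Any consumer that validates names against the enumerated list alone would therefore reject zh_TW even though a direct request for it succeeds.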
That ICU does not return the aliases in this manner appears to be intentional, both based on the numerous references to a lack of alias mapping in the uloc documentation, and the following bug:
https://unicode-org.atlassian.net/browse/ICU-4309
uloc_getAvailable returns sr_YU, even though it is an %%ALIAS locale. None of the other %%ALIAS locales are returned.
TracBot made changes - 01/Jul/18 1:59 PM
Resolution: Fixed [ 10004 ]
Status: Done [ 10002 ]
That this bug was fixed is a further and very strong indication that ICU not returning locale aliases is intentional.
These are all the locales that have aliases for the purposes of locale identification. Other data types in ICU (such as collation and rule-based number formatting) have additional aliases.
Locale | Aliases |
---|---|
ar_SA | ars |
az_Latn_AZ | az_AZ |
bs_Latn_BA | bs_BA |
en_VU | en_NH |
en_ZW | en_RH |
fil | tl |
fil_PH | tl_PH |
he | iw |
he_IL | iw_IL |
id | in |
id_ID | in_ID |
nb | no |
nb_NO | no_NO |
nn_NO | no_NO_NY |
pa_Arab_PK | pa_PK |
pa_Guru_IN | pa_IN |
ro_MD | mo |
shi_Tfng_MA | shi_MA |
sr_Cyrl_BA | sr_BA |
sr_Cyrl_RS | sr_CS, sr_Cyrl_CS, sr_Cyrl_YU, sr_RS, sr_YU |
sr_Cyrl_XK | sr_XK |
sr_Latn | sh |
sr_Latn_BA | sh_BA |
sr_Latn_ME | sr_ME |
sr_Latn_RS | sh_CS, sh_YU, sr_Latn_CS, sr_Latn_YU |
uz_Arab_AF | uz_AF |
uz_Latn_UZ | uz_UZ |
vai_Vaii_LR | vai_LR |
yue_Hans_CN | yue_CN |
yue_Hant_HK | yue_HK |
zh_Hans_CN | zh_CN |
zh_Hans_SG | zh_SG |
zh_Hant_HK | zh_HK |
zh_Hant_MO | zh_MO |
zh_Hant_TW | zh_TW |
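Read as a lookup from alias to real locale, the table also shows why the parent of an alias matters. The sketch below uses a handful of rows from the table (not the full list) and derives a parent by dropping the last subtag. Deriving the parent by truncation is an assumption made for illustration, and `AliasTable`, `Canonicalize` and `TruncatedParent` are hypothetical names, not ICU or .NET APIs.

```cpp
#include <cstddef>
#include <map>
#include <string>

// A few rows from the table above, keyed by alias. Not the full list.
const std::map<std::string, std::string>& AliasTable() {
    static const std::map<std::string, std::string> table = {
        {"zh_TW", "zh_Hant_TW"},
        {"zh_CN", "zh_Hans_CN"},
        {"zh_HK", "zh_Hant_HK"},
        {"no", "nb"},
        {"iw", "he"},
    };
    return table;
}

// Returns the real locale name for an alias, or the input unchanged when the
// name is not an alias in this partial table.
std::string Canonicalize(const std::string& icuName) {
    const auto& table = AliasTable();
    const auto it = table.find(icuName);
    return it == table.end() ? icuName : it->second;
}

// Parent by truncation: drops the last '_'-separated subtag. Returns an empty
// string for a name that has no '_' at all.
std::string TruncatedParent(const std::string& icuName) {
    const std::size_t cut = icuName.rfind('_');
    return cut == std::string::npos ? "" : icuName.substr(0, cut);
}
```

Resolving the alias first gives zh_TW -> zh_Hant_TW -> zh_Hant, the parent expected for zh-TW. Truncating the alias directly gives zh and skips zh_Hant entirely. Whether this is what actually happens inside CultureData on Linux has not been verified here; it is only one way the parent could come out different.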
When one of the affected Cultures is obtained using CultureInfo.GetCultureInfo, the resulting CultureInfo object contains a mixture of expected and unexpected data. For example, the ANSICodePage is correct, but the parent locale is not.
When calling GetCultureInfo, the code path ultimately leads to native calls to get the specific locale here: System.Globalization.Native/locale.cpp:GlobalizationNative_GetLocaleName -> System.Globalization.Native/locale.cpp:GlobalizationNative_GetLocale
Unlike the previous example of getting all locales, ICU appears to return correct data for zh-TW, as can be seen in the ICU test app. However, it is possible that some of the numerous property initializers and constructors for CultureInfo and its child objects use or otherwise depend on the data previously obtained via CultureData.EnumCultures.
(Among other places, the CultureInfoConverter in System.ComponentModel depends on the full culture list.)
To run the test, you will need:
- .NET Core SDK 2.1+
- Docker
There are two test scripts, each with a .sh and a .bat version.
- To test on your current OS run test-CurrentPlatform.bat
- To test in Linux via Docker run test-dockerLinux.bat
- The docker image is based on the .NET SDK docker image.
My recommendation is that you run both tests from Windows using Docker for Windows, as this will give you the opportunity to see the differences in the results under either platform.
When running the test, take note of the following, which function correctly under Windows but not under Linux:
- When checking all available cultures with CultureInfo.GetCultures, zh-TW and zh-CN are both missing on Linux.
- The zh-TW resource file is missing, because it wasn't published during the dotnet publish.
- The zh-CN resource file would be missing too, but it was copied over via a custom build step.
- Even though the zh-CN resource file is present, on Linux zh-CN strings fail to be retrieved (falling back on the default).