heistak/your-code-displays-japanese-wrong

Question - " the default fallback behavior in an ambiguous situation is to choose the Chinese glyph set" - is this true?

LittleWhole opened this issue · 2 comments

Hi, I'm Mainland Chinese and I know and encounter this problem a lot too, and agree that it needs more exposure and solutions.

However, I have a question about the line "the default fallback behavior in an ambiguous situation is to choose the Chinese glyph set". In my personal (anecdotal, so yes, inherently flawed) experience, I've always observed that the opposite is true. For example, on a fresh install of Windows 11 or Ubuntu, going into YouTube causes Simplified Chinese to be displayed in the Japanese font, most notably the squished-together 复 and 关, and the 直 with the extra vertical stroke to the left.

Actually, just typing those in right now, I just realized that GitHub is doing the same thing for me right now...
image

I've always chalked this up to "Japan industrialized before us, they must have created computer fonts before us, it's only natural". But this is the first time I've heard about Chinese glyphs being displayed in place of Japanese glyphs. I found that even Traditional Chinese tends to be displayed in Japanese style (like 備).
image

Even Kyuujitai Japanese glyphs take precedence over Traditional Chinese glyphs (like 縣).
image

I'm just a little curious about the default behaviors and roots of the problem 😅 Would be nice to get more concrete information about the problem.

I think it is not 100% accurate to say "the default fallback behavior in an ambiguous situation is to choose the Chinese glyph set" without any context, at least in Windows. It is a little bit more complicated than that.

In Windows this "fallback" really depends on multiple factors:

  • It checks the language / regional settings that the app's developers have set
  • It also considers the system regional settings. If it is a legacy app that does not use Unicode, it also checks the non-Unicode settings (normally we don't care about this)
  • Since Windows 8, it also takes the Language List into consideration. Remember this?

image

(Ignore the Japanese UI :))

The order of the languages actually matters because, depending on the app and the situation, Windows may look for fallback fonts according to this list. I will show you a bit.

First, let's look at a freshly installed Windows 10 in English (US).
What I did was write a very simple HTML file, save it in UTF-8 encoding, and open it in a browser.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>刃</title>
</head>

<body>
<h1>This is 刃. In Simplified Chinese it is 刃</h1>
</body>

</html>

(Yeah I typed the first 刃 using Japanese IME and second one using Chinese IME)
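
As a quick check of why the file itself cannot tell the browser which glyph to draw: because of Han unification, both characters are stored as the same code point no matter which IME produced them. A minimal Python sketch, assuming the character is U+5203 in both places (the variable names are made up for illustration):

```python
import unicodedata

# Assumption: both characters in the test page are the unified ideograph U+5203.
typed_with_japanese_ime = "\u5203"
typed_with_chinese_ime = "\u5203"


def describe(ch: str) -> str:
    """Return the code point and the Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The UTF-8 bytes saved to disk are identical, so no language information survives.
same_bytes = (
    typed_with_japanese_ime.encode("utf-8") == typed_with_chinese_ime.encode("utf-8")
)

print(describe(typed_with_japanese_ime))
print(describe(typed_with_chinese_ime))
print("identical on disk:", same_bytes)
```

So whatever difference shows up on screen has to come from the font the renderer picks, not from the text.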

Here is the result. Remember this is a brand-new Windows 10 with default English (US) settings. No IME installed.

image

Notice the two different glyphs? This is where things get complicated.
(See https://docs.microsoft.com/en-us/globalization/input/font-technology for more information on the following)

Windows uses a multi-level font fallback mechanism. Apps (in this case Microsoft Edge) may choose Chinese (Simplified) as their fallback.
(I guess this may be why the author says the "default" fallback is Chinese -- he might be talking about Web app development scenarios)

However, the tab bar uses Japanese. This might be where the Windows Font Link mechanism kicks in. If you use Registry Editor to check HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink, you will find a bunch of system default UI font links. English (US) uses Segoe UI, so let's see what it has:

image

You see, it tries to fall back to Tahoma, then goes directly to Meiryo UI, which is a Japanese font.

So at least in Windows, it really depends on what context you are talking about for "default" fallback.
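
If you would rather not click through Registry Editor, the same SystemLink value can be read from a script. This is only a rough sketch: it assumes each entry is a comma-separated string of font file, face name, and optional scaling numbers, and the SAMPLE list is an illustrative stand-in, not a copy of the screenshot above. The helper names are hypothetical.

```python
import sys

KEY_PATH = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink"


def parse_link_entry(entry: str) -> dict:
    """Split one SystemLink string into file name, face name, and scaling numbers.

    Assumed layout: "FILE,Face name[,number,number]".
    """
    parts = [p.strip() for p in entry.split(",")]
    face = parts[1] if len(parts) > 1 else None
    scaling = [int(p) for p in parts[2:] if p.isdigit()]
    return {"file": parts[0], "face": face, "scaling": scaling}


def read_system_links(base_font: str) -> list:
    """Read the linked fonts for base_font from the registry (Windows only)."""
    import winreg  # standard library, but only available on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        values, _value_type = winreg.QueryValueEx(key, base_font)
    return [parse_link_entry(v) for v in values]


# Illustrative entries only; real values differ between Windows versions.
SAMPLE = ["TAHOMA.TTF,Tahoma", "MEIRYO.TTC,Meiryo UI,118,96"]

if sys.platform == "win32":
    chain = read_system_links("Segoe UI")
else:
    chain = [parse_link_entry(s) for s in SAMPLE]

for position, link in enumerate(chain, start=1):
    print(position, link["file"], link["face"], link["scaling"])
```

The order of the printed entries is the order in which the linked fonts are listed for Segoe UI.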


But remember I mentioned the Language List?

So if you add the Japanese IME and the Chinese (Simplified) IME, then re-open this HTML file, you will see something like this:

image

This is when Japanese moves forward.

image

This is when Chinese Simplified moves forward.

To make this more fun, let's try Chinese (Traditional, Taiwan):

image

See? Microsoft Edge honors the Windows Language List.
(Again, notice the tab bar font never changes, because the system regional settings are not changed)
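
To make the observed behavior concrete, here is a toy model of it in Python. It is not Windows' or Edge's actual algorithm, just the pattern the screenshots suggest: walk the Language List in order, use the first CJK language found, and otherwise fall back to Chinese (Simplified). The function name, the tag prefixes, and the font mapping are all assumptions for illustration.

```python
# Hypothetical mapping from a language tag prefix to a CJK UI font.
CJK_FONT_BY_LANGUAGE = {
    "ja": "Yu Gothic UI",
    "zh-Hans": "Microsoft YaHei UI",
    "zh-Hant": "Microsoft JhengHei UI",
}

# Assumed default when no CJK language is in the list (what the fresh install showed).
DEFAULT_HAN_FONT = "Microsoft YaHei UI"


def pick_han_font(language_list: list, default: str = DEFAULT_HAN_FONT) -> str:
    """Return the font for the first CJK language in the user's ordered list."""
    for tag in language_list:
        for prefix, font in CJK_FONT_BY_LANGUAGE.items():
            if tag == prefix or tag.startswith(prefix + "-"):
                return font
    return default


print(pick_han_font(["en-US"]))                      # no CJK language at all
print(pick_han_font(["en-US", "ja", "zh-Hans-CN"]))  # Japanese moved forward
print(pick_han_font(["en-US", "zh-Hans-CN", "ja"]))  # Chinese (Simplified) moved forward
print(pick_han_font(["en-US", "zh-Hant-TW", "ja"]))  # Chinese (Traditional, Taiwan)
```

Only the ordering matters in this model, which matches what happens when a language is moved up or down in the list.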


Let's see what other browsers do. First, Google Chrome:

image

OH!!! It does not care about the Language List! (Notice the language list order)
(Which means it only checks its own language list. Once I added Japanese there, it showed the Japanese glyph)

Mozilla Firefox:

image

This behaves the same as Microsoft Edge. It honors the Language List.

What if I remove all languages and then try Firefox again?

image

Yep, this is the same as Microsoft Edge.


Just for fun, I also tested this on a freshly installed Lubuntu 20.04 LTS (Ubuntu's LXDE/LXQt variant). Here are the results.
(There are too many Linux GUIs, so I just tested what I use often)

image

Firefox. Notice the different glyphs.
(And before you ask -- Yes, this is an older version of Firefox. But I ran this test again in the latest Firefox version as well, and the result is the same)

image

Chromium. The default fallback becomes Japanese! Not Chinese (Simplified) anymore!
(And yes, the language list of Chromium is English + English (US) only. No Google account login, either)

image


So bottom line:

  • For Web pages and Web apps running in Windows, it is true that the "default" fallback is Chinese (Simplified) for Han/Kanji characters. To make it Japanese only, the author's suggestion of explicitly specifying the language works (and I think it should be used for other languages as well).
    However, Linux GUIs may not behave like this. Windows native apps also don't use Chinese as the fallback.
  • Different Web browsers may behave differently depending on their runtime environments, implementations and settings, which complicates the situation:
    • Microsoft Edge and Mozilla Firefox on Windows honor the Windows Language List for Web page content
    • Google Chrome does not care about the Windows Language List and only uses its own
    • The user interface parts of Microsoft Edge and Google Chrome use Windows' own font fallback mechanism (Font Link, possibly)
  • So really and seriously, the best way out of such a chaotic situation is NOT to rely on these fallback mechanisms, and instead to explicitly specify your intended language / culture when you are doing localization / internationalization, no matter what type of app you are making.
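
For Web content, "explicitly specify your intended language" usually comes down to the HTML `lang` attribute. Below is a small sketch of generating the test page with every Han string tagged, so nothing is left to fallback. The helper names `tag_language` and `build_page` are made up for this example, and the tags `ja` and `zh-Hans` are just the two languages from the test above.

```python
from html import escape


def tag_language(text: str, lang: str, element: str = "span") -> str:
    """Wrap text in an element that carries an explicit language tag."""
    return f'<{element} lang="{escape(lang, quote=True)}">{escape(text)}</{element}>'


def build_page(title: str, body_html: str, lang: str) -> str:
    """Build a minimal UTF-8 HTML page with a document-level language."""
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang, quote=True)}">\n'
        f'<head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>{body_html}</body>\n"
        "</html>\n"
    )


# Same code point twice, but each occurrence now states which language it belongs to.
heading = (
    "<h1>This is " + tag_language("\u5203", "ja")
    + ". In Simplified Chinese it is " + tag_language("\u5203", "zh-Hans") + "</h1>"
)
page = build_page("\u5203", heading, lang="en")
print(page)
```

With the language stated per element, the browser no longer has to guess from the Language List or its own settings.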

BTW I highly recommend the following references if you want to know more about Unicode and how Asian characters are handled in general:

CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (Second Edition)

I also read this before, though it is quite old (this is a Microsoft book):

Developing International Software (2nd Edition)


(And sorry about this long post in this issue)
(EDIT: And I forgot the macOS platform...Well, it doesn't matter, you get the idea)
(EDIT 2: Attached test.zip)

Apologies for not noticing this thread earlier!
My original motive for creating this document was being annoyed after seeing incorrect glyphs in video games so many times, from indie releases to even major releases by large international companies like Minecraft. Hence, my mindset when writing this was mostly focused on video games, where text rendering is usually implemented separately from the native OS, and the issue does seem to happen very very frequently. The sentence in question does say "in many cases" -- not "all" or "most" cases -- so while I agree the default behavior differs depending on the environment, I would still like to stand by my original wording.