WICG/translation-api

language tag handling needs more attention


Language tag handling

Tentatively, pending consultation with internationalization and translation API experts, we propose the following model. Each user agent has a list of (language tag, availability) pairs, which is the same one returned by translation.supportedLanguages(). Only exact matches for entries in that list will be used for the API.

The proposed mechanisms don't make sense. They require absolute tag matches in order to work, when the normal way for translation and locale-based mechanisms to work is either BCP47 Lookup or BCP47 Filtering.

Generally, for this type of API, Lookup is the preferred mechanism, usually with some additional tailoring (the insertion of missing subtags: Intl already provides this).

For example, if a system supports ja and en, then canTranslate() should match requests for en-US, en-GB, ja-JP or ja-u-ca-japanese, but not requests for ena, fr, or zh-Hans.

Failing to provide this sort of support would mean that implementations would have to provide dozens or hundreds of tags that they "support" and/or would require the caller to massage the tag (instead of passing it blindly). This is especially problematic in the "download" case, in which a site might generate dozens of spurious downloads due to mutations of the language tag.
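
To illustrate, here is a minimal sketch of the kind of Lookup behaviour I mean (the supportedSources list and the helper name are purely illustrative, not proposed API surface):

const supportedSources = ["en", "ja"];

// RFC 4647 section 3.4 "Lookup": progressively truncate the requested tag
// until it matches a supported entry; a dangling singleton is removed
// together with the subtag that followed it.
function lookupSupported(tag) {
  let candidate = tag.toLowerCase();
  while (candidate) {
    if (supportedSources.includes(candidate)) return candidate;
    let cut = candidate.lastIndexOf("-");
    if (cut === -1) return undefined;
    if (cut >= 2 && candidate[cut - 2] === "-") cut -= 2; // drop a dangling singleton too
    candidate = candidate.slice(0, cut);
  }
  return undefined;
}

lookupSupported("en-US");            // "en"
lookupSupported("en-GB");            // "en"
lookupSupported("ja-JP");            // "ja"
lookupSupported("ja-u-ca-japanese"); // "ja"
lookupSupported("ena");              // undefined
lookupSupported("fr");               // undefined
lookupSupported("zh-Hans");          // undefined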

Note: a deeper discussion, possibly in a joint teleconference, might be useful here.

I'll add some additional color as a personal comment here.

Note that there is a tension between source and target language tags. Most translation systems can consume a variety of different orthographic variations of a language to produce a given target language. For example, a language arc such as en=>fr might be able to consume both en-US and en-GB flavo(u)rs of the language (as well as similar-yet-different varieties, such as en-CA, en-AU, etc.) to produce some form of French. In most cases, that language arc will produce a specific form of French, e.g. fr-FR. It might be important to describe very specifically the source and target varieties for users selecting which language arc/language model to download, but equally important not to discriminate between these varieties in the API at runtime (where the additional specificity does more harm than good, such as repeated requests to download additional models, which turn out to be identical to the one already installed).

Note that script and macrolanguage differences remain important here, even when the language tags don't always specify the script. For example, a zh=>en language arc is probably supporting zh-Hans=>en rather than "any" variety of Chinese, since Simplified Chinese is most common. Similarly, tags such as zh-CN, zh-TW, zh-SG, zh-HK or zh-MO each imply a script subtag (zh-Hans-CN, zh-Hant-TW, etc. [they also imply that the language in question is cmn and not, for example, yue]). Allowing implementations to do matching or best-fit matching in canTranslate is probably more helpful than making the API list all the potential variations of language tags (in practice, all of the zh tags are either zh-Hant or zh-Hans, with the region indicating locale differences).
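
For what it's worth, the likely-subtags data this relies on is already exposed to the web through Intl.Locale.prototype.maximize(), so an implementation (or a caller) can make the implied script explicit without shipping its own table:

// maximize() fills in the likely script (and region) subtags from CLDR.
new Intl.Locale("zh").maximize().toString();    // "zh-Hans-CN"
new Intl.Locale("zh-TW").maximize().toString(); // "zh-Hant-TW"
new Intl.Locale("zh-CN").maximize().toString(); // "zh-Hans-CN"
new Intl.Locale("zh-HK").maximize().toString(); // "zh-Hant-HK"
new Intl.Locale("zh-SG").maximize().toString(); // "zh-Hans-SG"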

Thanks very much for your comments here. I have learned many new things. Let me try to get more concrete and propose a solution, to see if I've understood correctly.

First, we have to recognize that the ground truth of what is supported is a per-user-agent set of machine learning translation models. These models could have more specific or less specific capabilities, depending on how they were trained. Some semi-realistic examples:

  • Any Latin English to standard Japanese: the translation model can take any variety of English (US, UK, Canadian, ...) written in Latin script and output Japanese written in kanji + hiragana + katakana.
    • This model cannot understand English written in Braille script.
    • This model cannot output Japanese written in only hiragana + katakana.
  • Any Japanese to US English: the translation model can take Japanese written in a wide variety of scripts (romaji, hiragana-only, kanji + katakana + hiragana) and output US English written in Latin script.
    • This model cannot understand Japanese written in Braille script (e.g. as romaji).
    • This model will never output UK English specific words, e.g. "colour".
    • This model will never output English in a non-Latin script.
  • Taiwanese traditional Chinese to US English: the translation model can take traditional Chinese input, interpreting idioms and special vocabulary as they would be in Taiwan, and output US English written in Latin script.
    • This model does not understand simplified Chinese at all.
    • This model does not understand pinyin.
    • This model will give output that seems wrong to someone from Hong Kong, on certain inputs.
  • Some varieties of Chinese to US English: the translation model can take either traditional or simplified Chinese, written in a variety of scripts and with any regional dialect, and do its best to give the result in US English.
    • This model can understand various widely-used dialects of Chinese, but not all. E.g., it can understand traditional and simplified Chinese as used in Hong Kong, the PRC, and Taiwan. Or pinyin versions of those. But it cannot understand classical Chinese, or Min Dong Chinese.
    • This model tries to use context to determine whether to do, e.g., a Hong Kong-based vs. Taiwan-based translation when given traditional Chinese input. Probably it has some bias, but that bias is not intentional.

(Apologies for my lack of knowledge of Chinese... I hope it doesn't sidetrack the examples too badly.)

Given this sort of ground truth, we need an algorithm that takes arbitrary source and target language tags supplied by the web developer, and then selects the right translation model to use (or download) from this predefined list.

Here is one guess at such an algorithm:

  • Translate the supported language list into pairs of (RFC4647 extended language ranges, RFC3066 language tags). I think the above examples might translate to the following?
    • en-*-Latn-*-*-*-* => ja-Jpn-JP? (Or just ja-Jpn for the target?)
    • jp-*-*-*-*-*-* => en-Latn-US? (Or is * too broad for the source script, and we should have several entries, for Latn, Jpn, and Hrkt?)
    • zh-cmn-Hant-TW-*-*-* => en-Latn-US? (zh-cmn vs. cmn-* is unclear to me...)
    • zh-cmn-*-*-*-* => en-Latn-US? (Again maybe * is too broad for the script and region, and we should have several entries?)
  • Use RFC4647 extended filtering on source vs. the supported language pairs list.
    • This produces a list of pairs (i.e., language models) where the source language is supported.
    • If the list is size 0, then return; we cannot translate the given input.
  • Use RFC4647 lookup, with target as the language range and the list of target languages from our list of pairs.
    • This selects the best matching model within our previously narrowed-down list.
    • If we end up falling back to the default case, then return; we cannot translate the given input.

I think this algorithm works pretty well, although I'm still fuzzy on the best way to set up the list of supported language pairs. For example, if we set up jp-*-*-*-*-*-* => en-Latn-US for the second translation model, then I think the algorithm would return that translation model if given source = "jp-Brai". So probably my parenthetical note about having several script-specific entries is better?
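
To check my own understanding, here is a rough sketch of that algorithm in code. The arcs list and helper names are purely illustrative (I've only included the first two hypothetical models, using the script-specific entries for Japanese), and I've used Intl.Locale.maximize() as the "insert missing subtags" tailoring step mentioned earlier:

// Illustrative only: (source extended language range, target language tag) pairs.
const arcs = [
  ["en-Latn", "ja"],          // "any Latin-script English to Japanese"
  ["ja-Jpan", "en-Latn-US"],  // "any Japanese to US English", split into
  ["ja-Hrkt", "en-Latn-US"],  //   script-specific entries so that e.g.
  ["ja-Latn", "en-Latn-US"],  //   ja-Brai does not match
];

// RFC 4647 section 3.3.2 extended filtering: does `tag` match `range`?
function matchesExtendedRange(tag, range) {
  const t = tag.toLowerCase().split("-");
  const r = range.toLowerCase().split("-");
  if (r[0] !== "*" && r[0] !== t[0]) return false; // primary subtags must match
  let ti = 1, ri = 1;
  while (ri < r.length) {
    if (r[ri] === "*") { ri++; continue; }         // a wildcard matches anything
    if (ti >= t.length) return false;              // range is longer than the tag
    if (r[ri] === t[ti]) { ri++; ti++; continue; } // subtags match; advance both
    if (t[ti].length === 1) return false;          // never skip past a singleton
    ti++;                                          // otherwise skip this tag subtag and retry
  }
  return true;
}

function selectArc(requestedSource, requestedTarget) {
  // Tailoring step: insert likely subtags so that "en-US" can match "en-Latn".
  const source = new Intl.Locale(requestedSource).maximize().toString();

  // Step 1: extended filtering of the source against each arc's source range.
  const candidates = arcs.filter(([srcRange]) => matchesExtendedRange(source, srcRange));
  if (candidates.length === 0) return null; // we cannot translate this input

  // Step 2: Lookup-style truncation of the (maximized) target request against
  // the candidates' target tags. (Singleton handling omitted for brevity.)
  let want = new Intl.Locale(requestedTarget).maximize().toString().toLowerCase();
  while (want) {
    const hit = candidates.find(([, target]) => target.toLowerCase() === want);
    if (hit) return hit;
    const cut = want.lastIndexOf("-");
    if (cut === -1) break;
    want = want.slice(0, cut);
  }
  return null; // no arc produces the requested target
}

selectArc("en-US", "ja");               // ["en-Latn", "ja"]
selectArc("ja-u-ca-japanese", "en-US"); // ["ja-Jpan", "en-Latn-US"]
selectArc("ja-Brai", "en");             // null (Brai never matches the script-specific ranges)
selectArc("fr", "ja");                  // null (no arc consumes French)
selectArc("ja", "en-GB");               // null with this sketch; is that what we want?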

Lots to unpack here. I think it would help to get more involvement from the translation/localization community, who deal with these issues on a daily basis.

General notes to help the conversation along:

  • Translation processes, such as MT, refer to a specific language model as a language arc and that's probably a better choice than saying "language pairs" (for reasons we'll get into)
  • No one should be referring to RFC3066, which has been obsolete for eighteen years now. If you need the specific stable RFC number, refer to RFC5646 for language tags. Note that if you need the old, simpler, grammar for language tags, 5646 has the obs-language-tag production. Generally, one should just refer to BCP47 though.
  • Extended language ranges only need the * for missing fields between concrete subtags (or for a missing primary language subtag), e.g. ja-*-JP or *-US. While multiple * wildcards are permitted, they don't do anything. Note that the language subtag for Japanese is ja.
  • Most (but not all) languages are primarily written in a single script. The IANA Language Subtag Registry (ILSTR) contains Suppress-Script fields for some languages like this, and CLDR provides more information about this as well. This means that there is no need to refer to the script for most of these languages. Only languages such as (for example) Serbian, Azerbaijani, or Chinese, which are customarily or at least reasonably commonly written in more than one script, need to fuss with them. Transliteration, such as to the Brai script or to various non-native representations (ja-Latn, hi-Latn, etc.), is probably not something that MT language arcs provide for on input (they might provide it on output, though).
    • Note that Brai is a misleading example. Braille readers are generally assistive equipment and consume the source language in its normal textual representation/script.
  • There are a variety of issues with the Chinese examples you are using which I don't think are that useful for the purposes of discussion.

we have to recognize that the ground truth of what is supported is a per-user-agent set of machine learning translation models

There are two problems that you have here: selection and description. Selection refers to the (sometimes human-involved) process of choosing which language arcs can be applied to a given input text and then employing the best one for the task. Description involves making clear the internal settings/limitations of a given language arc.

For example, an arc such as en => fr might support any English variety, including either US or UK/International orthographic variations (it doesn't need to care how you spelled jail/gaol or colo(u)r and it can deal with you calling it the sidewalk or the pavement). The output, obviously, will be in French. But which French? It might use fr-FR as "Standard French". Does it also use the fr-FR locale to format dates and numbers included in the translation? What if the user prefers a regionally variant formatting, such as fr-SN (French/Senegal)? At the same time, in many cases, since the translation will not be exact, users might not care about the many many options.

If there is a reverse language arc available, the output will not just be en, since it must at least choose between en-US and en-001 (aka en-GB) orthographic variations (just as the French one had to choose between, say, fr-FR and fr-CA). I'm simplifying here, so don't quote the examples against me 😄

The en-001 in my example points up something else about MT language arc models. Many use "artificial" languages to be more general (or because MT cannot be so precise). For example, es-419 (Latin American Spanish) is a language spoken by no one, but read/consumed by many Spanish speakers. Modern Standard Arabic (ar) is a language that is both written and read by very many Arabic speakers, but not exactly anyone's spoken language (it's complicated). Our technology isn't so good that translations produce idiomatic replacements ("pot calling a kettle black" => "똥 묻은 개가 겨 묻은 개 나무란다" (dog stained with poo laughing at dog stained with rice: I copied the Korean from elsewhere, so it might be horribly wrong)).

You might support this by using lists of tags on either side of the arc description or by using language ranges. Users might prefer labels like en => fr with some information that it's really en (en-001, en-US, en-GB, en-AU, en-AE, ...) => fr (fr-FR).
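
Purely to illustrate the shape (not a proposed API), an arc description along those lines might look something like:

// A coarse, human-facing label plus the more precise ranges/tags behind it.
const enToFr = {
  label: "en => fr",
  sourceRanges: ["en-001", "en-US", "en-GB", "en-AU", "en"], // what the arc will accept
  targetTag: "fr-FR",                                        // what it actually produces
};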

Thanks again for your help. I appreciate your general notes and corrections. I used the expanded 7-segment format for the extended language ranges because I otherwise found it confusing, but I appreciate that people who have more experience in the field don't need that.

I agree with your framing of selection vs. description. In terms of the API I think that comes down to:

  • Description: the results of canTranslate() / canDetect()
  • Selection: given a user-provided { sourceLanguage, targetLanguage } pair, and a set of language arcs that the browser has, should the browser attempt to use a language arc? If so, which one? Or should it throw an error?

Anyway, I think I got too ambitious trying to give hypothetical examples and a full algorithm. Let me try to be more concrete. I'll focus just on description for now to scope it down further.

Let's say I was going to ship a translation API representing the capabilities of Google Translate's Japanese to English mode. Here are some representative inputs and outputs:

Input | Output
元気ですか? | How are you?
げんきですか？ | How are you?
ゲンキデスカ | Are you well?
genkidesuka | how are you
β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁ | β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁
π˜π‡π€π†π—π†π”π‡ππŠπ—π€ | π˜π‡π€π†π—π†π”π‡ππŠπ—π€
ㄍㄣㄎㄧ ㄉㄜㄙㄨ ㄎㄚ？ | ㄍㄣㄎㄧ ㄉㄜㄙㄨ ㄎㄚ？
結びつき | connection
2ドル | 2 dollars
携帯 | cell phone
色 | color
いろ | colour
iro | iro
irohaaoudesu | The color is blue.

What should the answers be to the following, in your opinion?

canTranslate("ja", "en");           // Presumably this should work

canTranslate("ja", "en-US");        // "color" (like 色)
canTranslate("ja", "en-GB");        // "colour" (like いろ); "mobile phone" instead of "cell phone"
canTranslate("ja", "en-SG");        // "2 dollar" instead of "2 dollars"
canTranslate("ja", "en-150");       // "mobile" instead of "cell phone"

canTranslate("ja", "en-GB-oed");    // I think this would require 硐び぀き => "connexion"

canTranslate("ja", "en-Latn");      // Should this work?
canTranslate("ja", "en-Brai");      // Presumably should not work
canTranslate("ja", "en-Dsrt");      // Presumably should not work

canTranslate("ja", "en-x-pirate");  // Presumably should not work, unless we blanket grant x-?
canTranslate("ja", "en-x-lolcat");  // Presumably should not work, unless we blanket grant x-?

// Various unknown subtags cases, how should these work?
canTranslate("ja", "en-asdf");
canTranslate("ja", "en-x-asdf");
canTranslate("ja", "en-US-asdf");
canTranslate("ja", "en-US-x-asdf");
canTranslate("ja", "en-asdf-asdf");

canTranslate("ja-JP", "en");        // Presumably this should work
canTranslate("ja-JP-Jpan", "en");   // Should this work, or is it bad because of the Suppress-Script?
canTranslate("ja-JP-Hrkt", "en");   // Should this work? It seems to.
canTranslate("ja-Kana", "en");      // Should this work? It seems to.
canTranslate("ja-Latn", "en");      // Should this work? It did for "genkidesuka"/"irohaaoudesu" but not for "iro".

canTranslate("ja-Braille", "en");   // Presumably shouldn't work ("β ›β ‘β β Šβ …β Šβ ™β ‘β Žβ ₯⠅⠁" example)
canTranslate("ja-Bopo", "en");      // Presumably shouldn't work ("γ„γ„£γ„Žγ„§ γ„‰γ„œγ„™γ„¨ γ„Žγ„šοΌŸ" example)
canTranslate("ja-Dsrt", "en");      // Presumably shouldn't work ("π˜π‡π€π†π—π†π”π‡ππŠπ—π€" example)

// Using the rarely-used jpx "collection" tag; should it work?
canTranslate("jpx-ja", "en");
canTranslate("jpx-Jpan", "en");

// Unusual/unknown subtag cases; how should they work?
canTranslate("ja-KR", "en");
canTranslate("ja-US", "en");
canTranslate("ja-asdf", "en");
canTranslate("ja-Jpan-JP-x-osaka", "en");
canTranslate("ja-JP-u-ca-japanese", "en");
canTranslate("ja-x-kansai", "en");
canTranslate("ja-JP-u-sd-jpjp", "en");

If you think there's a clear algorithm that resolves a lot of these cases, feel free to suggest that instead of answering each one.

For the source languages in your examples, all of the ja tags match whatever ja-* tagged language arcs are installed.

The longer tags present some questions. (Note that ja-JP-Jpan etc. should be ja-Jpan-JP etc., and that ja-Braille is not valid; presumably ja-Brai is intended.)

If the user specifies a regional variation on the source side, they might want the tag to fall back when matching (that is, use BCP47 Lookup), because the source language is not visible in the output and because translation engines are usually less sensitive to linguistic variations. If the text is written in a non-default script, the translation engine might prefer that the text be transliterated or might (as in the Deseret example) not know what to do with it and pass it through. In either case, there is no harm in "losing" the distinction found in tags like ja-KR or ja-Jpan-JP-x-osaka in order to find the ja=>en engine.

Suppress-Script tags can interfere with matching when matching is done by strict string comparison of the tags. That is, the range ja-Jpan-JP does not match the tag ja-JP because the range is not a prefix of the tag. A range like ja-Latn provides valuable information to the translation engine, but the engine would have to decide whether to do something special with that information.

Private use sequences (starting with -x-) are usually default-ignorable. Implementations could decide to support specific private use sequences, of course. The -u and -t extensions are also probably ignorable, although -t would give the translation engine a lot of information about the transformation (transliteration of the script) previously applied. On the source side, if you used Lookup, all of your tags would work, even the incorrect ones.
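
As an aside, Intl.Locale already provides the pieces for this kind of source-side normalization; a rough sketch (the helper name is made up):

function normalizeSource(tag) {
  // baseName drops the -u-/-t-/-x- sequences (ja-JP-u-ca-japanese => ja-JP);
  // maximize() then fills in the likely script and region, so a range such as
  // ja-Jpan-JP can still match a plain ja-JP request despite Suppress-Script.
  const base = new Intl.Locale(tag).baseName;
  return new Intl.Locale(base).maximize().toString();
}

normalizeSource("ja-JP-u-ca-japanese"); // "ja-Jpan-JP"
normalizeSource("ja-x-kansai");         // "ja-Jpan-JP"
normalizeSource("ja-KR");               // "ja-Jpan-KR" (region kept, script filled in)
normalizeSource("ja-Latn");             // "ja-Latn-JP" (an explicit script is preserved)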

On the target side, there is some question in my mind about what canTranslate means. The language arc ja=>en produces output that a speaker of en-SG, en-GB-oxendict (en-GB-oed is grandfathered and deprecated) or en-asdf-asdf would understand. So, if installed, the result should be yes (er, readily 😃).

On the other hand, as your examples point out, the additional subtags represent variations that the user might want: US vs. UK spelling variation, or UK vs. OED spelling variation (one variation that oxendict implies is that internationalisation is spelled with a z, i.e. internationalization).

This suggests that script or region subtags (and maybe variants) in the user's specified range should not be ignored. Even if the ja=>en arc can process a request like ja=>en-Brai, it might reasonably reject it, not being able to produce the required transliteration. Locale variation, such as your en-SG example (I do not agree with your expected output), might be applied for formattable values like times, dates, numbers, and such, but might or might not affect orthographic variation. The list of regional subtags is likely to exceed the range of available variation. Using ja as the example, ja-JP is almost certainly available, but ja-CR (Costa Rica) probably has no meaning?

From a standards perspective, could we say that it is implementation defined how the matching takes place (implementation here meaning "of the translation engine", not the API)? Google Translate can decide whether it can "readily" handle a given tag as output or not and the answer might vary depending on the specific language arc.

From a standards perspective, could we say that it is implementation defined how the matching takes place (implementation here meaning "of the translation engine", not the API)? Google Translate can decide whether it can "readily" handle a given tag as output or not and the answer might vary depending on the specific language arc.

There will definitely have to be some implementation-definedness in the standard, simply because we can't pin down what the capabilities of each implementation will be. But I'd like to give some guidance, probably in the spec. Because, speaking just as a Chromium engineer, I need to know what we should make our API return! Our default course of action was the one originally mentioned in the explainer (~exact matching), which you said doesn't make sense.

So summarizing your answers above, I'm getting the following:

  • On the source side
    • Ignore all -x- private use sequences
    • Ignore all region subtags
    • Unclear how to handle script subtags (ja-Latn vs. ja-Brai vs. ja-Dsrt vs. ja-Jpan)
    • Unclear how to handle collection tags (jpx-ja, jpx-Jpan)
    • Unclear how to handle variant subtags (I didn't give any examples on the source side)
    • Unclear how to handle extension subtags (ja-JP-u-ca-japanese, ja-JP-u-sd-jpjp)
  • On the target side: it's unclear whether we should go permissive or restrictive.
    • Probably we should ignore private use sequences no matter what?
    • If permissive: probably ignore region subtags, variant subtags, and script subtags?
    • If restrictive: allow only certain region subtags (e.g. ja-JP but not ja-CR); disallow variant subtags; allow only certain script subtags?

I'm unsure whether BCP 47 Lookup or Filtering plays into any of the above suggestions.
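
To make the two options concrete, here is roughly what I imagine each would look like for the target side of the hypothetical ja => en model. The helper names are made up, variant subtags are not handled, and a real implementation would probably allow a set of regions rather than exactly one:

const arcTarget = new Intl.Locale("en-Latn-US"); // what the model actually produces

function canProduceTargetPermissive(requested) {
  // Only the primary language subtag matters; everything else is ignored.
  return new Intl.Locale(requested).language === arcTarget.language;
}

function canProduceTargetRestrictive(requested) {
  // Drop extensions/private-use, fill in likely subtags, then require that any
  // script or region the developer asked for is one the model can produce.
  const req = new Intl.Locale(new Intl.Locale(requested).baseName).maximize();
  return req.language === arcTarget.language &&
         req.script === arcTarget.script &&  // rejects en-Brai, en-Dsrt
         req.region === arcTarget.region;    // accepts en-US (and bare en), rejects en-SG
}

canProduceTargetPermissive("en-GB");    // true
canProduceTargetRestrictive("en-GB");   // false
canProduceTargetRestrictive("en-US");   // true
canProduceTargetRestrictive("en");      // true (likely subtags give en-Latn-US)
canProduceTargetRestrictive("en-Brai"); // false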

I'd appreciate any help in fleshing this out. In particular it might be helpful to stay focused on just the specific example I gave. Which of the (source, target) pairs that I gave should work, given the demonstrated capabilities of the Google Translate Japanese-to-English model? Which should not? Eventually we'll try to extract those out into wider guidance for implementers.

But at the risk of side-tracking us, let me just illustrate how other non-web APIs seem to work, which is similar to the model you said doesn't make sense. They have static lists of "supported languages", which are specific strings. E.g.: Azure, Google Cloud, DeepL. Sometimes (as is the case with DeepL) they have different source and target lists. In most cases the languages are simple two- or three-letter codes, but there are often some subtags used: e.g.

  • Azure: zh-Hans + zh-Hant, fr-ca, iu + iu-Latn, mn-Cyrl + mn-Mong, ku-arab + ku + ku-latn, more.
  • Google Cloud: zh-CN + zh-TW.
  • DeepL: source languages are all two-letter codes; target languages include en-GB + en-US (en is deprecated); pt-BR + pt-PT (pt is deprecated); zh-Hans + zh-Hant (zh is deprecated).
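
The entire matching model behind those lists boils down to something like this (codes abridged from the DeepL target list described above):

const supportedTargets = ["en-GB", "en-US", "pt-BR", "pt-PT", "zh-Hans", "zh-Hant", /* ... */];

// Exact string membership; no Lookup, no Filtering, no subtag insertion.
function canTranslateTo(code) {
  return supportedTargets.includes(code);
}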

I get that on the web we're holding ourselves to higher standards for API design. But dang, this is just so simple. And a lot of developers are using such APIs today. If we want something more complicated, I and all the other implementers need some help figuring it out...