Clarification for Chinese language variants

Question

raunaksinhacisco opened this issue 2 years ago · 1 comments

raunaksinhacisco commented 2 years ago

There are three variants of Chinese that are listed as supported languages in LASER -

Chinese (zh)
Yue Chinese (yue)
Wu Chinese (wuu)

On platforms like Google, Chinese is usually available in three forms -

Chinese (Simplified, China, zh-CN)
Chinese (Traditional, Taiwan, zh-TW)
Chinese (Traditional, Hong Kong, zh-HK)

Can someone please help me with the following questions -

Is Chinese (zh) used for both traditional and simplified Chinese?
Yue and Wu Chinese are primarily spoken dialects; the closest written form of Yue is Cantonese (denoted by Chinese (Traditional, Hong Kong, zh-HK)?). How are non-written dialects used in LASER?

Answer 1 · 2023-06-12T12:38:57.000Z

Hi @raunaksinhacisco! Which variant of Chinese the zh code covers depends on the training sets used, and unfortunately the zh language code is indeed underspecified. However, the more recent LASER3 encoders do have explicit support for both Simplified and Traditional Chinese! To embed text with these specific models, you can do the following:

Go to the LASER3 (NLLB) model page and download the Simplified and Traditional Chinese models using the following command: bash ./download_models.sh zho_Hans zho_Hant. NOTE: "zho_Hans" and "zho_Hant" are the FLORES-200 language codes for Simplified and Traditional Chinese respectively.
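If your data is tagged with platform-style locales such as zh-CN, zh-TW, or zh-HK, a small lookup can pick the matching FLORES-200 code before you download. The sketch below is only an illustration: the helper names (flores_code, download_command) are hypothetical and not part of LASER, and mapping zh-HK to zho_Hant is an assumption based purely on script (Traditional characters), not on any Cantonese-specific model.

```python
# Hypothetical helper, not part of LASER: maps platform locale tags
# to the FLORES-200 codes mentioned above.
LOCALE_TO_FLORES = {
    "zh-CN": "zho_Hans",  # Simplified Chinese
    "zh-TW": "zho_Hant",  # Traditional Chinese
    "zh-HK": "zho_Hant",  # ASSUMPTION: chosen by script only (Traditional)
}


def flores_code(locale: str) -> str:
    """Return the FLORES-200 code for a locale tag, or raise ValueError."""
    try:
        return LOCALE_TO_FLORES[locale]
    except KeyError:
        raise ValueError(f"no FLORES-200 mapping known for {locale!r}") from None


def download_command(locales) -> str:
    """Build the download command for the distinct codes the locales need."""
    codes = sorted({flores_code(loc) for loc in locales})
    return "bash ./download_models.sh " + " ".join(codes)


if __name__ == "__main__":
    print(download_command(["zh-CN", "zh-TW", "zh-HK"]))
```

Locales that share a script collapse to a single model, so the three tags above only require the two downloads already listed.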

Regarding Yue and Wu Chinese training sets, there are bitexts available from sources such as: https://opus.nlpl.eu/Tatoeba.php

Hope this helps!