The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is not Cantonese at all

Question

ayaka14732 opened this issue a year ago · 1 comments

The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is completely wrong. It is not Cantonese at all, but Mandarin Chinese written in Traditional Chinese script (`zho_Hant`), differing only stylistically from the `zho_Hant` data already in the dataset.

Furthermore, the paper mentions that the `yue_Hant` and `zho_Hant` data tend to be predicted as each other. That is unsurprising, since both datasets actually consist exclusively of `zho_Hant` data. Genuine `yue_Hant` and `zho_Hant` text should be very easy to tell apart.

Here is what correct `yue_Hant` data would look like:

| Language Code | Sentence |
| --- | --- |
| `eng_Latn` | They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else. |
| `zho_Hant` | 他們發現太陽的運作與其他恆星的基本原理相同：系統中所有恆星的活動均受其光度、自轉所推動，就是這麼簡單。 |
| `yue_Hant` (wrong) | 他們發現，太陽和其他恆星的運行原理是一樣的：系統中所有恆星的活動都是由它們的亮度、自轉驅動的，而並非其他因素。 |
| `yue_Hant` (corrected) | 佢哋發現，太陽同其他恆星嘅運行原理冇分別：系統入面所有恆星嘅活動都淨係由佢哋嘅亮度同自轉推動，而唔包括其他因素。 |

(Bold denotes words that are used exclusively in yue_Hant)
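To illustrate how easy the two are to tell apart, here is a minimal sketch that counts a few characters and words typical of written Cantonese in the three sentences above. The marker list, the function names, and the `min_hits` threshold are my own assumptions for illustration; this is not how FLORES-200 or any existing classifier works, and a real filter would need a far larger vocabulary.

```python
# Hypothetical, hand-picked markers of written Cantonese (an assumption for
# illustration only; not an exhaustive or authoritative list).
CANTONESE_MARKERS = ("佢", "哋", "嘅", "冇", "唔", "淨係", "入面")


def count_cantonese_markers(sentence: str) -> int:
    """Count how many times any of the marker strings occurs in the sentence."""
    return sum(sentence.count(marker) for marker in CANTONESE_MARKERS)


def looks_cantonese(sentence: str, min_hits: int = 2) -> bool:
    """Crude heuristic: treat the sentence as Cantonese if it has enough markers."""
    return count_cantonese_markers(sentence) >= min_hits


# The three Chinese sentences from the table above.
samples = {
    "zho_Hant": "他們發現太陽的運作與其他恆星的基本原理相同：系統中所有恆星的活動均受其光度、自轉所推動，就是這麼簡單。",
    "yue_Hant (wrong)": "他們發現，太陽和其他恆星的運行原理是一樣的：系統中所有恆星的活動都是由它們的亮度、自轉驅動的，而並非其他因素。",
    "yue_Hant (corrected)": "佢哋發現，太陽同其他恆星嘅運行原理冇分別：系統入面所有恆星嘅活動都淨係由佢哋嘅亮度同自轉推動，而唔包括其他因素。",
}

for label, sentence in samples.items():
    hits = count_cantonese_markers(sentence)
    print(f"{label}: {hits} marker hits, looks Cantonese: {looks_cantonese(sentence)}")
```

On these samples, only the corrected sentence contains any of the markers; the current `yue_Hant` sentence contains none, just like the `zho_Hant` one. Words such as 同 are deliberately left out of the list because they also occur in Mandarin text.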

Answer 1 · 2023-06-01T16:33:40.000Z

Others have been complaining about this for a long time: https://twitter.com/chaakming/status/1555246138105614336

I guess nobody on the FLORES team knows Cantonese and Mandarin well enough to understand the unique situation of this language. The data currently collected for `yue` is Hong Kong Chinese, NOT Cantonese. We recommend using this classifier to filter for genuine Cantonese data: https://github.com/CanCLID/cantonese-classifier