This project is a Cantonese-English parallel corpus extracted from the ABC Cantonese-English Comprehensive Dictionary. It consists of around 14,000 sentences. The aim of this project is to provide high-quality parallel data for developing Cantonese-English translation models and thereby advance Cantonese NLP research.
Due to copyright issues, the extracted corpus cannot be published directly on GitHub. You can obtain it in either of the following two ways.
The parallel corpus is saved in two files, yue.txt and en.txt.
To download the files, simply use the following commands:
gdown 1WJ7bWgIhus-geMqwWoyt_POalgrJxuwj # yue.txt
gdown 1XbO6POEbjeiYuIZe_SN9ECv571IRyz2T # en.txt
You can also download the title list (titles.txt) and the raw data (Wenlin+Dictionaries-20221101051901.xml), which are intermediate results of the build process.
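Once yue.txt and en.txt have been downloaded, they can be read into sentence pairs. The following is a minimal sketch, not part of this repository: the function name `load_parallel` is hypothetical, and it assumes the two files are UTF-8 encoded and line-aligned, so that line i of yue.txt is the translation of line i of en.txt.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_parallel(yue_path="yue.txt", en_path="en.txt"):
    """Return a list of (Cantonese, English) sentence pairs.

    Assumption: both files have one sentence per line and are line-aligned.
    """
    yue_lines = read_lines(yue_path)
    en_lines = read_lines(en_path)
    # A length mismatch would mean the line-alignment assumption does not hold.
    if len(yue_lines) != len(en_lines):
        raise ValueError(
            f"Line counts differ: {len(yue_lines)} vs {len(en_lines)}"
        )
    return list(zip(yue_lines, en_lines))
```

Calling `load_parallel()` from the directory that contains the two files should return one tuple per sentence pair. The explicit length check is there so that a misaligned download fails loudly instead of silently pairing the wrong sentences.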
This repository provides scripts for building the corpus from source. Re-running these scripts produces the latest version of the corpus, which will differ from the one provided above. If you use this corpus in your research, please use the version provided above where possible.
Steps to build the corpus from source:
- Register an account on the Wenlin Dictionaries Wiki;
- Edit scrape.py to add your credentials for the Wenlin Dictionaries Wiki;
- Run scrape.py to get a list of the titles of all pages under the Jyut category. The result is written to titles.txt;
- Go to the export page to export all the data to an XML file;
- Run extract.py to build the corpus;
- Manually validate the build results.
1. Modification of the selection of Chinese characters
I replace certain characters with the forms used in the modern Hong Kong convention or the words.hk convention, so that the corpus reflects the character choices that Hong Kong people generally accept. For example:
- 床 -> 牀
- 著 -> 着, as in 着衫
- 𡃶 -> 錫, as in 錫佢一啖
- 𧨾 -> 氹, as in 氹阿媽開心
- 𧵳 -> 蝕, as in 生意蝕本
- 杧 -> 芒, as in 芒果
- 𠶧 -> 掂, as in 橫掂
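The substitutions listed above can be sketched as a character-level mapping. This is an illustration only and not necessarily how extract.py does it: the names `CHAR_MAP` and `normalize_chars` are hypothetical, the table contains only the seven examples given here, and it assumes each replacement is safe in every context, which may not hold for a character such as 著.

```python
# Mapping built from the examples listed above; the real list may be longer.
# Assumption: each replacement applies regardless of the surrounding context.
CHAR_MAP = str.maketrans({
    "床": "牀",
    "著": "着",
    "𡃶": "錫",
    "𧨾": "氹",
    "𧵳": "蝕",
    "杧": "芒",
    "𠶧": "掂",
})


def normalize_chars(sentence: str) -> str:
    """Replace each mapped character with its Hong Kong convention form."""
    return sentence.translate(CHAR_MAP)
```

`str.translate` works on Unicode code points, so characters outside the Basic Multilingual Plane, such as 𡃶, are handled the same way as any other character.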
2. Add full stops at the end of the sentences
The original dictionary omits the full stop at the end of declarative sentences, in both the Cantonese and the English text. This can be confusing, because both languages normally mark the end of a declarative sentence with a full stop, so I add one.
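A rough sketch of this step is shown below. It is an assumption about how such a rule could look, not the code in extract.py: the helper `add_full_stop` and the `TERMINAL` punctuation set are hypothetical, and any sentence that does not already end in one of those marks is treated as declarative.

```python
# Punctuation that already terminates a sentence (halfwidth and fullwidth).
# Assumption: this set is sufficient for the sentences in the corpus.
TERMINAL = ".!?…。!?"


def add_full_stop(sentence: str, stop: str) -> str:
    """Append `stop` unless the sentence already ends in terminal punctuation."""
    sentence = sentence.rstrip()
    if sentence and sentence[-1] not in TERMINAL:
        return sentence + stop
    return sentence
```

The full stop is passed in as a parameter so that the Cantonese side can use 。 and the English side can use a plain period.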
3. Remove non-informative spaces
I remove all the spaces between Chinese characters and English letters, as well as the spaces between Chinese characters and digits. Spaces between two English words are not removed.
For example, the space in the following sentence is removed:
呢場戲NG 咗兩次
While the space in this sentence is not:
佢積極keep fit,身材好咗好多。
This is because spaces between Chinese characters and English letters or digits do not affect how a sentence is understood, and they can be removed or re-inserted by simple rules when needed.
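The space-removal rule can be sketched with a regular expression. This is a minimal illustration under stated assumptions rather than the actual implementation: the names `CJK`, `BOUNDARY_SPACE`, and `remove_boundary_spaces` are hypothetical, "Chinese character" is approximated by a few Unicode CJK ideograph ranges, and "English letters" by ASCII letters.

```python
import re

# Approximate set of Chinese characters: CJK Extension A, the main CJK block,
# and the supplementary ideographic planes. This range choice is an assumption.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0003134f"

# Spaces with a Chinese character on one side and an ASCII letter or digit
# on the other side, in either order.
BOUNDARY_SPACE = re.compile(
    rf"(?<=[{CJK}]) +(?=[A-Za-z0-9])|(?<=[A-Za-z0-9]) +(?=[{CJK}])"
)


def remove_boundary_spaces(sentence: str) -> str:
    """Delete spaces between Chinese characters and ASCII letters or digits."""
    return BOUNDARY_SPACE.sub("", sentence)
```

Because the pattern only matches spaces at a Chinese/ASCII boundary, a space between two English words, as in "keep fit", is left untouched.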
4. TODO: ...