zNLP : Identifying parallel sentences in Chinese-English comparable corpora for the BUCC 2017 Shared Task
```
.
├── code
└── data
    ├── bucc2017
    │   ├── test_data
    │   │   ├── zh-en.test.en
    │   │   └── zh-en.test.zh
    │   └── training_data
    │       ├── zh-en.training.en
    │       ├── zh-en.training.gold
    │       └── zh-en.training.zh
    ├── dictionaries
    ├── stopwords
    └── temp_data
        └── classifier
            ├── test
            └── training
```
`test_data` and `training_data` can be downloaded from the BUCC 2017 Shared Task web page. As for the dictionaries, I use CC-CEDICT and the restricted Chinese-English Translation Lexicon Version 3.0 [LDC2002L27] (Huang et al., 2002) to generate Chinese-English dictionaries (the generating functions are provided in the `ChineseEnglishDictionary` class of `chinese_corpus_translator.py`). You can also generate your own dictionaries from other resources; in that case, don't forget to configure the dictionary paths in `config.ini`.
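CC-CEDICT is distributed as a plain-text file with one entry per line in the form `Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/`. The following is a minimal sketch of turning such a file into a simplified-Chinese-to-English dictionary; `parse_cedict_line` and `build_zh_en_dict` are hypothetical helpers written for illustration, not the repo's actual `ChineseEnglishDictionary` methods.

```python
import re

def parse_cedict_line(line):
    """Parse one CC-CEDICT line; return (simplified, [english glosses]) or None.

    Comment lines (starting with '#') and blank lines are skipped. The line
    format assumed here is the standard CC-CEDICT one:
        Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
    """
    if not line.strip() or line.startswith("#"):
        return None
    m = re.match(r"^(\S+) (\S+) \[[^\]]*\] /(.+)/\s*$", line)
    if m is None:
        return None
    simplified = m.group(2)
    glosses = m.group(3).split("/")
    return simplified, glosses

def build_zh_en_dict(lines):
    """Accumulate a simplified-Chinese -> list-of-English-glosses mapping."""
    zh_en = {}
    for line in lines:
        parsed = parse_cedict_line(line)
        if parsed is not None:
            zh_en.setdefault(parsed[0], []).extend(parsed[1])
    return zh_en
```

In practice you would pass `open(path, encoding="utf-8")` to `build_zh_en_dict` and merge the result with entries generated from the LDC lexicon before writing the combined dictionary to the path configured in `config.ini`.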
Precision | Recall | F1-score | Remark
---|---|---|---
0.4242 | 0.4441 | 0.4339 | Baseline (first public version for paper review)
0.4247 | 0.4815 | 0.4513 | Refactored from functional programming to OOP; bug fixes
0.4293 | 0.5348 | 0.4763 | `solr_topN` changed from 3 to 1; removed the `Solr_index` feature (it is always 1); new overlap function
0.4370 | 0.5506 | 0.4873 | Independent corpus for overlap calculation: search-engine Chinese tokenizer mode (full mode for the Solr searching corpus), English stop-word removal and stemming
0.6542 | 0.4441 | 0.5291 | New SVM parameters (`class_weight` changed from 1:8 to 1:3, `C` from 1.0 to 10.0)
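The overlap feature in the rows above compares the dictionary-translated Chinese side against the English side after stop-word removal. A minimal stdlib-only sketch of such a score is below; the exact formula the repo uses is not documented here, so `overlap_score` and its normalization are assumptions for illustration.

```python
def overlap_score(zh_translations, en_tokens, stopwords=frozenset()):
    """Hypothetical lexical-overlap feature for one candidate sentence pair.

    zh_translations: English glosses obtained by mapping the tokenized Chinese
                     sentence through a zh->en dictionary.
    en_tokens:       tokens of the English sentence.
    Returns the fraction of English content tokens covered by the translated
    Chinese side (0.0 when the English side has no content tokens).
    """
    en_content = {t.lower() for t in en_tokens} - stopwords
    zh_side = {t.lower() for t in zh_translations}
    if not en_content:
        return 0.0
    return len(en_content & zh_side) / len(en_content)
```

With this shape, switching the Chinese tokenizer mode or the stemmer only changes how `zh_translations` and `en_tokens` are produced, not the feature itself.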
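The last row's precision jump comes purely from reweighting the classifier. A hedged sketch, assuming the SVM is scikit-learn's `SVC` (the `class_weight` and `C` parameter names suggest this, but the repo's exact setup, kernel, and features are not shown here); the toy 2-D feature vectors below are invented stand-ins for the real sentence-pair features.

```python
from sklearn.svm import SVC

# Tuned parameters from the results table: the positive (parallel) class is
# weighted 3x instead of 8x, and the regularization C is raised to 10.0.
clf = SVC(C=10.0, class_weight={0: 1, 1: 3})

# Toy training data: two clearly non-parallel pairs, two clearly parallel ones.
X = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]
clf.fit(X, y)
```

Lowering the positive-class weight trades recall for precision, which matches the table: recall falls back to the baseline's 0.4441 while precision rises from 0.4370 to 0.6542.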