A tool for generating a massive parallel corpus from Wikidata.
- Extract descriptions
python extract_top_desc.py
- Generate parallel triples of the form (center_sent, pos_sent, neg_sent).
By default, each run randomly picks three languages to generate triples from. To fix the center language, set the constant FIXED_CENTER_LANG inside gen_train.py to a specific language code, e.g., "en".
python gen_train.py
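
To make the triple format concrete, here is a minimal sketch of how one (center_sent, pos_sent, neg_sent) triple could be assembled. It is not the code in gen_train.py. It assumes that pos_sent is the same entity's description in a second language and that neg_sent is a different entity's description in a third language. The `DESCRIPTIONS` data, the `make_triple` function, and its `fixed_center_lang` parameter are all hypothetical names used for illustration only.

```python
import random

# Hypothetical toy data: entity IDs mapped to per-language descriptions.
# The real input comes from the descriptions extracted in the first step.
DESCRIPTIONS = {
    "E1": {"en": "planet", "de": "Planet", "fr": "planète"},
    "E2": {"en": "river", "de": "Fluss", "fr": "rivière"},
    "E3": {"en": "painter", "de": "Maler", "fr": "peintre"},
}


def make_triple(descriptions, rng, fixed_center_lang=None):
    """Return (center_sent, pos_sent, neg_sent), or None if no triple fits.

    Assumed semantics (not confirmed by the source): the positive is the
    same entity in another language, the negative is another entity.
    """
    langs = sorted({lang for d in descriptions.values() for lang in d})
    if fixed_center_lang is None:
        # Default behaviour: pick three distinct languages at random.
        center_lang, pos_lang, neg_lang = rng.sample(langs, 3)
    else:
        # Center language is pinned; pick the other two at random.
        center_lang = fixed_center_lang
        others = [lang for lang in langs if lang != center_lang]
        pos_lang, neg_lang = rng.sample(others, 2)

    # The anchor entity needs a description in both center and positive languages.
    anchors = [e for e, d in descriptions.items()
               if center_lang in d and pos_lang in d]
    if not anchors:
        return None
    anchor = rng.choice(anchors)

    # The negative must be a different entity described in the negative language.
    negatives = [e for e, d in descriptions.items()
                 if e != anchor and neg_lang in d]
    if not negatives:
        return None
    negative = rng.choice(negatives)

    return (descriptions[anchor][center_lang],
            descriptions[anchor][pos_lang],
            descriptions[negative][neg_lang])


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so that runs are repeatable
    print(make_triple(DESCRIPTIONS, rng))                          # random center language
    print(make_triple(DESCRIPTIONS, rng, fixed_center_lang="en"))  # center language fixed to "en"
```

Passing a seeded `random.Random` instance keeps the sampling repeatable, which helps when comparing corpora generated with and without a fixed center language.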