CLIRMatrix

CLIRMatrix is available at:

http://www.cs.jhu.edu/~shuosun/clirmatrix/

Alternatively, CLIRMatrix is available in the following Google Drive folder:

https://drive.google.com/drive/folders/1V-DcBwvAnlVAYJw_gsx0zXV5VXJcRGGc?usp=sharing

Script to extract untruncated documents from Wikipedia dumps:

Usage:
    ./extract.sh [wikipedia language code]
E.g.
    ./extract.sh en
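
To process several languages in one go, the usage above can be wrapped in a small loop. This is only a sketch: the `extract_all` function and the `EXTRACT` variable are hypothetical names introduced here, not part of the repository, and it assumes `extract.sh` exits with a non-zero status on failure.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around extract.sh (not part of the repository).
# EXTRACT is an assumed variable naming the script to run; it defaults to ./extract.sh.
EXTRACT="${EXTRACT:-./extract.sh}"

# Run the extraction script once per Wikipedia language code given as an argument.
# Stops at the first failure, assuming the script signals errors via its exit status.
extract_all() {
    local lang
    for lang in "$@"; do
        echo "Extracting ${lang}..."
        "$EXTRACT" "$lang" || { echo "extract failed for ${lang}" >&2; return 1; }
    done
}
```

After sourcing the snippet, `extract_all en de ja` would run the script for English, German, and Japanese in turn.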

Reference

[1] Shuo Sun and Kevin Duh. CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).