Add Samanantar datasets.
BrightXiaoHan opened this issue · 3 comments
BrightXiaoHan commented
Samanantar is the largest publicly available collection of parallel corpora for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus has 49.6M sentence pairs between English and Indian languages.
thammegowda commented
@BrightXiaoHan Thanks for creating this issue. If this is urgent, could you please update the link at mtdata/mtdata/index/ai4bharat.py (line 17 in c57dab5) with v0.3 (or the newest version) from https://ai4bharat.iitm.ac.in/samanantar and test if it works? Thanks!
BrightXiaoHan commented
Thanks for your reply. I will try to test it.