thammegowda/mtdata

Add Samanantar datasets.

BrightXiaoHan opened this issue · 3 comments

Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus contains 49.6M sentence pairs between English and Indian languages.

https://ai4bharat.iitm.ac.in/samanantar

  • Related #119
  • We already had it (#34), but they changed the links.

@BrightXiaoHan Thanks for creating this issue. If this is urgent, could you please update this link to v0.3 (or the newest version) from https://ai4bharat.iitm.ac.in/samanantar

BASE_v0_2 = 'https://storage.googleapis.com/samanantar-public/V0.2/data/{dirname}/{pair}.zip'

and test whether it works. Thanks!
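For testing, something like the rough sketch below might help. It fills in the template and sends a HEAD request to see whether the resulting archive URL still resolves. The helper names `build_url` and `is_reachable` are made up for illustration and are not part of mtdata, and the `dirname`/`pair` values you pass in are your own guesses at the bucket layout, so check them against the Samanantar download page.

```python
import urllib.request

# Template copied from the comment above; swap in the newer version path when testing.
BASE_v0_2 = 'https://storage.googleapis.com/samanantar-public/V0.2/data/{dirname}/{pair}.zip'


def build_url(template, dirname, pair):
    """Fill the two placeholders in the URL template (hypothetical helper)."""
    return template.format(dirname=dirname, pair=pair)


def is_reachable(url, timeout=10.0):
    """Return True if a HEAD request to `url` gets a 2xx response (hypothetical helper)."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

For example, `is_reachable(build_url(BASE_v0_2, 'en2indic', 'en-hi'))` would check a single language pair; `'en2indic'` and `'en-hi'` are assumed values here, not confirmed directory or file names.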

Thanks for your reply. I will try to test it.