facebookresearch/LASER

Problem with wet_lines

vmenan opened this issue · 4 comments

Hi,
I am a researcher working on low-resource languages native to Sri Lanka (Sinhala and Tamil). The NLLB mined dataset is an excellent starting point for us, so I am following the instructions provided on how to download the mined dataset using the metadata provided here. The issue I'm facing is that the metadata contains data from ParaCrawl as well, but the scripts and instructions provided work only for Common Crawl data. Am I going wrong in how I obtain the mined data from NLLB200?

Hi @vmenan, an easier entry point to the mined data might be here. Hopefully this helps!

@heffernankevin you are a lifesaver! I was struggling with the download for a week. Wow, this really helps. Thank you so much! I'm wondering why it wasn't mentioned here?

No problem! Will make a TODO to add this.

vmenan commented

That's great to hear. Once again, thank you so much for your help! Appreciate it! Also, props to the FAIR research team for open-sourcing their excellent work to the community. Thank you!