Number of datasets not equal to those in WikiANN paper?
lewtun opened this issue · 2 comments
Hello,
First of all, thank you for creating and sharing your datasets with the community!
As far as I can tell, you provide 176 WikiANN datasets in your Amazon Cloud drive, and I was wondering why this isn't the 282 quoted in the Pan et al. paper?
Also, for some experiments (e.g. on Google Colab) it would be very convenient to download the datasets using curl / wget, but this doesn't seem to be possible with Amazon Drive. Do you happen to know if there's a way to download the datasets from the terminal?
Thanks!
Hi Lewtun,
Thanks for your interest in this.
The reason the language coverage differs from Pan et al.: the removed languages had very few examples, so it was not possible to create a reasonable train/dev/test split for them. The Pan et al. dataset is available online (linked in their paper), so you can download the extra languages and create your own train/dev/test splits if you are interested.
Thanks for the suggestion. I added a new link in the README that works with wget/curl.
Hi @afshinrahimi, thanks a lot for the information and for making the dataset downloadable from the terminal! I'll close this issue since it is now resolved 😄