IlyaSemenov/wikipedia-word-frequency

Docs could make it clearer what file to get


tbm commented

The docs could make it clearer which XML file is needed.

For Cebuano, I see two options:

  • cebwiki-latest-pages-articles-multistream.xml.bz2
  • cebwiki-latest-pages-articles.xml.bz2

The instructions say:

wget -np -r --accept-regex 'https:\/\/dumps\.wikimedia\.org\/enwiki\/latest\/enwiki-latest-pages-articles[0-9]+\..*'

which suggests that the multistream file is the wrong one and that I need the normal one. However, your regex won't match the normal file, since its name has no digit after "pages-articles".

I guess enwiki-latest-pages-articles[0-9]*\.xml.bz2 might be better (making the digits optional), with a note to replace "en" with the code of whatever language you want.
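
To make the difference concrete, here is a minimal Python sketch that checks both patterns against the file names above. It is only an illustration: the names DOCS_PATTERN, SUGGESTED_PATTERN and matches are made up for this example, the "{lang}" placeholder is my own addition, and the numbered file name is hypothetical. It also assumes that an unanchored search on the bare file name behaves like wget's matching, which I have not verified.

```python
import re

# Pattern from the instructions (file-name part only), with "en" replaced
# by a "{lang}" placeholder. The placeholder is an assumption of this sketch.
DOCS_PATTERN = r"{lang}wiki-latest-pages-articles[0-9]+\..*"

# Pattern suggested in this issue: the digits become optional.
SUGGESTED_PATTERN = r"{lang}wiki-latest-pages-articles[0-9]*\.xml.bz2"


def matches(template, lang, filename):
    """Return True if the pattern built for `lang` is found in `filename`."""
    pattern = template.replace("{lang}", re.escape(lang))
    # Unanchored search; assumed (not verified) to mirror wget's behaviour.
    return re.search(pattern, filename) is not None


if __name__ == "__main__":
    candidates = [
        ("ceb", "cebwiki-latest-pages-articles.xml.bz2"),
        ("ceb", "cebwiki-latest-pages-articles-multistream.xml.bz2"),
        # Hypothetical numbered file name, only to exercise the digit branch.
        ("en", "enwiki-latest-pages-articles1.xml.bz2"),
    ]
    for lang, name in candidates:
        print(
            name,
            "docs:", matches(DOCS_PATTERN, lang, name),
            "suggested:", matches(SUGGESTED_PATTERN, lang, name),
        )
```

With the digits optional, the plain Cebuano file is accepted while the multistream one is still rejected, because "-multistream" sits between "articles" and ".xml".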