Docs could make it clearer what file to get
tbm commented
The docs could make it clearer which XML file is needed.
For Cebuano, I see two options:
- cebwiki-latest-pages-articles-multistream.xml.bz2
- cebwiki-latest-pages-articles.xml.bz2
The instructions say:
wget -np -r --accept-regex 'https:\/\/dumps\.wikimedia\.org\/enwiki\/latest\/enwiki-latest-pages-articles[0-9]+\..*'
which suggests that the multistream one is wrong and I need the normal one. However, your regex won't match the normal file either, since its name has no digit after "articles".
I guess enwiki-latest-pages-articles[0-9]*\.xml.bz2 might be better, with a note to replace en with whatever language you want.
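To double-check the claim, here is a rough sketch that runs both patterns against the two Cebuano filenames listed above. It uses Python's re module as a stand-in for wget's regex matching, which is an assumption (wget uses POSIX regexes by default, though these simple patterns should behave the same). The names ORIGINAL, SUGGESTED and matching are made up for this illustration, and only the filename part of the pattern is tested, not the full URL.

```python
import re

# The two Cebuano dump filenames mentioned in this issue.
FILES = [
    "cebwiki-latest-pages-articles-multistream.xml.bz2",
    "cebwiki-latest-pages-articles.xml.bz2",
]

# Filename part of the pattern from the instructions (requires a digit).
ORIGINAL = r"{lang}wiki-latest-pages-articles[0-9]+\..*"
# Pattern proposed in this issue (digit is optional).
SUGGESTED = r"{lang}wiki-latest-pages-articles[0-9]*\.xml.bz2"


def matching(template, lang, names):
    """Return the names that the pattern accepts for the given language code.

    Hypothetical helper: substitutes the language code into the template
    and searches each name, approximating an unanchored regex match.
    """
    pattern = re.compile(template.format(lang=lang))
    return [name for name in names if pattern.search(name)]


if __name__ == "__main__":
    print("original: ", matching(ORIGINAL, "ceb", FILES))
    print("suggested:", matching(SUGGESTED, "ceb", FILES))
```

With these inputs the original pattern accepts neither file, while the suggested one accepts only cebwiki-latest-pages-articles.xml.bz2 and skips the multistream dump.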