attardi/wikiextractor

No articles get extracted

markdimi opened this issue · 2 comments

I have downloaded the English Wikipedia dump enwiki-20160305-pages-articles-multistream.xml.bz2 and installed WikiExtractor in a Debian VM.

When I run the extractor, I get 0 articles back and no errors:

WikiExtractor.py -b 250K -o extracted enwiki-20160305-pages-articles-multistream.xml.bz2

INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from enwiki-20160305-pages-articles-multistream.xml.bz2.
INFO: Using 1 extract processes.
INFO: Finished 1-process extraction of 0 articles in 0.1s (0.0 art/s)

You should use the pages dump:

http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
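A minimal sketch of the fix, reusing the same flags from the original command. The dump filename below assumes you fetch the latest non-multistream pages-articles dump from the URL above:

```shell
# Fetch the regular (non-multistream) pages-articles dump.
wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# Re-run the extractor with the same options as before,
# pointing it at the regular dump instead of the multistream one.
WikiExtractor.py -b 250K -o extracted enwiki-latest-pages-articles.xml.bz2
```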


I will try that. For now I have settled on another wiki parser that works with my file. Do you mind telling me the difference between the file I have and the one you are suggesting?
Thank you