attardi/wikiextractor

KeyError: u' ' when using --html

Closed this issue · 4 comments

Bubu commented

When running WikiExtractor.py --no-templates --html en_wiki_500.xml,
where en_wiki_500.xml is roughly the first 500 MB of the full English Wikipedia dump, I get six errors like this one:

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2494, in extract_process
    Extractor(*job[:3]).extract(out)  # (id, title, page)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 442, in extract
    for line in compact(text):
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2122, in compact
    page.append(listItem[n] % line)
KeyError: u' '
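
For anyone reading the traceback: the failing expression `listItem[n] % line` most plausibly raises because `n` is a space character that is not a key in `listItem`. The sketch below reproduces that failure mode and shows one defensive way to handle it. The table contents and the `render_list_line` helper are assumptions made for illustration, not the actual code in WikiExtractor.py or the fix that was committed.

```python
# Hypothetical stand-in for the extractor's list-marker table; the real
# contents of listItem in WikiExtractor.py may differ.
LIST_ITEM = {
    "*": "<li>%s</li>",
    "#": "<li>%s</li>",
    ";": "<dt>%s</dt>",
    ":": "<dd>%s</dd>",
}


def render_list_line(marker, line):
    """Wrap `line` in the HTML template for `marker`.

    Unknown markers (for example a stray space) leave the line unchanged
    instead of raising KeyError, which is what a direct LIST_ITEM[marker]
    lookup would do.
    """
    template = LIST_ITEM.get(marker)
    if template is None:
        return line
    return template % line


def direct_lookup_fails(marker):
    """Return True if a plain dict lookup raises KeyError for `marker`."""
    try:
        LIST_ITEM[marker]
    except KeyError:
        return True
    return False
```

Whether unknown markers should be passed through, skipped, or logged is a design choice for the extractor; passing the line through is only the simplest option.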

Fixed.

I have the same problem with the newest build.

Which command are you using, and on which version of Wikipedia?

python WikiExtractor.py --processes 2 -b 10M --html -l --no-templates wiki.xml

with the latest version of Wikipedia.

This is the link to the XML file: http://burnbit.com/torrent/417422/enwiki_20151102_pages_articles_xml_bz2