KeyError: u' ' when using --html

Question

Closed this issue 9 years ago · 4 comments

When running WikiExtractor.py --no-templates --html en_wiki_500.xml,
where en_wiki_500.xml is roughly the first 500 MB of the full English Wikipedia dump, I get 6 errors like this one:

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2494, in extract_process
    Extractor(*job[:3]).extract(out)  # (id, title, page)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 442, in extract
    for line in compact(text):
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2122, in compact
    page.append(listItem[n] % line)
KeyError: u' '
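
The last frame shows that listItem is indexed with n and that the lookup fails for the key u' ', so the marker character reaching that line is a space with no entry in the mapping. Below is a minimal sketch of that failure mode and of one defensive lookup. The LIST_ITEM table, render_item, and reproduce_key_error are hypothetical stand-ins written for illustration; they are not the code in WikiExtractor.py and not necessarily the fix that was applied.

```python
# Hypothetical stand-in for the listItem mapping used in compact();
# the real table in WikiExtractor.py may contain different entries.
LIST_ITEM = {
    "*": "<li>%s</li>",
    "#": "<li>%s</li>",
    ";": "<dt>%s</dt>",
    ":": "<dd>%s</dd>",
}


def reproduce_key_error():
    """Index the mapping with a space, as in the traceback, and return the missing key."""
    try:
        LIST_ITEM[" "] % "some text"
    except KeyError as exc:
        return exc.args[0]
    return None


def render_item(marker, line):
    """Format a list line as HTML, leaving the text unchanged for unknown markers."""
    template = LIST_ITEM.get(marker)
    if template is None:
        # Assumption: an unrecognised marker is treated as plain text.
        return line
    return template % line


if __name__ == "__main__":
    print(repr(reproduce_key_error()))
    print(render_item("*", "first item"))
    print(render_item(" ", "indented text"))
```

Using dict.get here trades the crash for a silent fallback, which may hide malformed markup; logging the unknown marker would be a reasonable alternative.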

attardi commented 9 years ago

Fixed.

Answer 1 · 2015-11-28T02:38:11.000Z

I have the same problem with the newest build.

Answer 2 · 2015-11-28T08:24:49.000Z

Which command are you using on which version of Wikipedia?

Answer 3 · 2015-11-28T12:18:43.000Z

python WikiExtractor.py --processes 2 -b 10M --html -l --no-templates wiki.xml

I am running it on the latest version of Wikipedia.

This is the link to the XML file: http://burnbit.com/torrent/417422/enwiki_20151102_pages_articles_xml_bz2