KeyError: u' ' when using --html
Closed this issue · 4 comments
Bubu commented
When running
WikiExtractor.py --no-templates --html en_wiki_500.xml
where en_wiki_500.xml is roughly the first 500 MB of the full English Wikipedia dump, I get 6 errors like this one:
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2494, in extract_process
    Extractor(*job[:3]).extract(out)  # (id, title, page)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 442, in extract
    for line in compact(text):
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2122, in compact
    page.append(listItem[n] % line)
KeyError: u' '
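My reading of the traceback, as a minimal sketch rather than the actual WikiExtractor code: the failing expression is listItem[n] % line, and the KeyError most plausibly comes from the listItem[n] lookup, with n being a space character that is not a known list marker. The table contents and the names LIST_ITEM, render_item, and reproduce below are assumptions for illustration only. The plain-text fallback is one possible guard, not necessarily the fix that was applied.

```python
# Hypothetical stand-in for the lookup table used in compact();
# the real table in WikiExtractor.py may have different contents.
LIST_ITEM = {
    "*": "<li>%s</li>",
    "#": "<li>%s</li>",
    ";": "<dt>%s</dt>",
    ":": "<dd>%s</dd>",
}


def reproduce():
    """Trigger the same kind of KeyError and return the offending key."""
    try:
        LIST_ITEM[" "] % "item"  # " " is not a list marker in the table
    except KeyError as exc:
        return exc.args[0]
    return None


def render_item(marker, text):
    """Format one list line; unknown markers fall back to the plain text."""
    template = LIST_ITEM.get(marker)
    if template is None:
        return text
    return template % text


if __name__ == "__main__":
    print(repr(reproduce()))
    print(render_item("*", "first"))
    print(render_item(" ", "indented text"))
```

Using dict.get keeps one malformed line from killing the whole worker process, at the cost of silently passing unrecognized markup through as text.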
attardi commented
Fixed.
abukva commented
I have the same problem with the newest build.
attardi commented
Which command are you using on which version of Wikipedia?
abukva commented
python WikiExtractor.py --processes 2 -b 10M --html -l --no-templates wiki.xml
with the latest version of Wikipedia.
This is the link to the XML file: http://burnbit.com/torrent/417422/enwiki_20151102_pages_articles_xml_bz2