[BERT/PyTorch] Extract Wiki Fails
Related to BERT/PyTorch
**Describe the bug**
Cannot extract Wikipedia after downloading.
WikiExtractor.err
INFO: Preprocessing '/workspace/bert/data/wikipedia/wikicorpus-en.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
.....
INFO: Preprocessed 21600000 pages
INFO: Preprocessed 21700000 pages
INFO: Preprocessed 21800000 pages
INFO: Loaded 731987 templates in 3178.9s
INFO: Starting page extraction from /workspace/bert/data/wikipedia/wikicorpus-en.xml.bz2.
INFO: Using 5 extract processes.
Process ForkProcess-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
text = clean(self, text, expand_templates=expand_templates,
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-4:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
text = clean(self, text, expand_templates=expand_templates,
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-3:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
text = clean(self, text, expand_templates=expand_templates,
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-6:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
text = clean(self, text, expand_templates=expand_templates,
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-5:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
Extractor(*job[:-1]).extract(out, html_safe) # (id, urlbase, title, page)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
text = self.clean_text(text, html_safe=html_safe)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
text = clean(self, text, expand_templates=expand_templates,
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
text = extractor.expandTemplates(text)
File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
return self.dispatch_line(frame)
File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
**To Reproduce**
Steps to reproduce the behavior:
1. bash scripts/docker/build.sh
2. bash scripts/docker/launch.sh
3. /workspace/bert/data/create_datasets_from_start.sh
**Expected behavior**
The Wikipedia dump is extracted for LDDL preprocessing.
I solved this by running `pip install wikiextractor==3.0.4`.
Version 3.0.6 doesn't work; 3.0.4 does.
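The version pin above can be checked before starting the long extraction run. This is a minimal sketch, assuming the package is installed under the distribution name `wikiextractor` and that 3.0.4 is the release to match, as reported in this thread. The helper names `installed_version` and `is_known_good` are hypothetical.

```python
from importlib import metadata

# Release reported in this thread as working; an assumption for other setups.
KNOWN_GOOD = "3.0.4"


def installed_version(package="wikiextractor"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_known_good(version, known_good=KNOWN_GOOD):
    """True only when `version` exactly matches the release reported as working."""
    return version == known_good


if __name__ == "__main__":
    found = installed_version()
    if is_known_good(found):
        print(f"wikiextractor {found} matches the release reported as working")
    else:
        print(f"wikiextractor {found!r} found; this thread reports {KNOWN_GOOD} as working")
```

Running this inside the container before `create_datasets_from_start.sh` avoids waiting through template preprocessing only to hit the failure again.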
Hi @Hannibal046 ,
The issue that I filed and you linked is the one that was causing this problem. More specifically, attardi/wikiextractor@05b5cc7#diff-e315b31c6987451055799a11146b20a0d0c86bbd2a26d5d104e6fb9cfa805511R85 is the problematic change. Note that, if you check the commit log, this change was made after @attardi created the `v3.0.6` tag, which means that installing `wikiextractor` from the `v3.0.6` tag would be fine (this is what 29f5b7a is about).
I distinctly remember that when I was developing LDDL, I was using a certain feature from `wikiextractor` that was only available in 3.0.5+; maybe I eventually decided not to use that feature. This was a while ago, so my memory is vague. If 3.0.4 works, that's fine; otherwise, check out the latest code, which has the fix.
Thanks!
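For anyone who prefers installing from the tag rather than pinning an older release, the command could be assembled like this. It is a sketch only: the repository URL and tag are the ones quoted in this thread, `pip_install_from_tag` is a hypothetical helper, and whether the tagged code installs cleanly in a given container is an assumption to verify.

```python
import sys

# Repository URL and tag as they appear in this thread.
REPO_URL = "https://github.com/attardi/wikiextractor.git"
TAG = "v3.0.6"


def pip_install_from_tag(repo_url=REPO_URL, tag=TAG):
    """Build (not run) a pip command that installs a package from a git tag."""
    requirement = f"git+{repo_url}@{tag}"
    return [sys.executable, "-m", "pip", "install", requirement]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(" ".join(pip_install_from_tag()))
```

The function only builds the argument list, so the command can be inspected or logged before anything is installed.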
Got it!
Thanks so much!
BERT TensorFlow2 also has this issue until attardi/wikiextractor#283 is resolved.
Workaround until then:
rm -rf /workspace/wikiextractor/
(cd /workspace; git clone https://github.com/attardi/wikiextractor.git -b v3.0.6)
bash -x data/create_datasets_from_start.sh