NVIDIA/DeepLearningExamples

[BERT/PyTorch] Extract Wiki Fails

Closed this issue · 6 comments

Related to BERT/PyTorch

Describe the bug
Cannot extract Wikipedia after downloading.

WikiExtractor.err

INFO: Preprocessing '/workspace/bert/data/wikipedia/wikicorpus-en.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
.....
INFO: Preprocessed 21600000 pages
INFO: Preprocessed 21700000 pages
INFO: Preprocessed 21800000 pages
INFO: Loaded 731987 templates in 3178.9s
INFO: Starting page extraction from /workspace/bert/data/wikipedia/wikicorpus-en.xml.bz2.
INFO: Using 5 extract processes.
Process ForkProcess-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
    text = clean(self, text, expand_templates=expand_templates,
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
    text = clean(self, text, expand_templates=expand_templates,
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-3:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
    text = clean(self, text, expand_templates=expand_templates,
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-6:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
    text = clean(self, text, expand_templates=expand_templates,
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
Process ForkProcess-5:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 484, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 976, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 963, in clean_text
    text = clean(self, text, expand_templates=expand_templates,
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py", line 86, in clean
    text = extractor.expandTemplates(text)
  File "/opt/conda/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/opt/conda/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

To Reproduce
Steps to reproduce the behavior:
1. bash scripts/docker/build.sh
2. bash scripts/docker/launch.sh
3. /workspace/bert/data/create_datasets_from_start.sh

Expected behavior
An extracted Wikipedia file, ready for LDDL preprocessing.

I solved this by running pip install wikiextractor==3.0.4
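To confirm which version pip actually installed inside the container, something like the following could be used. This is only a minimal sketch: the helper name `installed_version` is hypothetical, and it relies on nothing beyond the standard library's `importlib.metadata` (available in the Python 3.8 shown in the traceback).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string pip recorded, or None if the package is missing.
    print("wikiextractor:", installed_version("wikiextractor"))
```

Running this before and after the pip install shows whether the pin took effect.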

Thanks, this has been fixed in 29f5b7a, which sets wikiextractor to 3.0.6.

3.0.6 doesn't work; 3.0.4 works.

attardi/wikiextractor#283

Hi @Hannibal046 ,

The issue that I filed, and that you linked, is indeed the cause of this problem. More specifically, attardi/wikiextractor@05b5cc7#diff-e315b31c6987451055799a11146b20a0d0c86bbd2a26d5d104e6fb9cfa805511R85 is the problematic change. Note that, according to the commit log, this change was made after @attardi created the V3.0.6 tag, so installing wikiextractor from the V3.0.6 tag should be fine (this is what 29f5b7a does).
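The `bdb.BdbQuit` at the end of each traceback suggests that a debugger hook was active inside `clean()`. Assuming the problematic line was a leftover debugger call (an assumption, not something confirmed in this thread), a minimal sketch for scanning a Python source file for such calls might look like this. The function name `find_debugger_calls` is hypothetical.

```python
import ast
from typing import List, Tuple


def find_debugger_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line number, call name) for each set_trace() or breakpoint() call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Matches a bare `breakpoint()` call.
        if isinstance(func, ast.Name) and func.id == "breakpoint":
            hits.append((node.lineno, "breakpoint"))
        # Matches `<anything>.set_trace()`, for example `pdb.set_trace()`.
        elif isinstance(func, ast.Attribute) and func.attr == "set_trace":
            hits.append((node.lineno, "set_trace"))
    return sorted(hits)
```

To use it, read the installed file named in the traceback (`/opt/conda/lib/python3.8/site-packages/wikiextractor/extract.py`) and pass its contents to `find_debugger_calls`. An empty result would mean that no such call is present in that copy.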

I distinctly remember that when I was developing LDDL, I was using a feature of wikiextractor that was only available in 3.0.5+; maybe I eventually decided not to use that feature. This was a while ago, so my memory is vague. If 3.0.4 works, that's fine; otherwise, check out the latest code, which has the fix.

Thanks!

GOT IT!

So much THANKS!

BERT TensorFlow2 also has this issue until attardi/wikiextractor#283 is resolved, since its Dockerfile clones wikiextractor without pinning a tag:

RUN git clone https://github.com/attardi/wikiextractor.git

Workaround until this issue is resolved:

        rm -rf /workspace/wikiextractor/
        (cd /workspace; git clone https://github.com/attardi/wikiextractor.git -b v3.0.6)

        bash -x data/create_datasets_from_start.sh