make_wikipedia.py fails on linux
peterbjorgensen opened this issue · 10 comments
Traceback (most recent call last):
File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
multiprocessing.set_start_method("spawn")
File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
raise RuntimeError('context has already been set')
RuntimeError: context has already been set
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 289, in <module>
main()
File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 285, in main
processor(date=args.date, lang=args.lang)
File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 390, in __call__
fn(
File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 285, in _multiprocessing_run_all
assert multiprocessing.get_start_method() == "spawn", "Multiprocessing start method must be spawn"
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiprocessing start method must be spawn
The bug can be fixed by setting multiprocessing.set_start_method("spawn") in the __main__ block of the script. Perhaps dolma's core/parallel.py should use multiprocessing.get_context("spawn") instead, to avoid mutating the global start method.
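A minimal sketch of the workaround described above (the main() below is a hypothetical stand-in for make_wikipedia.py's real entry point):

```python
import multiprocessing

def main():
    # Hypothetical stand-in for make_wikipedia.py's real main().
    print("start method:", multiprocessing.get_start_method())

if __name__ == "__main__":
    # Set the start method before dolma's parallel code runs.
    # force=True makes the call safe even if another import has
    # already set a method, avoiding the
    # "context has already been set" RuntimeError.
    multiprocessing.set_start_method("spawn", force=True)
    main()
```

Library code could instead build a local context with multiprocessing.get_context("spawn") and create its pools from that, which leaves the global start method untouched for the calling script.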
Once this is fixed I also get the following error:
files: 0.00f [03:30, ?f/s] 2023-10-18 09:34:09,483 WARNING dolma.WikiExtractorParallel Failed to process wikipedia_simple/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid stored block lengths
This is the command I use for running it:
python scripts/make_wikipedia.py --output wikipedia_simple --lang simple --processes 4
Hi @peterbjorgensen! Thank you for this bug report. I've made a PR (#64) with these fixes.
I can't seem to reproduce the gzip error... could you tell me a bit more about your setup (platform, Python version, etc.)?
I am on Python 3.11.5 on fully updated Arch Linux, with wikiextractor 3.0.7.
It seems like it produces an incomplete wiki_00.gz archive of 70 MB. I can't gunzip wiki_00.gz either; I get gzip: wiki_00.gz: invalid compressed data--format violated
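One way to see how much of such an archive is still readable (a diagnostic sketch, not part of dolma; the path to wiki_00.gz would be the one from the report above) is to stream-decompress it until the stream fails:

```python
import gzip
import zlib

def count_readable_bytes(path: str) -> int:
    """Stream-decompress a .gz file, returning how many bytes decode
    cleanly before the stream ends or the data becomes unreadable."""
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            while chunk := fh.read(1 << 20):  # read 1 MiB at a time
                total += len(chunk)
    except (EOFError, OSError, zlib.error) as exc:
        # Truncated or corrupt archives fail partway through,
        # e.g. "Error -3 while decompressing data: ...".
        print(f"decompression failed after {total} bytes: {exc}")
    return total
```

If the function reports a failure well before the expected uncompressed size, the archive was most likely truncated while being written rather than mangled in place.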
Even using Python 3.11.8, the error is the same:
Found 1 files to process
files: 0.00f [04:25, ?f/s][2024-02-29 16:53:56 SpawnPoolWorker-32.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 239kd [04:26, 897d/s]
files: 0.00f [04:26, ?f/s]
Because wiki_00.gz cannot be gunzipped, I am unable to follow the taggers step.
@soldni I think this needs to be fixed, please check it.
I remain unable to reproduce this issue on my side, would need more info.
@soldni
I also get the bug:
python scripts/make_wikipedia.py --output ./wikipedia_zh --date 20240401 --lang zh --process 1
Found 1 files to process
files: 0.00f [1:00:20, ?f/s] [2024-04-06 22:45:55 SpawnPoolWorker-3.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia_zh/wiki_20240401_zh/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 1.40Md [1:00:20, 387d/s]
wikiextractor : 3.0.6
I updated wikiextractor from 3.0.6 to 3.0.7, which solved the error Error -3 while decompressing data: invalid block type. But now I get: Error -3 while decompressing data: invalid stored block lengths
Have you solved this problem? I hit it too, and it leaves me unable to follow the taggers step.