allenai/dolma

make_wikipedia.py fails on linux

peterbjorgensen opened this issue · 10 comments

Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 289, in <module>
    main()
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 285, in main
    processor(date=args.date, lang=args.lang)
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 390, in __call__
    fn(
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 285, in _multiprocessing_run_all
    assert multiprocessing.get_start_method() == "spawn", "Multiprocessing start method must be spawn"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiprocessing start method must be spawn

The bug can be fixed by setting
multiprocessing.set_start_method("spawn")
in the __main__ environment.
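
A minimal sketch of that workaround, not the actual contents of scripts/make_wikipedia.py. The helper name configure_start_method is hypothetical, and force=True is an added assumption so that the call also succeeds when an earlier import has already fixed the context:

```python
import multiprocessing


def configure_start_method(method: str = "spawn") -> str:
    """Set the process-wide start method before any pool is created."""
    # force=True (an assumption, see above) replaces a start method that
    # was already fixed, instead of raising
    # "RuntimeError: context has already been set".
    multiprocessing.set_start_method(method, force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    configure_start_method("spawn")
    # The script's own main() would be called here.
```

With the start method already "spawn", the later set_start_method("spawn") call in dolma/core/parallel.py still raises RuntimeError, but the assertion that follows it passes.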

Perhaps the dolma core/parallel.py should use multiprocessing.get_context("spawn") to avoid this.
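
To illustrate the suggestion, here is a rough sketch of a pool built from a context object rather than from the global start method. The names spawn_context and run_in_spawn_pool are hypothetical and are not taken from dolma:

```python
import multiprocessing


def spawn_context():
    """Return a context bound to "spawn" without changing the global default."""
    return multiprocessing.get_context("spawn")


def run_in_spawn_pool(func, items, processes=2):
    """Map func over items in a pool whose workers are always spawned."""
    # The pool is created from the context, so it does not matter which
    # start method the rest of the program has (or has not) selected.
    with spawn_context().Pool(processes=processes) as pool:
        return pool.map(func, items)


if __name__ == "__main__":
    # abs is a picklable builtin, so spawned workers can resolve it.
    print(run_in_spawn_pool(abs, [-1, -2, 3]))
```

Because the context is local to the call, this approach would not raise "context has already been set" even if another library chose a different start method first.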

Once this is fixed I also get the following error:

files: 0.00f [03:30, ?f/s]        2023-10-18 09:34:09,483 WARNING dolma.WikiExtractorParallel Failed to process wikipedia_simple/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid stored block lengths

This is the command I use for running it:
python scripts/make_wikipedia.py --output wikipedia_simple --lang simple --processes 4

soldni commented

Hi @peterbjorgensen! Thank you for this bug report. I've opened a PR (#64) that includes these fixes.

I can't seem to reproduce the gzip error. Could you tell me a bit more about your setup (platform, Python version, etc.)?

I am on Python 3.11.5 on fully updated Arch Linux, wikiextractor-3.0.7.
It seems like it makes an incomplete wiki_00.gz archive of 70 MB.
I can't gunzip wiki_00.gz either - I get gzip: wiki_00.gz: invalid compressed data--format violated
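
One way to confirm that the archive itself is damaged, independent of dolma, is to stream it through Python's gzip module. This is only a sketch; gzip_is_readable is a hypothetical helper and not part of dolma or wikiextractor:

```python
import gzip
import zlib


def gzip_is_readable(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole gzip file at path decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as f:
            # Read in chunks and discard the data, so that a large
            # archive is checked without being held in memory.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad header or checksum),
        # EOFError a truncated stream, zlib.error invalid deflate data
        # such as "Error -3 while decompressing data".
        return False
    return True
```

A False result for wiki_00.gz would point to the file having been written incorrectly or incompletely, rather than to a problem in the reader.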

Even with Python 3.11.8, I get the same error:
Found 1 files to process
files: 0.00f [04:25, ?f/s][2024-02-29 16:53:56 SpawnPoolWorker-32.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia/wiki_20231001_simple/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 239kd [04:26, 897d/s]
files: 0.00f [04:26, ?f/s]

Because wiki_00.gz cannot be gunzipped, I am unable to proceed to the taggers step.

@soldni I think this needs to be fixed, please check it.

I remain unable to reproduce this issue on my side; I would need more information.

@soldni
I'm also hitting this bug:
python scripts/make_wikipedia.py --output ./wikipedia_zh --date 20240401 --lang zh --process 1
Found 1 files to process
files: 0.00f [1:00:20, ?f/s] [2024-04-06 22:45:55 SpawnPoolWorker-3.dolma.WikiExtractorParallel WARNING] Failed to process wikipedia_zh/wiki_20240401_zh/AA/wiki_00.gz: Error -3 while decompressing data: invalid block type
documents: 1.40Md [1:00:20, 387d/s]

wikiextractor : 3.0.6

Updating wikiextractor from 3.0.6 to 3.0.7 resolved the "Error -3 while decompressing data: invalid block type" error, but now I get: Error -3 while decompressing data: invalid stored block lengths

Has this problem been solved? I ran into it as well, and I can't proceed to the tagger step.