Error when running prepare_data
ZeroneBo opened this issue · 2 comments
ZeroneBo commented
When I run the prepare_data pipeline, the following error is reported:
100%|█████████▉| 2753/2758 [04:10<00:00, 48.56it/s]
100%|█████████▉| 2756/2758 [04:11<00:00, 48.56it/s][2023-07-08 13:56:10,195][stopes.launcher][INFO] - for DedupSharding_74b624bdfbf66a3f45d60ba07069a425735015733297896f8d469f2934ff9be1 found 589 already cached array results,2 left to compute out of 591
100%|█████████▉| 2756/2758 [04:11<00:00, 48.56it/s][2023-07-08 13:56:10,211][stopes.launcher][INFO] - submitted job array for DedupSharding_74b624bdfbf66a3f45d60ba07069a425735015733297896f8d469f2934ff9be1: ['25693', '25739']
100%|█████████▉| 2756/2758 [04:28<00:00, 48.56it/s]
100%|█████████▉| 2757/2758 [30:59<01:27, 87.93s/it]
100%|██████████| 2758/2758 [31:29<00:00, 83.74s/it]
100%|██████████| 2758/2758 [31:29<00:00, 83.74s/it][2023-07-08 14:23:28,254][stopes.jobs][INFO] - Jobs progress: {'Completed': 2, 'Total': 2}
31%|███ | 2758/8971 [31:31<144:31:38, 83.74s/it]Traceback (most recent call last):
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/local/_local.py", line 16, in <module>
controller.run()
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/local/local.py", line 327, in run
self.start_tasks()
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/local/local.py", line 272, in start_tasks
subprocess.Popen( # pylint: disable=consider-using-with
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/subprocess.py", line 1754, in _execute_child
self.pid = _posixsubprocess.fork_exec(
BlockingIOError: [Errno 11] Resource temporarily unavailable
18%|█▊ | 2758/15184 [32:53<289:03:17, 83.74s/it]
18%|█▊ | 2758/15636 [58:41<299:34:09, 83.74s/it]
17%|█▋ | 2758/16088 [59:00<310:05:01, 83.74s/it]Exception ignored in: <function Job.__del__ at 0x7feeb23f9e50>
Traceback (most recent call last):
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/core/core.py", line 512, in __del__
if not self.watcher.is_done(self.job_id, mode="cache"):
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/core/core.py", line 550, in __getattr__
raise AttributeError(
AttributeError: Accesssing job attributes is forbidden within 'with executor.batch()' context
Exception ignored in: <function Job.__del__ at 0x7feeb23f9e50>
.
.
.
Error executing job with overrides: ['output_dir=/home/bli/blidata/models/dataset/nllb_primary/prepared_data']
Traceback (most recent call last):
File "/data/bli/pylib/stopes/stopes/pipelines/prepare_data/prepare_data.py", line 140, in main
asyncio.run(pipeline.run())
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/data/bli/pylib/stopes/stopes/pipelines/prepare_data/prepare_data.py", line 75, in run
await binarize(
File "/data/bli/pylib/stopes/stopes/pipelines/prepare_data/binarize.py", line 85, in binarize
_, _, src_eval_binarized, tgt_eval_binarized = await asyncio.gather(
File "/data/bli/pylib/stopes/stopes/core/launcher.py", line 203, in schedule
result = await self._schedule_array(module, value_array)
File "/data/bli/pylib/stopes/stopes/core/launcher.py", line 335, in _schedule_array
task = task.waiting_on_job(job)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/contextlib.py", line 126, in __exit__
next(self.gen)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/core/core.py", line 710, in batch
self._submit_delayed_batch()
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/core/core.py", line 723, in _submit_delayed_batch
new_jobs = self._internal_process_submissions(submissions)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
return self._executor._internal_process_submissions(delayed_submissions)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/core/core.py", line 893, in _internal_process_submissions
job = self._submit_command(self._submitit_command_str)
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/local/local.py", line 163, in _submit_command
process = start_controller(
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/site-packages/submitit/local/local.py", line 225, in start_controller
process = subprocess.Popen(
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/bli/pkgs/miniconda3/envs/torch-2/lib/python3.9/subprocess.py", line 1754, in _execute_child
self.pid = _posixsubprocess.fork_exec(
BlockingIOError: [Errno 11] Resource temporarily unavailable
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
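For context on the final exception: errno 11 is EAGAIN, so `fork_exec` failed because the kernel refused to create another process, most likely because the per-user process limit was exhausted by the many local worker processes. A quick diagnostic check (Linux only, not part of stopes):

```shell
# errno 11 (EAGAIN) from fork_exec usually means the per-user process
# limit (RLIMIT_NPROC, i.e. `ulimit -u`) was exhausted.
ulimit -u                                # soft limit on user processes
ps -u "$(id -u)" --no-headers | wc -l    # processes currently owned by this user
```

If the second number is close to the first at the time of the crash, the launcher is forking more subprocesses than the account is allowed to run.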
My config is as follows:
prepare_data.yaml:
defaults:
  - launcher: local
  - preprocessing: default
  - vocab: default
  - dedup: neither
  - sharding: default
  - _self_

output_dir: ${output_dir}
launcher:
  partition: null  # set as null if running locally
  cache:
    caching_dir: ${output_dir}/cache  # Cache won't be re-used if you change the output_dir.
corpora:
  train:
    aar-amh_Ethi:
      hornmt:
        src: /home/bli/blidata/models/dataset/nllb_primary/primary_filted/train_primary/aar-amh_Ethi/hornmt.aar.gz
        tgt: /home/bli/blidata/models/dataset/nllb_primary/primary_filted/train_primary/aar-amh_Ethi/hornmt.amh_Ethi.gz
  valid:
    ace_Arab-eng_Latn:
      flores200:
        src: /home/bli/blidata/models/dataset/nllb_primary/flores200/dev/ace_Arab-eng_Latn/flores200.ace_Arab
        tgt: /home/bli/blidata/models/dataset/nllb_primary/flores200/dev/ace_Arab-eng_Latn/flores200.eng_Latn
  test:
    ace_Arab-eng_Latn:
      flores200:
        src: /home/bli/blidata/models/dataset/nllb_primary/flores200/devtest/ace_Arab-eng_Latn/flores200.ace_Arab
        tgt: /home/bli/blidata/models/dataset/nllb_primary/flores200/devtest/ace_Arab-eng_Latn/flores200.eng_Latn
launcher/local.yaml:
defaults:
  - cache: file_cache

_target_: stopes.core.Launcher
log_folder: executor_logs
cluster: local
partition: null
max_jobarray_jobs: 8
launcher/submitit.yaml:
defaults:
  - cache: file_cache

_target_: stopes.core.Launcher
log_folder: executor_logs
cluster: slurm
partition: null
max_jobarray_jobs: 8
sharding/default.yaml:
max_examples_per_shard: 100_000
smallest_shard: 25000
binarize_num_workers: 30
The other configs are the same as the defaults.
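Since fork is failing with EAGAIN, one workaround I could try (my own guess, not a verified fix) is lowering the concurrency settings in the configs above so that the local launcher forks fewer subprocesses at once, e.g.:

```yaml
# sharding/default.yaml — same fields as above, with a hypothetical
# lower worker count so fewer binarizer subprocesses run concurrently
max_examples_per_shard: 100_000
smallest_shard: 25000
binarize_num_workers: 4   # was 30
```

`max_jobarray_jobs` in launcher/local.yaml could likewise be reduced from 8.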
I also noticed that before the error occurs, CPU and memory usage can reach 100%.
I want to know how to correctly preprocess the filtered data and run prepare_data.py.
A quick reply would be very helpful. Thanks!
gordicaleksa commented
Did you figure it out?
ZeroneBo commented
@gordicaleksa No, I gave up using stopes.