CHILD PROCESS FAILED WITH NO ERROR_FILE
masoud-monajati opened this issue · 6 comments
Hi there,
Any idea why I'm getting this error when running the training script:
python -m torch.distributed.run --nproc_per_node=8 train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/hr_to_lr
after successfully running the following script for tensorizing:
python train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel --do_tensorize --n_gpu 4 --n_process 40
my log file:
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 17057 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in
main()
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
run(args)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
=======================================
Root Cause:
[0]:
time: 2022-02-04_20:50:02
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 17057)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2022-02-04_20:50:02
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 17058)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2022-02-04_20:50:02
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 17059)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[3]:
time: 2022-02-04_20:50:02
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 17060)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[4]:
time: 2022-02-04_20:50:02
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 17061)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[5]:
time: 2022-02-04_20:50:02
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 17062)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[6]:
time: 2022-02-04_20:50:02
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 17063)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[7]:
time: 2022-02-04_20:50:02
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 17064)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Hi @monajati,
Hmm, I am not sure about this specific error. There are related discussions at ultralytics/yolov5#3897 and https://discuss.pytorch.org/t/having-childfailederror/129464/3, so maybe you can take a look at those?
Besides, there is one problem with your scripts: you specified n_gpu=4 for tensorizing but n_gpu=8 for the actual training. n_gpu should be the same for tensorizing and training. I recommend deleting the tensorized directory, running tensorizing again with n_gpu=8, and then rerunning the training script.
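For example, assuming the same flags as in your commands above, the matched pair would be (only --n_gpu for tensorizing changes to 8):
python train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel --do_tensorize --n_gpu 8 --n_process 40
python -m torch.distributed.run --nproc_per_node=8 train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/hr_to_lr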
Sorry, that was a typo. Trying both scripts with n_gpu=8, I'm getting the same error. I'll check the links you shared. Thanks a lot!
Hi @shmsw25,
It seems I missed the error message in the screenshot of my log. Here it is:
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=6, worker_count=24, timeout=0:30:00)
This error occurs when one of the subprocesses fails. I should be able to help better if you could share the entire error message as well as your hardware specs (e.g. GPU memory).
Closing this issue since the thread is inactive.
Oh, I missed it too. Thanks for the reminder.