CHILD PROCESS FAILED WITH NO ERROR_FILE
masoud-monajati opened this issue · 6 comments
Hi there,
Any idea why I'm getting this error when running the training script:
python -m torch.distributed.run --nproc_per_node=8 train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/hr_to_lr
after successfully running the following script for tensorizing:
python train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel --do_tensorize --n_gpu 4 --n_process 40
my log file:
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 17057 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/monajati/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 637, in
main()
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 629, in main
run(args)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 621, in run
elastic_launch(
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/monajati/miniconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
=======================================
Root Cause:
[0]:
time: 2022-02-04_20:50:02
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 17057)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
[1]:
time: 2022-02-04_20:50:02
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 17058)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2022-02-04_20:50:02
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 17059)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[3]:
time: 2022-02-04_20:50:02
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 17060)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[4]:
time: 2022-02-04_20:50:02
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 17061)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[5]:
time: 2022-02-04_20:50:02
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 17062)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[6]:
time: 2022-02-04_20:50:02
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 17063)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[7]:
time: 2022-02-04_20:50:02
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 17064)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Hi @monajati,
Hmm, I am not sure about this specific error. There are related discussions at ultralytics/yolov5#3897 and https://discuss.pytorch.org/t/having-childfailederror/129464/3, so maybe you can take a look at those?
Besides, there is one problem with your scripts: you specified n_gpu=4 for tensorizing but n_gpu=8 for the actual training. n_gpu should be the same for tensorizing and training. I recommend deleting the tensorized directory, running tensorizing again with n_gpu=8, and then rerunning the training script.
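For example, assuming the same flags as in your commands above, the matched pair would be (only --n_gpu for tensorizing changes to 8):
python train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --use_demonstrations --method channel --do_tensorize --n_gpu 8 --n_process 40
python -m torch.distributed.run --nproc_per_node=8 train.py --task hr_to_lr --k 16384 --test_k 16 --seed 100 --train_seed 1 --use_demonstrations --method channel --n_gpu 8 --batch_size 1 --lr 1e-05 --fp16 --optimization 8bit-adam --out_dir checkpoints/channel-metaicl/hr_to_lr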
Sorry, that was a typo. Trying both scripts with n_gpu=8, I'm getting the same error. I'll check the links you shared. Thanks a lot!
Hi @shmsw25,
It seems I missed the error message in the screenshot of my log. Here it is:
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=6, worker_count=24, timeout=0:30:00)
This error occurs when one of the subprocesses fails. I should be able to help better if you could share the entire error message as well as your hardware specs (e.g. GPU memory).
Closing this issue since the thread is inactive.
Oh, I missed it too. Thanks for the reminder.