An error occurred while using the training command
Closed this issue · 2 comments
Air1000thsummer commented
I encountered an error while following the training command you provided:
"python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03".
May I ask how to resolve this issue?
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
CUDA Device count: 4
CUDA Device count: 4
Traceback (most recent call last):
File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
repo = git.Repo(".")
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
Traceback (most recent call last):
File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
repo = git.Repo(".")
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1066960) of binary: /home/hyf/anaconda3/envs/xmem-repro/bin/python
Traceback (most recent call last):
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 728, in <module>
main()
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2024-03-06_20:58:15
host : user-MD72-HB3-00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1066961)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-03-06_20:58:15
host : user-MD72-HB3-00
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1066960)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
hkchengrex commented
You would need to clone the repo (git clone) or initialize git in the directory.
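The traceback shows that train.py calls git.Repo(".") at startup, which raises InvalidGitRepositoryError when the working directory has no .git metadata (for example, when the source was downloaded as an archive instead of cloned). A minimal sketch for checking this condition before launching training; the helper name is hypothetical, and it shells out to the git CLI rather than using GitPython so it has no extra dependencies:

```python
import subprocess

def is_git_repo(path="."):
    # Ask git itself whether `path` is inside a working tree.
    # Returns False if git is not installed or `path` is not in a repo.
    try:
        out = subprocess.run(
            ["git", "-C", path, "rev-parse", "--is-inside-work-tree"],
            capture_output=True,
            text=True,
        )
        return out.returncode == 0 and out.stdout.strip() == "true"
    except FileNotFoundError:
        return False

if __name__ == "__main__":
    if not is_git_repo("."):
        print("Not a git repository: run `git init` (and commit once) "
              "or `git clone` the repo before training.")
```

If the training script reads the current commit hash for logging, a bare `git init` is not enough; an initial commit is also needed.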
Air1000thsummer commented
Thank you for your reply; the problem has been solved.