An error occurred while using the training command
Closed this issue · 2 comments
Air1000thsummer commented
I encountered an error while following the training command you provided:
"python -m torch.distributed.run --master_port 25764 --nproc_per_node=2 train.py --exp_id retrain-a6000 --stage 03".
May I ask how to resolve this issue?
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
CUDA Device count: 4
CUDA Device count: 4
Traceback (most recent call last):
File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
repo = git.Repo(".")
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
Traceback (most recent call last):
File "/home/hyf/WorkPlace/XMemWorkShop/main-XMem/train.py", line 36, in <module>
repo = git.Repo(".")
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/git/repo/base.py", line 276, in __init__
raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/hyf/WorkPlace/XMemWorkShop/main-XMem
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1066960) of binary: /home/hyf/anaconda3/envs/xmem-repro/bin/python
Traceback (most recent call last):
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 728, in <module>
main()
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hyf/anaconda3/envs/xmem-repro/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2024-03-06_20:58:15
host : user-MD72-HB3-00
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1066961)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-03-06_20:58:15
host : user-MD72-HB3-00
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1066960)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
hkchengrex commented
You would need to clone the repo (git clone) or initialize git in the directory.
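The traceback shows that train.py calls git.Repo(".") at startup, which raises InvalidGitRepositoryError when the working directory has no .git metadata (for example, when the source was downloaded as an archive instead of cloned). A minimal sketch for checking this condition before launching training; the helper name is hypothetical, and it shells out to the git CLI rather than using GitPython so it has no extra dependencies:

```python
import subprocess

def is_git_repo(path="."):
    # Ask git itself whether `path` is inside a working tree.
    # Returns False if git is not installed or `path` is not in a repo.
    try:
        out = subprocess.run(
            ["git", "-C", path, "rev-parse", "--is-inside-work-tree"],
            capture_output=True,
            text=True,
        )
        return out.returncode == 0 and out.stdout.strip() == "true"
    except FileNotFoundError:
        return False

if __name__ == "__main__":
    if not is_git_repo("."):
        print("Not a git repository: run `git init` (and commit once) "
              "or `git clone` the repo before training.")
```

If the training script reads the current commit hash for logging, a bare `git init` is not enough; an initial commit is also needed.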
Air1000thsummer commented
Thank you for your reply; the problem has been solved.