hzwer/ECCV2022-RIFE

environment version

pengjunxing opened this issue · 2 comments

Have you replicated the training project, my friend? I'd like to ask about your environment: which versions of PyTorch, CUDA, etc. are you using?

I encountered an issue while attempting to replicate the project:
(RIFE) gzdx@gzdx-Super-Server:~/RIFE$ python3 -m torch.distributed.launch --nproc_per_node=4 train.py --world_size=2
/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
W0720 01:05:05.369797 140192998098752 torch/distributed/run.py:757]
W0720 01:05:05.369797 140192998098752 torch/distributed/run.py:757] *****************************************
W0720 01:05:05.369797 140192998098752 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0720 01:05:05.369797 140192998098752 torch/distributed/run.py:757] *****************************************
usage: train.py [-h] [--epoch EPOCH] [--batch_size BATCH_SIZE] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
train.py: error: unrecognized arguments: --local-rank=3
usage: train.py [-h] [--epoch EPOCH] [--batch_size BATCH_SIZE] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
train.py: error: unrecognized arguments: --local-rank=1
usage: train.py [-h] [--epoch EPOCH] [--batch_size BATCH_SIZE] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
train.py: error: unrecognized arguments: --local-rank=2
usage: train.py [-h] [--epoch EPOCH] [--batch_size BATCH_SIZE] [--local_rank LOCAL_RANK] [--world_size WORLD_SIZE]
train.py: error: unrecognized arguments: --local-rank=0
E0720 01:05:10.407284 140192998098752 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 301895) of binary: /home/gzdx/anaconda3/envs/RIFE/bin/python3
Traceback (most recent call last):
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launch.py", line 198, in
main()
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/gzdx/anaconda3/envs/RIFE/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2024-07-20_01:05:10
host : gzdx-Super-Server
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 301896)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-07-20_01:05:10

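For reference, the FutureWarning above points at the mismatch visible in the log: the newer launcher passes --local-rank (hyphen) or sets the LOCAL_RANK environment variable, while train.py only defines --local_rank (underscore). Below is a minimal sketch (not the repository's actual code) of an argparse setup that follows the warning's advice and reads LOCAL_RANK from the environment; the flags are taken from the usage message above and the default values are placeholders.

# Minimal sketch, assuming train.py uses argparse with the flags shown in its usage message.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--epoch', type=int, default=300)        # placeholder default
parser.add_argument('--batch_size', type=int, default=16)    # placeholder default
parser.add_argument('--local_rank', type=int, default=0)     # kept for the old launcher
parser.add_argument('--world_size', type=int, default=4)     # placeholder default
args = parser.parse_args()

# Prefer LOCAL_RANK from the environment when it is set (torchrun / --use-env),
# falling back to the --local_rank flag otherwise.
args.local_rank = int(os.environ.get('LOCAL_RANK', args.local_rank))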
As I understand, "nproc_per_node = world_size = gpu number of your environment" may be a available setting.

Thanks, bro. My problem has been solved.