Issue with Training on one GPU
Hello, thank you for making the code available. While trying to run the training code, I have been stuck on the distributed training part. I have only one GPU available, and the code requires DDP-related configuration. The error occurs at "rank = int(os.environ["RANK"])". Is there any way to run it on a single GPU? Thank you in advance.
Hello!
Normally, our code can run on a single GPU by setting nproc_per_node=1, for example:
$ python -m torch.distributed.launch --nproc_per_node=1 train_diffusion.py --dataset MVTec-AD
If you still can't run it with the above command, please try to provide more error information.
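For readers who hit the KeyError on os.environ["RANK"] when starting the script without a launcher, the following is a minimal sketch of reading the launcher's environment variables with single-process defaults. The helper name get_dist_env is hypothetical and is not part of this repository; the sketch only covers reading the variables, and the training script may still need a process group to be initialized.

```python
import os


def get_dist_env(environ=None):
    """Read the launcher's variables, falling back to single-process defaults.

    Hypothetical helper, not taken from the repository's training scripts.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", 0))              # global rank of this process
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    local_rank = int(env.get("LOCAL_RANK", 0))  # GPU index on this node
    return rank, world_size, local_rank


# With no launcher variables set, this describes a single process on GPU 0.
rank, world_size, local_rank = get_dist_env({})
```

When the script is started through torch.distributed.launch as shown above, the launcher sets these variables itself and the defaults are not used.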
I'm not sure if this issue is caused by the GPU and DDP. If you have defined your own dataset, you need to add your custom dataset to the choices in the arguments. Alternatively, you can paste the complete error message here.
Sorry, I don't know the exact cause of your error; it could be due to various reasons. You may refer to the following:
Vision-CAIR/MiniGPT-4#237
- Downgrade to a lower version of torch. I haven't tested the code on torch 2.x.
- Reduce the batch size. I'm not sure if it will be effective.
Hello, thank you for the insights. I have resolved the issue, but I am now getting another error: CUDA memory exceeded. Is there a minimum GPU requirement?
Training diffusion models requires relatively high GPU memory. It is recommended to use a 48G (or larger) GPU, allowing for a batch size of 5. If you only have a 24G GPU, you can only set the batch size to 1. If you only want to reproduce the experimental results, you can use the checkpoints I provided.
The checkpoints for the diffusion model and classifier are available, but when I tried to reproduce the RealNet results, I got a missing-checkpoint error:
[Errno 2] No such file or directory: 'experiments/MVTec-AD/realnet_checkpoints/bottle/ckpt_best.pth.tar'
So I tried training the RealNet network with the following command:
$ python -m torch.distributed.launch --nproc_per_node=1 train_realnet.py --dataset MVTec-AD --class_name bottle
At that point, I got the CUDA memory error.
The training of RealNet only requires 24GB of GPU memory.
Hello, I ran into the same problem as you, with torch 2.3 and CUDA 11.8. Did you solve it by downgrading torch? If so, which version should I downgrade to?
This code has not been tested on torch 2.x. I use torch 1.11.
Hello, I ran into the same problem as you. How did you solve it?
parser.add_argument("--local-rank", default=-1, type=int)
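The one-line reply above is not explained in the thread. A likely reading, stated here as an assumption, is that on torch 2.x the launcher passes the local rank as --local-rank (hyphen) rather than --local_rank (underscore), so the script's parser must accept the hyphenated form. A minimal sketch that accepts both spellings follows; build_parser is a hypothetical name and the rest of the repository's arguments are omitted.

```python
import argparse


def build_parser():
    """Build a parser that accepts either spelling of the local-rank flag.

    Hypothetical helper; the repository's other arguments are omitted.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local-rank", "--local_rank",  # hyphen and underscore spellings
        dest="local_rank",
        default=-1,
        type=int,
    )
    return parser


# Either spelling ends up in args.local_rank.
args = build_parser().parse_args(["--local-rank", "0"])
```

Registering both option strings on one argument keeps the same script usable across launcher versions without branching on the torch version.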