Recommended GPU / GPU memory
Hello,
Can anyone share their experience of which GPU (and how much memory) is the minimum needed to train a DeepLab model with this implementation?
Thank you very much :)
I have tried training on PASCAL VOC. With a batch size of 2 it requires almost 16 GB of GPU memory, and with a batch size of 3, 18.5 GB. With batch size= it requires almost 32 GB.
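As a rough illustration of what those two figures imply, here is a minimal sketch that fits a straight line through the two complete measurements (batch size 2 at about 16 GB, batch size 3 at 18.5 GB). It assumes memory grows linearly with batch size, which is only an approximation, and the helper names `fit_line` and `max_batch_size` are made up for this example. The numbers describe only that poster's unspecified setup (backbone, crop size, framework version), so they should not be expected to transfer to other configurations.

```python
# Two measurements quoted above: (batch size, approximate GPU memory in GB).
measurements = [(2, 16.0), (3, 18.5)]


def fit_line(points):
    """Return (fixed_gb, per_sample_gb) for the line through two points.

    Assumption: memory = fixed_gb + per_sample_gb * batch_size.
    """
    (b0, m0), (b1, m1) = points
    per_sample_gb = (m1 - m0) / (b1 - b0)
    fixed_gb = m0 - per_sample_gb * b0
    return fixed_gb, per_sample_gb


def max_batch_size(budget_gb, fixed_gb, per_sample_gb):
    """Largest batch size whose estimated memory fits in budget_gb (0 if none)."""
    if budget_gb < fixed_gb + per_sample_gb:
        return 0
    return int((budget_gb - fixed_gb) // per_sample_gb)


fixed_gb, per_sample_gb = fit_line(measurements)
print(f"estimated fixed cost: {fixed_gb} GB, per sample: {per_sample_gb} GB")
print("estimated max batch size on a 24 GB card:",
      max_batch_size(24.0, fixed_gb, per_sample_gb))
```

Under this fit, roughly 11 GB is taken before any sample is processed and each extra sample costs about 2.5 GB. Treat it as a back-of-the-envelope estimate for one setup, not a rule.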
Hi @blaskowitz100 , thank you for your interest in the repo.
In my case, I was able to train a model on the PASCAL VOC dataset with a batch size of 9 using a GTX 1080Ti and 16 GB of main memory, and obtained decent results.
If you have a lower-performance GPU, you can try reducing the batch size and check whether the model still trains. Regarding main memory, 16GB is enough, at least in my case.
I hope this answers your question.
Hi @rishizek, can I ask you for more details?
I'm trying to train DeepLabv3+ but my batch size is too small.
My current settings and environment are:
input size - 512, 512, 3
backbone - xception, OS16
gpu - titan xp (12GB)
--> batch size - 4
I think 4 is too small to train on my own dataset.
In your experiment above, what were your settings for the backbone model and input size?
Thank you in advance for your help!
So I'm guessing that, to train on PASCAL VOC 2012 as shown in the script local_test.sh, there are not many options regarding your graphics card. In that example, which uses the starting checkpoint for DeepLabv3 - xception65, no matter how you play with the hyperparameters, not even with batch_size==1, you won't get results with a 6GB GPU.
Furthermore, I'm in the same situation when using the MobilenetV2 script local_test_mobilenetv2.sh and its starting checkpoint.
NVIDIA's lineup of cards with more than 8GB of dedicated GPU memory (shared GPU memory is not used by TensorFlow) is pretty thin. Not even a GTX 1080 (8 GB) would do. You need to go for a GTX 1080 Ti (11GB), a Titan, or an RTX 2080 Ti (the regular RTX 2080 has only 8GB).
Update:
Trying to start the training process using the above examples results in several CUDA memory errors such as
tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
and the program exits with error code 1. In the Task Manager I can see the memory being filled to capacity.
But after tinkering with the GPUOptions object, using the following lines
gpu_options = train.tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = train.tf.Session(config=train.tf.ConfigProto(gpu_options=gpu_options))
the script throws several warnings regarding performance considerations
tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.93GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
but the training process starts successfully.
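For anyone who wants to reuse the GPUOptions workaround above, here is a minimal sketch that turns a memory cap in GiB into the `per_process_gpu_memory_fraction` value and builds a session from it. The helper names `memory_fraction` and `make_session` are hypothetical, and the sketch assumes a TensorFlow version that exposes the `tf.compat.v1` namespace (on older 1.x releases the same classes are available directly as `tf.GPUOptions`, `tf.ConfigProto`, and `tf.Session`). Setting `allow_growth=True` is an addition not used in the post above; it asks TensorFlow to allocate GPU memory on demand instead of reserving it all up front.

```python
def memory_fraction(limit_gib, total_gib):
    """Fraction of GPU memory to request, clamped to at most 1.0.

    limit_gib: how much memory this process may use (hypothetical cap).
    total_gib: dedicated memory of the card, e.g. 6.0 for a 6GB GPU.
    """
    if limit_gib <= 0 or total_gib <= 0:
        raise ValueError("limit_gib and total_gib must be positive")
    return min(1.0, limit_gib / total_gib)


def make_session(fraction, allow_growth=True):
    """Create a TF1-style session with a capped GPU memory fraction."""
    # Imported here so the helper above can be used without TensorFlow installed.
    import tensorflow as tf

    gpu_options = tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=fraction,
        allow_growth=allow_growth,
    )
    config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
    return tf.compat.v1.Session(config=config)


# Example: cap the process at 3 GiB on a 6 GiB card, i.e. the 0.5 used above.
fraction = memory_fraction(3.0, 6.0)
print("per_process_gpu_memory_fraction =", fraction)
# sess = make_session(fraction)  # uncomment on a machine with TensorFlow and a GPU
```

Capping the fraction does not reduce what the model actually needs, so allocator warnings like the one quoted above may still appear and training may run slower.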