LabeliaLabs/distributed-learning-contributivity

Does the library benefit from GPU ?

arthurPignet opened this issue · 8 comments

Since we use the TensorFlow backend only for a few Keras epochs (gradient_pass), it is not obvious that we really benefit from running the TensorFlow backend on a GPU.

It is really an open question: maybe the TensorFlow backend is also used by NumPy and the like for all the computations we need.

If configured properly, we do benefit from the GPU acceleration.

From my own experience, most of the time is taken by the fit() method and so using the GPU makes a huge difference.

I looked into this question, which I have often asked myself in the past. We can check the state of the GPU in a terminal with nvidia-smi, but that is quite limited. I found an interesting article on the subject: https://towardsdatascience.com/measuring-actual-gpu-usage-for-deep-learning-training-e2bf3654bcfd
Plotting curves of GPU usage and memory alongside CPU usage and memory could help find better ways to use the GPU if needed, e.g. by changing the batch size. MAX_BATCH_SIZE (#232) could then be determined from the user's configuration.

Using nvidia-ml-py3, which provides Python bindings for the NVIDIA Management Library, together with psutil, it would be possible to launch a thread at the start of the program under investigation that regularly records the state of the GPU and CPU. I can work on that if that's what you meant, Arthur, but I'll probably need some help deciding where to implement this feature, which would mainly be a debugging tool.
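To make the idea concrete, here is a minimal sketch of such a recording thread. The names `read_usage` and `UsageMonitor` are hypothetical and not part of mplc. The sketch assumes psutil and pynvml (the module installed by nvidia-ml-py3) are optional: when either is missing, or no NVIDIA driver is found, the corresponding fields simply stay at `None`.

```python
import threading
import time


def read_usage():
    """Return one usage sample; fields stay None when a tool is unavailable."""
    sample = {"time": time.time(), "cpu_percent": None, "ram_percent": None,
              "gpu_percent": None, "gpu_mem_used_mb": None}
    try:
        import psutil  # optional dependency
        sample["cpu_percent"] = psutil.cpu_percent(interval=None)
        sample["ram_percent"] = psutil.virtual_memory().percent
    except ImportError:
        pass
    try:
        import pynvml  # module provided by the nvidia-ml-py3 package
        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only
            sample["gpu_percent"] = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            sample["gpu_mem_used_mb"] = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
        finally:
            pynvml.nvmlShutdown()
    except Exception:  # pynvml missing, no NVIDIA driver, or no GPU
        pass
    return sample


class UsageMonitor:
    """Hypothetical helper: samples usage in a background thread."""

    def __init__(self, interval=1.0, sampler=read_usage):
        self.interval = interval
        self.sampler = sampler
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Take a sample, then sleep until the next tick or until stopped.
        while not self._stop.is_set():
            self.samples.append(self.sampler())
            self._stop.wait(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False
```

Wrapping the code under investigation in `with UsageMonitor(interval=0.5) as monitor:` would leave the recorded samples in `monitor.samples`, ready to be plotted afterwards.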

Such a tool would be awesome!
This feature could perhaps be implemented in mplc.utils? All its dependencies would then be listed only in the dev-req.

Initially I asked the question because of the way the library works: we use TensorFlow (via Keras) for the training (on GPU), then we average the weights with NumPy (on one CPU?), re-split the dataset into minibatches (CPU?) and train again. So it seems to me that there are a lot of non-parallelized operations on the CPU, and a lot of exchanges between GPU memory and CPU memory.
If there is a significant time saving to be made, it would be really interesting, especially for launching ambitious benchmarks.
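For reference, the CPU-side averaging step described above can be sketched as follows. This is an illustration only, not the actual mplc code: the function name `average_weights` and its arguments are hypothetical. It assumes each partner's weights come as a list of NumPy arrays, one per layer, which is the format returned by Keras `model.get_weights()`.

```python
import numpy as np


def average_weights(partner_weights, partner_sizes=None):
    """Average per-layer weights across partners on the CPU.

    partner_weights: one entry per partner, each a list of np.ndarray (one per layer).
    partner_sizes: optional weighting, e.g. number of samples per partner.
    """
    n_partners = len(partner_weights)
    if partner_sizes is None:
        coeffs = np.full(n_partners, 1.0 / n_partners)  # uniform average
    else:
        coeffs = np.asarray(partner_sizes, dtype=float)
        coeffs = coeffs / coeffs.sum()

    averaged = []
    for layer_group in zip(*partner_weights):  # same layer across all partners
        stacked = np.stack(layer_group)        # shape: (n_partners, *layer_shape)
        averaged.append(np.tensordot(coeffs, stacked, axes=1))
    return averaged
```

In a loop of this kind, every round copies the weights from the GPU to host memory, averages them in NumPy, and copies the result back before the next training pass, which is the back-and-forth the comment worries about.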

All right, thanks for the details. I'll have a look at it soon!

Sorry, misclicked.
There is a major difference in computing time between training the model in one go with Keras and training it with mplc.
I think it would be better if all the operations between epochs could be done through the TensorFlow API, but I am not sure whether that is possible.

Yes, the weight averaging could actually be done in TensorFlow, using a function like this one: https://www.tensorflow.org/api_docs/python/tf/keras/backend/mean
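A minimal sketch of what averaging with TensorFlow ops could look like is given below. The function name `average_weights_tf` and its arguments are hypothetical, and the sketch uses the core ops `tf.reduce_mean` and `tf.tensordot` rather than the linked backend helper. It assumes each partner's weights are a list of tensors or variables, one per layer, with matching shapes and dtypes.

```python
import tensorflow as tf


def average_weights_tf(partner_weights, coeffs=None):
    """Average per-layer weights across partners using TensorFlow ops.

    partner_weights: one entry per partner, each a list of tensors (one per layer).
    coeffs: optional per-partner coefficients, assumed to sum to 1.
    """
    averaged = []
    for layer_group in zip(*partner_weights):  # same layer across all partners
        stacked = tf.stack(layer_group)        # shape: (n_partners, *layer_shape)
        if coeffs is None:
            averaged.append(tf.reduce_mean(stacked, axis=0))
        else:
            c = tf.cast(coeffs, stacked.dtype)
            averaged.append(tf.tensordot(c, stacked, axes=1))
    return averaged
```

When the inputs already live on the GPU, the stacking and reduction can run there as well. How the result would be written back into the partners' models (for instance with `tf.Variable.assign`) is left open here, since it depends on how mplc stores its models.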

To get info on GPU usage I used to work with gpustat.

Some profiling was already done here:

[image: profiling results]

While experimenting with tf.function and tf.data.Dataset, I ran an equivalent of FedAvg on full MNIST (10 epochs, 8 minibatches, 3 partners) in 70 s, against 300 s with the regular mplc implementation (V100, Google Colab, TF 2.4.0).

I think it is worth digging deeper in that direction, even if the TensorFlow low-level API is tricky to use.