Model split and GPU memory
The mpi-caffe CIFAR10 example doesn't seem to split the AlexNet model between multiple GPUs (I haven't looked in detail at examples/cifar10-mpi/cifar10_mpi_train_test.prototxt). Below is the nvidia-smi output during Caffe's training on the CIFAR10 example, followed by the same training run under mpi-caffe. Judging by the memory used on GPU 0, the entire model (~220 MB) appears to be hosted on GPU 0 when using mpi-caffe. Could you provide a modified version of examples/cifar10-mpi/cifar10_mpi_train_test.prototxt in which the model is actually split between three GPUs?
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:83:00.0 Off | 0 |
| N/A 52C P0 146W / 149W | 269MiB / 11519MiB | 95% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 35C P8 32W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:87:00.0 Off | 0 |
| N/A 38C P8 26W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 35C P8 29W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16473 C ./build/tools/caffe 212MiB |
+-----------------------------------------------------------------------------+
And here is the output for the mpi-caffe CIFAR10 example:
+------------------------------------------------------+
| NVIDIA-SMI 352.99 Driver Version: 352.99 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:83:00.0 Off | 0 |
| N/A 49C P0 128W / 149W | 258MiB / 11519MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 40C P0 85W / 149W | 175MiB / 11519MiB | 32% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:87:00.0 Off | Off |
| N/A 45C P0 71W / 149W | 176MiB / 12287MiB | 34% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:88:00.0 Off | Off |
| N/A 38C P0 73W / 149W | 56MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Please see http://homes.soic.indiana.edu/steflee/mpi-caffe.html for a full description of the cifar10-mpi example. In short, this example replicates the model across all GPUs and combines their outputs; it does not partition the model.
To split a single-path model across multiple GPUs, you would instead use MPIBroadcast layers with communication groups containing only the source GPU (i.e., the one the preceding layers are assigned to) and the next GPU (i.e., the one that receives the output). The MPIBroadcast output on the source GPU will need to be fed into a Silence layer, since it is not consumed by any further layers there.
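A rough sketch of what this might look like in prototxt form is below. This is not taken from the repository: the parameter block and field names (mpi_param, comm, root) are hypothetical placeholders for whatever mpi-caffe's actual MPIBroadcast configuration looks like, and the blob/layer names are illustrative. Only the layer types MPIBroadcast and Silence, and the overall pattern (pairwise communication group, silencing the broadcast output on the source rank), come from the explanation above.

# Hypothetical sketch of handing off activations from GPU 0 to GPU 1.
# Assumes GPU 0 runs the layers up through "pool1".
layer {
  name: "pool1_handoff"
  type: "MPIBroadcast"
  bottom: "pool1"
  top: "pool1_remote"
  # Hypothetical parameter names: a communication group containing
  # only the source GPU (rank 0) and the receiving GPU (rank 1).
  mpi_param {
    comm: 0
    comm: 1
    root: 0
  }
}

# On the source GPU, the broadcast output has no consumer,
# so it must be fed into a Silence layer (a standard Caffe layer
# that discards its inputs).
layer {
  name: "silence_pool1_remote"
  type: "Silence"
  bottom: "pool1_remote"
}

# Subsequent layers (assigned to GPU 1) would then take
# "pool1_remote" as their bottom blob.

Repeating this pattern at each partition boundary (GPU 1 → GPU 2, and so on) would give a three-way model split; the actual field names should be checked against mpi-caffe's caffe.proto.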