mllam/neural-lam

Multi-GPU training

Closed this issue · 1 comment

I realized that multi-GPU training is currently broken. Luckily I believe this should be a simple fix: just making sure that logging and the storage of tensors in model classes conform properly to the Lightning setup.
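
For context, the kind of change this implies is (a) registering constant tensors as buffers so Lightning moves them to each device, and (b) logging with cross-process reduction. Below is a minimal sketch, assuming a standard pytorch_lightning.LightningModule; the class and helper names are illustrative, not the actual neural-lam code.

```python
import torch
import pytorch_lightning as pl


class ForecasterModule(pl.LightningModule):  # hypothetical model class
    def __init__(self, static_features: torch.Tensor):
        super().__init__()
        # Store constant tensors as buffers rather than plain attributes,
        # so Lightning/DDP moves them to the correct device in every process.
        self.register_buffer("static_features", static_features, persistent=False)

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss helper
        # sync_dist=True reduces the logged value across processes,
        # so the metric is correct under multi-GPU (DDP) training.
        self.log("train_loss", loss, sync_dist=True)
        return loss
```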

Not as simple a fix as I originally thought, but this is fixed with commit 89a4c63. The implementation should now work on CPU, single-GPU and multi-GPU setups.

A couple things to keep in mind from this fix:

  • Evaluation should not be run on multiple devices. This can be fixed once Lightning handles this properly with the DistributedSampler.
  • I had to make my own BufferList for the graph tensors (see the sketch below). This should be switched to a native torch BufferList once one is implemented.
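
As a rough illustration of the second point, a custom BufferList can be a small nn.Module that registers each tensor via register_buffer, so the graph tensors follow the model across devices without becoming trainable parameters. This is only a sketch of that pattern, not necessarily the implementation in commit 89a4c63.

```python
import torch
from torch import nn


class BufferList(nn.Module):
    """Holds a list of tensors as buffers, indexable like a list."""

    def __init__(self, tensors, persistent: bool = False):
        super().__init__()
        tensors = list(tensors)
        for i, tensor in enumerate(tensors):
            # register_buffer makes each tensor move with the module
            # (e.g. under .to(device) or DDP) without being a parameter
            self.register_buffer(f"b{i}", tensor, persistent=persistent)
        self.n_buffers = len(tensors)

    def __len__(self):
        return self.n_buffers

    def __getitem__(self, idx):
        return getattr(self, f"b{idx}")

    def __iter__(self):
        return (self[i] for i in range(len(self)))
```

Inside a LightningModule this could then be used as, e.g., self.graph_edges = BufferList(edge_index_list), and the contained tensors would be moved to the right device together with the rest of the module.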