mllam/neural-lam

Multi-GPU training

Closed this issue · 1 comment

I realized that multi-GPU training is currently broken. Luckily I believe this should be a simple fix: just making sure that logging and the storage of tensors in model classes conform properly to the Lightning setup.
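
For context, the kind of change this implies is (a) registering constant tensors as buffers so Lightning moves them to each device, and (b) logging with cross-process reduction. Below is a minimal sketch, assuming a standard pytorch_lightning.LightningModule; the class and helper names are illustrative, not the actual neural-lam code.

```python
import torch
import pytorch_lightning as pl


class ForecasterModule(pl.LightningModule):  # hypothetical model class
    def __init__(self, static_features: torch.Tensor):
        super().__init__()
        # Store constant tensors as buffers rather than plain attributes,
        # so Lightning/DDP moves them to the correct device in every process.
        self.register_buffer("static_features", static_features, persistent=False)

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss helper
        # sync_dist=True reduces the logged value across processes,
        # so the metric is correct under multi-GPU (DDP) training.
        self.log("train_loss", loss, sync_dist=True)
        return loss
```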

Not as simple a fix as I originally thought, but this is fixed with commit 89a4c63. The implementation should now work on CPU, single-GPU and multi-GPU setups.

A couple things to keep in mind from this fix:

  • Evaluation should not be run on multiple devices. This can be fixed once Lightning handles this properly with the DistributedSampler.
  • I had to make my own BufferList for the graph tensors (see the sketch below). This should be switched to a native torch BufferList once one is implemented.
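
As a rough illustration of the second point, a custom BufferList can be a small nn.Module that registers each tensor via register_buffer, so the graph tensors follow the model across devices without becoming trainable parameters. This is only a sketch of that pattern, not necessarily the implementation in commit 89a4c63.

```python
import torch
from torch import nn


class BufferList(nn.Module):
    """Holds a list of tensors as buffers, indexable like a list."""

    def __init__(self, tensors, persistent: bool = False):
        super().__init__()
        tensors = list(tensors)
        for i, tensor in enumerate(tensors):
            # register_buffer makes each tensor move with the module
            # (e.g. under .to(device) or DDP) without being a parameter
            self.register_buffer(f"b{i}", tensor, persistent=persistent)
        self.n_buffers = len(tensors)

    def __len__(self):
        return self.n_buffers

    def __getitem__(self, idx):
        return getattr(self, f"b{idx}")

    def __iter__(self):
        return (self[i] for i in range(len(self)))
```

Inside a LightningModule this could then be used as, e.g., self.graph_edges = BufferList(edge_index_list), and the contained tensors would be moved to the right device together with the rest of the module.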