dptech-corp/Uni-Core

EMA's param and new_param on different devices when using multiple GPUs

lucifer1004 opened this issue · 2 comments

I was training Uni-Mol using Uni-Core on multiple GPUs (one node), and I ran into the following error:

    diff = self.param - new_param
           ~~~~~~~~~~~^~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

(The same traceback is raised by every worker; the interleaved console output shows the error on cuda:0 through cuda:7.)

The direct cause is clear: the line

    diff = self.param - new_param

assumes that self.param and new_param are on the same device, but they are not.

A workaround is to manually move the two tensors onto the same device in the update() function. However, that might hide the root cause, which is worth digging into.
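For reference, here is a minimal sketch of that workaround, assuming a simplified EMA class that keeps a flattened copy of the weights in self.param and receives the model's current flattened weights as new_param (the class name, decay value, and structure are illustrative, not Uni-Core's actual ema.py):

    import torch

    class SimpleEMA:
        """Illustrative EMA holder; not Uni-Core's actual implementation."""

        def __init__(self, param: torch.Tensor, decay: float = 0.999):
            # If `param` is built on the CPU, it stays there until someone
            # moves it -- which is what the traceback above is complaining about.
            self.param = param
            self.decay = decay

        @torch.no_grad()
        def update(self, new_param: torch.Tensor) -> None:
            # Workaround: move the EMA copy onto the device of the incoming
            # parameters before subtracting. This silences the RuntimeError
            # but does not explain why self.param was left on the CPU.
            if self.param.device != new_param.device:
                self.param = self.param.to(new_param.device)
            diff = self.param - new_param
            self.param -= (1.0 - self.decay) * diff

With that guard in place the subtraction no longer mixes cuda:N and cpu tensors, but the EMA weights silently follow whichever device update() last saw, so the question of why self.param starts out on the CPU remains open.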

I encountered the same problem when training UniMol+.

My Solution:

I fixed this problem by moving the initialization of self.param to CUDA within the flatten_parameters method in the ema.py file.
Change

    flatten_param = torch.nn.Parameter(flatten_param)

to

    flatten_param = torch.nn.Parameter(flatten_param).cuda()
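For context, here is a minimal sketch of where that change sits, assuming flatten_parameters concatenates the EMA state dict into one flat buffer (the surrounding function body is an assumption for illustration, not Uni-Core's actual code):

    import torch

    def flatten_parameters(state_dict):
        # Illustrative flatten step (assumed structure, not the real
        # Uni-Core implementation): concatenate every tensor in the EMA
        # state dict into one flat buffer.
        flatten_param = torch.cat([t.reshape(-1) for t in state_dict.values()])
        # Before the fix, the Parameter inherits the CPU placement of the
        # state dict, so update() later subtracts a CUDA new_param from a
        # CPU self.param:
        #   flatten_param = torch.nn.Parameter(flatten_param)
        # After the fix, the flattened EMA copy lives on the GPU:
        flatten_param = torch.nn.Parameter(flatten_param).cuda()
        return flatten_param

Hard-coding .cuda() assumes training always runs on the current CUDA device; something like .to(device), with the device taken from the model parameters or the trainer, would also cover CPU runs and make the intended placement explicit.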

However, I'm not entirely certain if this is the most appropriate way to address the issue. It would be helpful to get feedback from the maintainers or the community to ensure that this fix is correct and doesn't introduce any unintended consequences.