Multi-GPU training

Question

Closed this issue 7 months ago · 5 comments

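First of all, thank you very much for providing the training code!
I am training the model on multiple GPUs and ran into a problem when calling sample_farthest_points that I would like to ask about.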

from pytorch3d.ops import sample_farthest_points
....
center, _ = sample_farthest_points(xyz, K=self.num_group) # [B, npoint, 3] [B, npoint]

The model raises the following error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
....
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

With single-GPU training the model runs without problems, but multi-GPU training seems to trigger the error above.
Do you train on multiple GPUs?

Answer 1 · 2024-04-20T09:09:42.000Z

Hello, yes, I train on multiple GPUs.

Answer 2 · 2024-04-20T09:42:21.000Z

Hello, did you ever run into the problem I described above?

Answer 3 · 2024-04-20T10:50:37.000Z

I never ran into this problem when running it. Maybe you could try reducing the batch size a bit.

Answer 4 · 2024-04-21T07:53:03.000Z

OK, thank you. I'll give it a try.

Answer 5 · 2024-04-23T06:25:27.000Z

I solved this problem: it was caused by my batch size not being an integer multiple of the number of GPUs. Thanks.
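
For anyone hitting the same error, here is a minimal sketch of the divisibility check. It assumes the batch is split across replicas in torch.chunk style (ceil-sized chunks along the batch dimension), which is how I understand nn.DataParallel to scatter its inputs. The helper names split_sizes and round_down_to_multiple are hypothetical and not part of this repository or of PyTorch.

```python
from typing import List


def split_sizes(batch_size: int, num_gpus: int) -> List[int]:
    """Per-replica batch sizes under an assumed torch.chunk-style split."""
    if batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch_size and num_gpus must be positive")
    chunk = -(-batch_size // num_gpus)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def round_down_to_multiple(batch_size: int, num_gpus: int) -> int:
    """Largest batch size <= batch_size that divides evenly across num_gpus."""
    rounded = (batch_size // num_gpus) * num_gpus
    if rounded == 0:
        raise ValueError("batch_size is smaller than num_gpus")
    return rounded


if __name__ == "__main__":
    print(split_sizes(10, 4))             # [3, 3, 3, 1]: uneven replicas
    print(split_sizes(8, 4))              # [2, 2, 2, 2]: even replicas
    print(round_down_to_multiple(10, 4))  # 8
```

The last batch of an epoch can also be smaller than the configured batch size. Passing drop_last=True to torch.utils.data.DataLoader discards that incomplete batch, which may help if the error only shows up at the end of an epoch.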