GPU memory leak
m-pilia opened this issue · 12 comments
I ran two batches of registrations overnight (scripted with the Python API) on two different machines, and both crashed after about 30 registrations due to OOM on the device. I just launched one batch again, and it's leaking around 290 MiB of device memory for each registration.
Oh, that's bad. I'll take a look. Probably some weird thing I didn't consider with constructors/destructors on CUDA.
I have been taking a look at this this morning. 290 MiB ≈ 302 MB is the total size of the fixed and moving pyramids when registering two POEM volumes using two channels. A run with cuda-memcheck confirms that the leak is coming from `GpuRegistrationEngine`, mostly from `set_image_pair`. However, it seems the mask pyramid is not leaking, so it is probably something wrong with the vector of pyramids. I am trying to dig into this...
Looking into it, it seems like there's some `GpuVolume` not being released somewhere. There are two dangling `GpuVolumeData`s that cause the leak.
That, plus the two `thrust::device_vector` instances in the GPU landmark cost function.
A fix is to manually call the destructor of the image pyramids within the destructor of `GpuRegistrationEngine`, but I would like to understand why it is not called automatically. I was thinking that maybe the `shared_ptr` of the `_volume_data` gets captured somewhere by accident.
The destructor of `GpuVolumePyramid` seems to be invoked, so calling it again is probably just cleaning up somebody else's mess.
I have found the problem. I think we were pretty much misusing `unique_ptr` in `GpuUnaryFunction`; replacing it with a `shared_ptr` seems to solve all the issues.
Hm, but how is it misused? It should be unique, right? Otherwise we would have the same problem on CPU.
Ohh, `GpuSubFunction` has no destructor!
Oh, that makes sense!
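Concretely, what seems to go wrong: when a derived cost function is destroyed through a base-class pointer whose destructor is not virtual, the derived destructor never runs, so members like a `thrust::device_vector` never release their device memory. A minimal sketch of that failure mode (illustrative class names, not the actual deform types; needs nvcc because of Thrust):

```cpp
#include <memory>
#include <thrust/device_vector.h>

struct SubFunction {
    // No destructor declared, so the implicit one is non-virtual.
};

struct LandmarkFunction : public SubFunction {
    // Owns ~4 MiB of device memory.
    thrust::device_vector<float> displacements =
        thrust::device_vector<float>(1 << 20);
};

int main()
{
    // Destroying a LandmarkFunction through a SubFunction* is undefined
    // behaviour when the base destructor is not virtual; in practice only
    // ~SubFunction() runs, so the device_vector is never freed.
    std::unique_ptr<SubFunction> fn = std::make_unique<LandmarkFunction>();
    fn.reset(); // the device allocation leaks here
    return 0;
}
```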
Weird, shouldn't we have had the same problem with a `shared_ptr` too?
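For what it's worth, that is most likely because `std::shared_ptr` type-erases its deleter: the control block created by `make_shared<Derived>()` (or by constructing from a `Derived*`) destroys the object as a `Derived`, so the derived destructor runs even though the base one is not virtual. `std::unique_ptr<Base>` just does `delete` on a `Base*`, which is where the leak (and the undefined behaviour) comes from. A self-contained sketch:

```cpp
#include <cstdio>
#include <memory>

struct Base {
    ~Base() { std::puts("~Base"); }       // not virtual
};

struct Derived : public Base {
    ~Derived() { std::puts("~Derived"); } // this is what would free GPU resources
};

int main()
{
    {
        // The control block remembers the object is a Derived, so both
        // destructors run despite the non-virtual base destructor.
        std::shared_ptr<Base> p = std::make_shared<Derived>();
    } // prints "~Derived" then "~Base"

    {
        // default_delete<Base> does `delete Base*`: undefined behaviour, and
        // in practice only ~Base() runs, which matches the leak in this issue.
        std::unique_ptr<Base> p = std::make_unique<Derived>();
    } // typically prints only "~Base"
    return 0;
}
```

Note that this only works because the `shared_ptr` is created from the derived type; a virtual destructor in the base class is still the more robust fix.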
As a side note, we should probably have a pure virtual destructor for this kind of base class.
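For reference, a sketch of that idiom, with the rest of the interface elided (a plain `virtual ~GpuSubFunction() = default;` would also make deletion through the base pointer well defined; the pure virtual form additionally keeps the base class abstract):

```cpp
struct GpuSubFunction {
    virtual ~GpuSubFunction() = 0;
    // ... rest of the cost-function interface ...
};

// A pure virtual destructor still needs a definition, because every derived
// destructor implicitly calls it.
inline GpuSubFunction::~GpuSubFunction() = default;
```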