CERN/TIGRE

FDK MultiGPU

epapoutsellis opened this issue · 16 comments

Hello,

I have a problem running FDK with 2 GPUs. I also posted in the discussions.

If I run the following lines

gpuids = gpu.getGpuIds()
img = tigre.algorithms.fdk(stack_img, geo, angles, gpuids = gpuids)

it does not look (from e.g. nvidia-smi, Task Manager) like both of my GPUs are being utilized.

OS : Windows
Python version : 3.9
CUDA : 11.0

Actually, most of the time I get a CUDA memory error, and I believe 2x48 GB should be sufficient for my large data. Also, the 2 GPUs work fine if I use other libraries (CuPy, PyTorch, etc.).

Any ideas?

Hi Vaggelis!
That's interesting... By default TIGRE uses all available GPUs, even if you don't pass the gpuids parameter, so you should be seeing usage on all of them.
But it could be something else: FDK in TIGRE is not very optimized; in particular, the filtering is done on the CPU. Could it be that the GPU part, the backprojection, is simply too fast to notice?
Let's first make sure this is not a measurement error: can you try an iterative method?

That is of course for the case when it doesn't error. Can you share the error? Again, TIGRE in theory doesn't care about GPU size: it will split the problem into as many pieces as needed to fit inside the GPU, whatever its size.

Knowing the error may let me dig deeper. Can you also give me the GPU models and the geometry of the problem?
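As a rough illustration of what that splitting implies (purely back-of-the-envelope, assuming float32 and ignoring per-chunk overhead; this is not TIGRE's actual partitioning logic, just the scale of it):

```python
import math

BYTES_PER_VOXEL = 4  # float32
GIB = 2**30

# Hypothetical example: a 4096^3 volume on a 48 GiB card.
volume_bytes = 4096**3 * BYTES_PER_VOXEL  # 2^38 bytes = 256 GiB
vram_bytes = 48 * GIB

# Lower bound on the number of chunks one GPU must process.
chunks = math.ceil(volume_bytes / vram_bytes)
print(f"volume: {volume_bytes / GIB:.0f} GiB -> at least {chunks} chunks per GPU")
```

So even a volume several times larger than VRAM only means a handful of sequential chunks per GPU, which is why the GPU size itself should not cause an out-of-memory error.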

Hi Ander, thanks for the quick reply.

I ran d20_Algorithms_05.py and I could see that both GPUs were used. I think you are right: in FDK the GPU part is really fast, and I cannot see much GPU activity.

Is there any plan to replace the filtering step with CuPy, or maybe even better, cuNumeric?
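For context, the filtering in question is essentially a per-projection ramp filter in frequency space. A sketch of the idea (a hypothetical `ramp_filter_projections` helper, not TIGRE's actual filter, which also applies geometry-dependent weighting):

```python
import numpy as np

def ramp_filter_projections(proj):
    """Apply a ramp (Ram-Lak) filter along the detector-u axis of each
    projection. `proj` has shape (n_angles, n_v, n_u)."""
    n_u = proj.shape[-1]
    # Zero-pad to the next power of two (at least 2*n_u) to reduce
    # wrap-around artifacts from the circular FFT convolution.
    n_pad = 2 ** int(np.ceil(np.log2(2 * n_u)))
    freqs = np.fft.fftfreq(n_pad)  # cycles/sample, in [-0.5, 0.5)
    ramp = np.abs(freqs)           # |f|: the ramp filter
    spectrum = np.fft.fft(proj, n=n_pad, axis=-1)
    filtered = np.real(np.fft.ifft(spectrum * ramp, axis=-1))
    return filtered[..., :n_u]     # drop the padding
```

Since CuPy mirrors the NumPy API, swapping `import numpy as np` for `import cupy as np` would in principle run the FFTs on the GPU, at the cost of a host-to-device transfer of the projections.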

However, in some cases my JupyterLab kernel dies with the following error:

../Common/CUDA/TIGRE_common.cpp (7): Error pinning memory
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb invalid argument

Geo info (1000 angles):

TIGRE parameters
-----
Geometry parameters
Distance from source to detector (DSD) = 842.178 mm
Distance from source to origin (DSO)= 327.209 mm
-----
Detector parameters
Number of pixels (nDetector) = [4096 4096]
Size of each pixel (dDetector) = [0.1 0.1] mm
Total size of the detector (sDetector) = [409.6 409.6] mm
-----
Image parameters
Number of voxels (nVoxel) = [4096 4096 4096]
Total size of the image (sVoxel) = [159.14071182 159.14071182 159.14071182] mm
Size of each voxel (dVoxel) = [0.03885271 0.03885271 0.03885271] mm
-----
-----
Auxillary parameters
Samples per pixel of forward projection (accuracy) = 0.5
-----
Rotation of the Detector (rotDetector) = [0.         0.         3.14159265] rad
-----
Centre of rotation correction (COR) = 0 mm

GPUs: 2x 48 GB Quadro RTX 8000
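For scale, the raw float32 footprint of this geometry works out as follows (a back-of-the-envelope estimate, ignoring any intermediate buffers):

```python
BYTES = 4  # float32
GIB = 2**30

volume = 4096**3 * BYTES                  # nVoxel from the geometry above
projections = 1000 * 4096 * 4096 * BYTES  # 1000 angles x nDetector

print(f"volume:      {volume / GIB:.1f} GiB")       # 256.0 GiB
print(f"projections: {projections / GIB:.1f} GiB")  # 62.5 GiB
```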

Hey @epapoutsellis !
Ah, glad it was that.
About filtering: I certainly want to have a GPU implementation of that; it should be easy to make and should speed up the process quite a lot, but I just don't have the time. So there is a plan, in the sense that I want to do it, but it will very likely not happen anytime soon because I have too many things on my plate! If you or someone you know decides to implement it, I'd be very happy to add it to TIGRE.

About the other error: weird. But: are you using the latest "stable" release, or the latest commit on master? There is a bug in the latest stable that may be the cause, but up-to-date master should not have it anymore.

Hi @AnderBiguri, I have installed everything from master but I still get this error

../Common/CUDA/TIGRE_common.cpp (7): Error pinning memory
../Common/CUDA/TIGRE_common.cpp (14): CBCT:CUDA:Atb out of memory
[I 2022-11-18 18:56:44.709 ServerApp] AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports

Hi @epapoutsellis
It seems it is complaining that you don't have enough RAM on the CPU. Could this be the case?

I thought that it was a GPU error. I have reduced the number of projections and still get this error. I am using TIGRE via CIL.

Input Data:
angle: 96
vertical: 3096
horizontal: 3096

Reconstruction Volume:
vertical: 3096
horizontal_y: 3096
horizontal_x: 3096

I'll have a further look tomorrow, but in TIGRE, the GPU allocates CPU memory for the backprojector. It's pinned memory, meaning the GPU has full asynchronous write access, which is why the GPU needs to allocate it. How much CPU RAM do you have? Because that volume is indeed quite big!

Apologies for the late reply.

CPU: 351 GB
GPU: 2x 48 GB Quadro RTX 8000

Hum, the image itself takes about 110 GB, so if you don't have many other things in RAM, this should indeed work.

I need to look at this in more detail. I'm not entirely sure what's wrong, but I also can't test it myself because I don't have access to a machine with that much RAM right now.

I did recently fix an issue where the amount of memory needed was overestimated and thus it would crash, so I wonder if I didn't fix this bug completely....

If you have MATLAB access in that machine, could you try with the MATLAB version? Just to try to pinpoint the issue. Otherwise no worries.

It's worth adding that if this is using TIGRE through CIL (speculating here!), then there will be additional copies in memory. We currently pass data back and forth to TIGRE by copying (it's very sad!), so that's an extra copy of the data and the reconstruction. Then, so that we don't corrupt the original data, filtering adds another copy. I would guess that would still be OK, but it adds up fast.

@gfardell @epapoutsellis that may be the cause of the issue. If at some point there are 3 copies, you are already borderline on RAM usage, and it depends on how you have the data loaded (i.e. do you load it all and then slice? This could be an issue).
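Checking that arithmetic (assuming float32 and the 3096^3 volume reported above; the factor of three is the speculated copy count from the previous comment, not a measured figure):

```python
GIB = 2**30

one_copy = 3096**3 * 4 / GIB  # float32 volume
three_copies = 3 * one_copy   # data + TIGRE copy + filtering copy

print(f"one copy:     {one_copy:.1f} GiB")      # ~110.6 GiB
print(f"three copies: {three_copies:.1f} GiB")  # ~331.7 GiB, vs ~351 GB of RAM
```

Three copies land within about 20 GB of the reported RAM, so with the OS and any other loaded data on top, "borderline" looks right.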

Also, I know CIL doesn't necessarily build against master. Vaggelis, you did comment that you are using master, so this should not be the issue; but if for some reason you are using the latest tagged release, the Python code does have a bug where it overestimates the memory, as said before, and crashes with out of memory.

I have tested both pure TIGRE and CIL+TIGRE and receive the same error.
Using CIL+TIGRE I have reconstructed 13 datasets with a recon volume close to 2K^3, but for a recon volume close to 3K^3 it does not work. If I use the ASTRA backend I get some strange artifacts. For the installation, I installed the latest TIGRE master.

@epapoutsellis Hum, there may be some issue with how we handle the sizes internally. I will try to find a way to create such sizes and test the numbers internally. I'll keep you updated.

thank you @AnderBiguri for your help!!!

Hi @epapoutsellis, sorry for the long delay. I can't reproduce this, except by genuinely running out of memory. Any news from your side? Did you manage to make it work?

@epapoutsellis @gfardell I will close this issue, as I cannot reproduce it now.

I have updated TIGRE (latest commit on master) to allow disabling pinned memory, thereby allowing systems with swap memory activated to use TIGRE (before, this was not possible). I suspect this could have been the original problem.