CERN/TIGRE

Why doesn't the TIGRE module use the GPU at 100%?

TianSong1991 opened this issue · 3 comments

Expected Behavior

I tested the Python demo d07_Algorithms02.py and also ran it on my own data. I expected the FDK and SART algorithms to use the GPU at 100%, but I only see 5%-7% GPU utilization.

Actual Behavior

Even when I pass gpuids to TIGRE it doesn't help: GPU utilization stays at 5%-7%.

Code to reproduce the problem (If applicable)

import tigre.algorithms as algs
from tigre.utilities import gpu

# List the available GPUs and select all devices matching the first name found.
listGpuNames = gpu.getGpuNames()
if len(listGpuNames) == 0:
    print("Error: No gpu found")
else:
    for id in range(len(listGpuNames)):
        print("{}: {}".format(id, listGpuNames[id]))

gpuids = gpu.getGpuIds(listGpuNames[0])
print(gpuids)

# Run 2 iterations of SIRT and SART on the selected GPUs.
imgSIRT, qualitySIRT = algs.sirt(
    projections,
    geo,
    angles,
    2,
    lmbda=lmbda,
    lmbda_red=lambdared,
    verbose=verbose,
    Quameasopts=qualmeas,
    computel2=True,
    gpuids=gpuids,
)
imgSART, qualitySART = algs.sart(
    projections,
    geo,
    angles,
    2,
    lmbda=lmbda,
    lmbda_red=lambdared,
    verbose=verbose,
    Quameasopts=qualmeas,
    computel2=True,
    gpuids=gpuids,
)
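
Note that the snippet above assumes projections, geo, angles and the hyper-parameters are already defined. A minimal setup in the spirit of d07_Algorithms02.py (parameter values are illustrative, not necessarily the demo's) could look like this:

import numpy as np

import tigre
from tigre.utilities import sample_loader

# Default cone-beam geometry and a synthetic head phantom, as in the TIGRE demos.
geo = tigre.geometry_default(high_resolution=False)
angles = np.linspace(0, 2 * np.pi, 100)
head = sample_loader.load_head_phantom(geo.nVoxel)
projections = tigre.Ax(head, geo, angles)

# Hyper-parameters used by the sirt/sart calls above (illustrative values).
lmbda = 1.0
lambdared = 0.999
verbose = True
qualmeas = ["RMSE"]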

Specifications

  • MATLAB/Python version: Python
  • OS: Windows 10
  • CUDA version: 11.7

However, when the CIL module uses TIGRE as a backend to run FDK reconstruction, it does use the GPU at 100%. I tested https://github.com/TomographicImaging/CIL-Demos/blob/main/demos/1_Introduction/01_intro_walnut_conebeam.ipynb and I am confused: calling TIGRE's FDK through CIL utilizes the GPU at 100%, but using the TIGRE module directly only reaches 5%-7%.

Short answer: don't worry, when TIGRE is using the GPU it uses 100% of the compute. It is your measurement and your expectations that are off.

Long answer: this is a combination of several things in what you are measuring. As said above, when TIGRE is using the GPU it will use 100% of the compute. However, a few things affect what you see:

1- The algorithm you use. Because of TIGRE's modular design, which lets it support many different algorithms, not all of them are as efficient as they could be. In particular, the algorithms themselves do not run on the GPU: writing fully-CUDA algorithms is a long and arduous task, and it would make TIGRE's "larger than memory" reconstruction feature almost impossible to support. Instead, in TIGRE the forward and backprojection operations run on the GPU. These are 99% of the compute time anyway, so in general you should see mostly GPU usage at the top. However, some algorithms (SART in particular) update the image projection by projection, and therefore run many very small forward and backprojections per iteration. This means the GPU is busy only in short bursts and there is a lot of overhead from memory being passed in and out of the GPU all the time.
I suspect that if you use SIRT you will measure close to 100% of compute.
FDK in TIGRE uses CPU filtering (because we have not found it to be much slower than the GPU version, see work on #423), so most of the time is not spent on the GPU anyway.

2- Your measurement is flawed. When you measure compute usage (e.g. with nvidia-smi) it averages over some sampling interval. So if your data is small and uses 100% of the GPU for 10 ms, but nvidia-smi reports every 100 ms, you will see 10% utilization. I can assure you that when TIGRE is using the GPU it uses 100% of the resources; TIGRE is at the cutting edge of speed for forward/backprojection. So when you don't see 100%, it is because TIGRE is too fast for you to measure, which should be a good thing!
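
One way to check this yourself is to time a single forward and backprojection: if each call takes only a few tens of milliseconds, a coarse sampler will rarely catch the 100% burst. A minimal sketch, assuming the projections, geo, angles and gpuids variables from the snippet above are defined (passing gpuids to Ax/Atb is optional):

import time

import tigre

# Time one backprojection and one forward projection; these are the
# operations TIGRE actually runs on the GPU.
t0 = time.perf_counter()
vol = tigre.Atb(projections, geo, angles, gpuids=gpuids)
t1 = time.perf_counter()
proj = tigre.Ax(vol, geo, angles, gpuids=gpuids)
t2 = time.perf_counter()

print("Atb (backprojection):     {:.3f} s".format(t1 - t0))
print("Ax  (forward projection): {:.3f} s".format(t2 - t1))

If these times are short compared with the refresh interval of your monitor (Task Manager, or plain nvidia-smi), the reported utilization will be low even though each call saturates the GPU; sampling faster (for example with nvidia-smi -lms 100, if your driver supports it) gives a more honest picture.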

I have worked with the CIL people to get the TIGRE projectors in there, and I can assure you there is nothing inherently different in the way they use them compared with how TIGRE itself uses them. For FDK, as said, they use GPU filtering, which we are working on adding to TIGRE too, but again, it is not significantly faster.

How could you make it 100% all the time? By recoding the algorithms to be 100% GPU-bound. For e.g. FDK or SIRT you won't gain much time, but for e.g. SART you can speed up the algorithm by 10x or so, just because the memory ping-pong between CPU and GPU takes a long time. But this recoding is quite time consuming; only do it if it is a critical factor. It took me about a month of work to code a GPU-only ASD-POCS for a project I worked on.
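
Short of recoding an algorithm in CUDA, a middle ground already available in TIGRE is OS-SART, which updates the image with blocks of projections rather than one projection at a time (SART) or all at once (SIRT), so each GPU call is larger and there are fewer CPU/GPU transfers per iteration. A sketch, assuming the same variables as in the snippet above and that your TIGRE version exposes the blocksize keyword as in the d07 demo:

import tigre.algorithms as algs

# blocksize controls how many projections are used per image update:
# 1 behaves like SART, len(angles) behaves like SIRT.
imgOSSART = algs.ossart(
    projections,
    geo,
    angles,
    20,              # number of iterations (illustrative)
    blocksize=20,    # projections per update; larger means fewer, bigger GPU calls
    gpuids=gpuids,
)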

@AnderBiguri Thank you very much for your quick reply. Both TIGRE and CIL are doing great work, thank you! I will continue to explore reconstruction algorithms.