CERN/TIGRE

Get struck when change parallel to cone in single-GPU

Opened this issue · 27 comments

I can run iterative algorithms such as ossart 2D parallel beam CT reconstruction smooth on ubuntu. But when it comes for cone-beam( for 2d it is fanbeam) The running just gets stuck and prints nothing, even the estimation time.
This is the code:

import tigre
import numpy as np
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from scipy.io import loadmat
import matplotlib.pyplot as plt
import torch

geo = tigre.geometry()
# VARIABLE                                   DESCRIPTION                    UNITS
# -------------------------------------------------------------------------------------
# Distances
geo.DSD = 1536  # Distance Source Detector      (mm)
geo.DSO = 1000 # Distance Source Origin        (mm)
# Image parameters
geo.nVoxel = np.array([1, 256, 256])  # number of voxels              (vx)
geo.sVoxel = np.array([1, 256, 256])  # total size of the image       (mm)
geo.dVoxel = geo.sVoxel / geo.nVoxel  # size of each voxel            (mm)
print(geo.dVoxel)
# Detector parameters
geo.nDetector = np.array([1, 1512])  # number of pixels              (px)
geo.dDetector = np.array([geo.dVoxel[0], 0.8])  # size of each pixel            (mm)
# geo.dDetector = np.array([1, 0.1])
geo.sDetector = geo.nDetector * geo.dDetector  # total size of the detector    (mm)
# Offsets
geo.offOrigin = np.array([0, 0, 0])  # Offset of image from origin   (mm)
geo.offDetector = np.array([0, 0])  # Offset of Detector            (mm)
# MAKE SURE THAT THE DETECTOR PIXELS SIZE IN V IS THE SAME AS THE IMAGE!

geo.mode = "cone"

#%% Define angles of projection and load phatom image

angles = np.linspace(0, 2 * np.pi, 180)


img = sample_loader.load_head_phantom(geo.nVoxel)
limited_angle = np.linspace(0, 2*np.pi, 90)
limited_projection = tigre.Ax(img, geo, limited_angle)
tigre.plotSinogram(limited_projection, 0)

niter=500
print("start ossart_tv")
rec_img_ossart_tv = algs.ossart_tv(limited_projection, geo, limited_angle, niter)
plt.imshow(rec_img_ossart_tv.squeeze(), cmap='gray')
plt.savefig('ossart_tv_90.png', bbox_inches='tight', pad_inches=0)
plt.axis('off')
  • python version: 3.9
  • OS:Ubuntu 18.04.6 LTS
  • CUDA version:12.3

That is strange! if you do it for 3D, does it work well?
Can you try to figure out where does it get stuck?

Probably related to #552

I try to make a breakpoint and it got stuck in this column:

194        self.set_w()

of iterative_recon_alg.py

@GreameLee and inside there, do you know where?

@AnderBiguri yes, I step inside and it is in

222        W = Ax(
223            np.ones(geox.nVoxel, dtype=np.float32), geox, self.angles, "Siddon", gpuids=self.gpuids
224        )

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

Do you mean that replace all geox with geo in iterative_recon_alg.py? Cause there is an assignment about geox before this line:

216     geo = copy.deepcopy(self.geo)

I think geox is geo already

Yes, in particular in the set_w function.

I meant:

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

Yes, I tried as

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

it but still got struck

Hum, confusing. I don't really know why it happens then.... I will investigate, but its hard, as I can not make it happen in any of my 5 computers.

I go deep inside that step and go into the Ax function in utilities, and find it got struck in gpu.py with this line:

31   def __len__(self):
32           return len(self.devices)

Hum that is strange! What if you change it to return your_number_of_gpus , however many those are?

yes, I edit it as :

def __len__(self):
        return 1

But it still get struck in following line of Ax.py:

35  return _Ax_ext(img, geox, geox.angles, projection_type, geox.mode, gpuids=gpuids)

can I ask what is the "_Ax_ext" function? I see that

from _Ax import _Ax_ext

But I did not see any function named _Ax_ext
The strange thing is that before ossarttv, there is tigre.Ax using Ax function and that line goes smooth.

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

I think this may happen on supercomputers based on Ubuntu. I tried it on another supercomputer but got struck, too. It is kind of serious issue

I have used ~15 Ubuntu based machines and I can't reproduce :(
It is indeed a serious issue.

I suggest the following: replace set_w() such that it returns geo.sVoxel(1). Its not perfect, but it should make the algorithms work.

I set:

def set_w(self):
    self.W = self.geo.sVoxel[1]
    return self.W

But this bug happen when it comes to "self.W[ang_index]" of this part in iterative_recon_alg.py:

 ang_index = self.angle_index[iteration].astype(np.int32)

        self.res += (
            self.lmbda
            * 1.0
            / self.V[iteration]
            * Atb(
                self.W[ang_index]

The error shows like:

IndexError: invalid index to scalar variable.

Ah yes, sorry I made a mistake. its more or less that but not exactly what I said. I don't have time now to fix it.

If you want to fix it yourself:

  • Understand what W is in size.
  • Replace it by something of the same size but of value 1/geo.svoxel[1]
    I'll try to come back and fix this at some point. Apologies, end of the year gets a bit busy .

hello, Ander, @AnderBiguri I try to use Windows to retry this, and interesting, it runs smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happen:

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

In the same codes, for parallel-beam CT, it does not have this error.
I guess when using Ax for cone-beam CT. The _Ax_ext makes the data have wrong leakage so the data can not transfer from CPU to GPU

TIGRE variables you have access too are always in the CPU.

There have been problems in the past relating TIGRE and pytorch (e.g. #509) which we will look into solving

I can run cone-beam ossart-tv by running the example code in UBUNTU now.
But when I insert this process in my diffusion model based on Pytorch and tensor in GPU. This reconstruction of ossart-tv will get struck and even I want to use "ctrl+C" to stop it can not be killed. I think I get struck due to the same reason as WINDOWS.
And interestingly, all stuff goes smoothly in parallel-beam CT. And I guess this error is same as #509.
The _Ax_ext works wrong for cone-beam CT reconstruction. This issue may be because there is some problem when transferring data

@GreameLee That is actually quite useful information yes.

Just to clarify, TIGRE+pytorch is not suppored yet, as I have not investigated in detail how to make it work. Likely Ubuntu/windows has nothing to do with this problem.

But if it works for parallel and not cone, this is something I can dig into.

Look forward to your exploring

Related: #563