Get struck when change parallel to cone in single-GPU

Question

Get struck when change parallel to cone in single-GPU

Opened this issue 18 days ago · 27 comments

I can run iterative algorithms such as ossart 2D parallel beam CT reconstruction smooth on ubuntu. But when it comes for cone-beam( for 2d it is fanbeam) The running just gets stuck and prints nothing, even the estimation time.
This is the code:

import tigre
import numpy as np
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from scipy.io import loadmat
import matplotlib.pyplot as plt
import torch

geo = tigre.geometry()
# VARIABLE                                   DESCRIPTION                    UNITS
# -------------------------------------------------------------------------------------
# Distances
geo.DSD = 1536  # Distance Source Detector      (mm)
geo.DSO = 1000 # Distance Source Origin        (mm)
# Image parameters
geo.nVoxel = np.array([1, 256, 256])  # number of voxels              (vx)
geo.sVoxel = np.array([1, 256, 256])  # total size of the image       (mm)
geo.dVoxel = geo.sVoxel / geo.nVoxel  # size of each voxel            (mm)
print(geo.dVoxel)
# Detector parameters
geo.nDetector = np.array([1, 1512])  # number of pixels              (px)
geo.dDetector = np.array([geo.dVoxel[0], 0.8])  # size of each pixel            (mm)
# geo.dDetector = np.array([1, 0.1])
geo.sDetector = geo.nDetector * geo.dDetector  # total size of the detector    (mm)
# Offsets
geo.offOrigin = np.array([0, 0, 0])  # Offset of image from origin   (mm)
geo.offDetector = np.array([0, 0])  # Offset of Detector            (mm)
# MAKE SURE THAT THE DETECTOR PIXELS SIZE IN V IS THE SAME AS THE IMAGE!

geo.mode = "cone"

#%% Define angles of projection and load phatom image

angles = np.linspace(0, 2 * np.pi, 180)


img = sample_loader.load_head_phantom(geo.nVoxel)
limited_angle = np.linspace(0, 2*np.pi, 90)
limited_projection = tigre.Ax(img, geo, limited_angle)
tigre.plotSinogram(limited_projection, 0)

niter=500
print("start ossart_tv")
rec_img_ossart_tv = algs.ossart_tv(limited_projection, geo, limited_angle, niter)
plt.imshow(rec_img_ossart_tv.squeeze(), cmap='gray')
plt.savefig('ossart_tv_90.png', bbox_inches='tight', pad_inches=0)
plt.axis('off')

python version: 3.9
OS:Ubuntu 18.04.6 LTS
CUDA version:12.3

Answer 1 · 2024-06-20T16:39:00.000Z

That is strange! if you do it for 3D, does it work well?
Can you try to figure out where does it get stuck?

Probably related to #552

Answer 2 · 2024-06-20T17:00:26.000Z

I try to make a breakpoint and it got stuck in this column:

194        self.set_w()

of iterative_recon_alg.py

Answer 3 · 2024-06-20T17:01:27.000Z

@GreameLee and inside there, do you know where?

Answer 4 · 2024-06-20T17:11:15.000Z

@AnderBiguri yes, I step inside and it is in

222        W = Ax(
223            np.ones(geox.nVoxel, dtype=np.float32), geox, self.angles, "Siddon", gpuids=self.gpuids
224        )

Answer 5 · 2024-06-21T08:44:16.000Z

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

Answer 6 · 2024-06-21T14:37:35.000Z

I see! I will change this function soon. For now, if you give this function geo instead of geox, I believe it should work.

Do you mean that replace all geox with geo in iterative_recon_alg.py? Cause there is an assignment about geox before this line:

216     geo = copy.deepcopy(self.geo)

I think geox is geo already

Answer 7 · 2024-06-21T14:39:05.000Z

Yes, in particular in the set_w function.

Answer 8 · 2024-06-21T15:42:29.000Z

I meant:

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

Answer 9 · 2024-06-21T15:48:35.000Z

Yes, I tried as

def set_w(self):
        """
        Calculates value of W if this is not given.
        :return: None
        """
        geo=self.geo
        W = Ax(
            np.ones(geo.nVoxel, dtype=np.float32), geo, self.angles, "Siddon", gpuids=self.gpuids
        )
        W[W <= min(self.geo.dVoxel / 2)] = np.inf
        W = 1.0 / W
        setattr(self, "W", W)

it but still got struck

Answer 10 · 2024-06-21T15:51:01.000Z

Hum, confusing. I don't really know why it happens then.... I will investigate, but its hard, as I can not make it happen in any of my 5 computers.

Answer 11 · 2024-06-21T18:19:01.000Z

Are they based on Linux Ubuntu? @AnderBiguri I run it in Ubuntu and it gets struck again in the same line. This issue won't happen in Windows

Answer 12 · 2024-06-22T03:27:46.000Z

I go deep inside that step and go into the Ax function in utilities, and find it got struck in gpu.py with this line:

31   def __len__(self):
32           return len(self.devices)

Answer 13 · 2024-06-22T09:45:28.000Z

Hum that is strange! What if you change it to return your_number_of_gpus , however many those are?

Answer 14 · 2024-06-22T17:45:10.000Z

yes, I edit it as :

def __len__(self):
        return 1

But it still get struck in following line of Ax.py:

35  return _Ax_ext(img, geox, geox.angles, projection_type, geox.mode, gpuids=gpuids)

can I ask what is the "_Ax_ext" function? I see that

from _Ax import _Ax_ext

But I did not see any function named _Ax_ext
The strange thing is that before ossarttv, there is tigre.Ax using Ax function and that line goes smooth.

Answer 15 · 2024-06-22T18:06:41.000Z

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

Answer 16 · 2024-06-23T22:11:01.000Z

Its _Ax_ext that fails indeed, its the CUDA code for the forward projection. I still not really understand why it fails sometimes only in some machines. If you are using exactly the same geometry any algorithm should work the same. Its really confusing.

I think this may happen on supercomputers based on Ubuntu. I tried it on another supercomputer but got struck, too. It is kind of serious issue

Answer 17 · 2024-06-24T11:16:06.000Z

I have used ~15 Ubuntu based machines and I can't reproduce :(
It is indeed a serious issue.

I suggest the following: replace set_w() such that it returns geo.sVoxel(1). Its not perfect, but it should make the algorithms work.

Answer 18 · 2024-06-24T14:41:33.000Z

I set:

def set_w(self):
    self.W = self.geo.sVoxel[1]
    return self.W

But this bug happen when it comes to "self.W[ang_index]" of this part in iterative_recon_alg.py:

 ang_index = self.angle_index[iteration].astype(np.int32)

        self.res += (
            self.lmbda
            * 1.0
            / self.V[iteration]
            * Atb(
                self.W[ang_index]

The error shows like:

IndexError: invalid index to scalar variable.

Answer 19 · 2024-06-24T14:47:55.000Z

Ah yes, sorry I made a mistake. its more or less that but not exactly what I said. I don't have time now to fix it.

If you want to fix it yourself:

Understand what W is in size.
Replace it by something of the same size but of value 1/geo.svoxel[1]
I'll try to come back and fix this at some point. Apologies, end of the year gets a bit busy .

Answer 20 · 2024-06-26T02:08:25.000Z

hello, Ander, @AnderBiguri I try to use Windows to retry this, and interesting, it runs smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happen:

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

In the same codes, for parallel-beam CT, it does not have this error.
I guess when using Ax for cone-beam CT. The _Ax_ext makes the data have wrong leakage so the data can not transfer from CPU to GPU

Answer 21 · 2024-06-26T07:12:35.000Z

Are you using this with pytorch?

…

On Wed, 26 Jun 2024, 03:08 Haodong, ***@***.***> wrote: hello, Ander, @AnderBiguri <https://github.com/AnderBiguri> I try to use windows to retry this, and interesting, when it run smooth for the ossart-tv but when I try to copy the output from CPU to GPU, this error will happened: RuntimeError: CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. — Reply to this email directly, view it on GitHub <#560 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC2OENF3YVUXJBBNAR7EFOTZJIPC5AVCNFSM6AAAAABJUJYU3WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJQGM4TKOJTGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

Answer 22 · 2024-06-26T11:05:36.000Z

Yes, I use tensor_gpu = tensor_cpu.to(device='cuda') to copy tensor from results from tigre

Answer 23 · 2024-06-26T11:09:10.000Z

TIGRE variables you have access too are always in the CPU.

There have been problems in the past relating TIGRE and pytorch (e.g. #509) which we will look into solving

Answer 24 · 2024-06-30T16:38:19.000Z

I can run cone-beam ossart-tv by running the example code in UBUNTU now.
But when I insert this process in my diffusion model based on Pytorch and tensor in GPU. This reconstruction of ossart-tv will get struck and even I want to use "ctrl+C" to stop it can not be killed. I think I get struck due to the same reason as WINDOWS.
And interestingly, all stuff goes smoothly in parallel-beam CT. And I guess this error is same as #509.
The _Ax_ext works wrong for cone-beam CT reconstruction. This issue may be because there is some problem when transferring data

Answer 25 · 2024-06-30T17:03:07.000Z

@GreameLee That is actually quite useful information yes.

Just to clarify, TIGRE+pytorch is not suppored yet, as I have not investigated in detail how to make it work. Likely Ubuntu/windows has nothing to do with this problem.

But if it works for parallel and not cone, this is something I can dig into.

Answer 26 · 2024-07-01T13:58:45.000Z

Look forward to your exploring

Answer 27 · 2024-07-01T14:42:26.000Z

Related: #563