TIGRE can't run with pytorch on the same GPU
huscael opened this issue · 10 comments
Expected Behavior
I am using PyTorch and TIGRE together to do inverse projection, but I found that when I put PyTorch and TIGRE on the same GPU, an error is raised. If I put them on different GPUs, there is no error. Could you explain why that is? Is there a way to run TIGRE and PyTorch on the same GPU? Thanks!
Actual Behavior
When TIGRE and PyTorch work on the same GPU, I get the following error:
Traceback (most recent call last):
File "inverse_problem_solver_tigre_AAPM_3d_total.py", line 231, in <module>
x = pc_radon(score_model, scaler(img), measurement=sinogram)
File "/data/xyl/DiffusionMBIR_for_CBCT/controllable_generation_TV_for_tigre.py", line 376, in pc_radon
x_batch_sing, _ = predictor_denoise_update_fn(model, data, x_batch_sing, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/controllable_generation_TV_for_tigre.py", line 324, in radon_update_fn
x, x_mean = update_fn(x, vec_t, model=model)
File "/data/xyl/DiffusionMBIR_for_CBCT/sampling.py", line 384, in shared_predictor_update_fn
return predictor_obj.update_fn(x, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/sampling.py", line 197, in update_fn
f, G = self.rsde.discretize(x, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/sde_lib.py", line 105, in discretize
rev_f = f - G[:, None, None, None] ** 2 * score_fn(x, t) * (0.5 if self.probability_flow else 1.)
File "/data/xyl/DiffusionMBIR_for_CBCT/models/utils.py", line 177, in score_fn
score = model_fn(x, labels)
File "/data/xyl/DiffusionMBIR_for_CBCT/models/utils.py", line 126, in model_fn
return model(x, labels)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
res = scatter_map(inputs)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Code to reproduce the problem (If applicable)
The following code is only for reproducing the problem; it differs from my actual code but is enough to demonstrate the issue.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import tigre
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from tigre.utilities import gpu

# Define a custom dataset class
class TigreDataset(Dataset):
    def __init__(self, gpuids):
        # keep geometry and angles around: fdk needs them again later
        self.geo = tigre.geometry_default(high_resolution=False)
        #%% Load data and generate projections
        # define angles
        self.angles = np.linspace(0, 2 * np.pi, 100)
        # Load head phantom data
        head = sample_loader.load_head_phantom(self.geo.nVoxel)
        # generate projections
        projections = tigre.Ax(head, self.geo, self.angles, gpuids=gpuids)
        # add noise
        self.noise_projections = CTnoise.add(projections, Poisson=1e5, Gaussian=np.array([0, 10]))
        self.data = torch.from_numpy(np.array(self.noise_projections))

    def __len__(self):
        return self.noise_projections.shape[0]

    def __getitem__(self, index):
        return {'input': self.data[index]}

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(128 * 128, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)

if __name__ == "__main__":
    # TIGRE and DataParallel share GPUs 0 and 1
    gpuids = gpu.GpuIds()
    gpuids.devices = [0, 1]

    custom_dataset = TigreDataset(gpuids)
    data_loader = DataLoader(dataset=custom_dataset, batch_size=10, shuffle=True, num_workers=4)

    net = Net().to(torch.device('cuda:0'))
    gpus = [0, 1]
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=0.001)

    for epoch in range(10):
        data_agg = list()
        for batch_idx, batch in enumerate(data_loader):
            data = batch['input'].to(torch.device('cuda:0'))
            data = data.view(-1, 128 * 128)
            print(data.shape)
            optimizer.zero_grad()
            net_out = net(data).to(torch.device('cuda:0'))
            # dummy all-zero class targets, one per sample in the batch
            target = torch.zeros(net_out.shape[0], dtype=torch.long).to(torch.device('cuda:0'))
            loss = criterion(net_out, target)
            loss.backward()
            optimizer.step()
            data_agg.append(data)
        data_agg_all = torch.cat(data_agg, dim=0)
        data_agg_all = data_agg_all.detach().cpu().numpy()
        data_agg_all = data_agg_all.reshape(-1, 128, 128)
        imgFDK_agg_all = algs.fdk(data_agg_all, custom_dataset.geo, custom_dataset.angles, gpuids=gpuids)
Specifications
- python version: 3.8.17
- OS: Ubuntu 20.04.5 LTS
- GPU: NVIDIA GeForce RTX 3090
- CUDA version: 11.4
- TIGRE version: 2.5
Hi! Thanks for the bug report.
The first time we ourselves tried PyTorch and TIGRE was a few days ago in #508, so the answer is that we don't know yet, but we are working on making this functional soon.
Hopefully when that PR is merged this issue will be fixed as well.
Hi! Could you please forward the error message you get when running your script with CUDA_LAUNCH_BLOCKING=1? You can do that by running CUDA_LAUNCH_BLOCKING=1 python my_script.py.
Also, these lines:
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])
might cause problems. From experience, using lines like these with libraries that interact intensively with the GPU causes problems (I had the same with ASTRA+ODL). In the error trace I see that something in torch/nn/parallel is not happy, so if you could remove these lines and tell us whether anything changes, that would be great :)
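For example, a minimal sketch of the test being suggested here (names taken from the reproduction script above): keep the model on a single device and disable the wrap.

    net = Net().to(torch.device('cuda:0'))
    # disabled for the test: the wrap that scatters inputs across GPUs
    # if torch.cuda.device_count() > 1:
    #     net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])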
I've added CUDA_LAUNCH_BLOCKING=1 on the command line, and it makes no difference to the error message. Also, torch/nn/parallel is a must in my real code to accelerate training; otherwise it would run for days :-(
@huscael thanks for the test! Indeed there could be issues with DataParallel. I don't fully understand how it works internally, but TIGRE requires its inputs to be CPU numpy arrays. If DataParallel puts things on the GPU, then the input to TIGRE can be a GPU array, which would trigger an "invalid argument" error inside it.
I'm not saying this is 100% the cause, as I don't know what DataParallel does internally, but it could be.
Perhaps it's related to how you pass the gpuids to TIGRE. Can you somehow grab the current GPU from the DataParallel instance and pass that one in? TIGRE will split the operation between all the GPUs in gpuids, while DataParallel expects that a given data point (a given instance of head) runs on a given GPU.
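For example, a sketch of that idea (assuming gpus[0] is the output_device of the DataParallel instance, as in the reproduction script):

    from tigre.utilities import gpu

    # restrict TIGRE to the one GPU that DataParallel gathers outputs on,
    # instead of letting TIGRE split the work across all devices
    tigre_gpuids = gpu.GpuIds()
    tigre_gpuids.devices = [gpus[0]]
    projections = tigre.Ax(head, geo, angles, gpuids=tigre_gpuids)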
Regarding the CPU inputs: the line data_agg_all = data_agg_all.detach().cpu().numpy() ensures that the input to TIGRE is a CPU numpy array.
Regarding gpuids: I have tried setting gpuids to [0] for both TIGRE and DataParallel, and the invalid argument error still occurs. My current remedy is to use more GPUs than I really need, e.g. device ids [0, 1, 2] for DataParallel and gpuids [3] for TIGRE. Under those circumstances the error finally disappears, but TIGRE uses very little of that GPU, so colleagues unintentionally run their programs on it and cause the same invalid argument error in my program. In addition, this approach requires more GPUs, and GPU resources are not plentiful in my lab. So I'm asking for a way to run PyTorch, TIGRE, and DataParallel on the same GPU, thanks a lot!
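Roughly, the workaround looks like this (a sketch of the device split described above):

    gpus = [0, 1, 2]  # DataParallel gets three GPUs of its own
    net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])

    gpuids = gpu.GpuIds()
    gpuids.devices = [3]  # TIGRE gets a separate GPU that torch never touches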
The .detach().cpu().numpy() line covers the fdk call, but I suspect the error is caused by DataParallel inside the data loader/dataset, i.e. the call to Ax, not the call to fdk. DataParallel's whole point is to put the data on GPUs.
In any case, I don't know exactly how DataParallel parallelizes the loader (but it does put things on GPUs, which may cause the problems, as I said). As we are working on making TIGRE a bit more PyTorch compatible we may find the issue, but for now the only thing I can say is that I don't know, and it's not technically a supported feature, so technically not a bug.
Hopefully I can give you a better answer at some point. I'll ping you if I find one.
I also get a similar error when I run TIGRE and torch on the same GPU. After I use tigre.Ax() to generate projections, I can no longer push any data to the GPU with torch; it always raises CUDA error: an illegal memory access was encountered.
I think the TIGRE toolbox may change some global parameters or environment state of the GPU, which causes torch to fail to connect to the GPU driver.
@ldy1995 still unsure what the issue is, but in theory TIGRE should create a CUDA context and destroy it every time it is called; i.e. as opposed to PyTorch, which holds GPU memory the entire time, each TIGRE Ax() or Atb() call should be a new, independent call to the GPU that opens and closes its own session. Clearly this is not happening, but I'm not entirely sure why.
I will be working on making TIGRE torch compatible soon, so hopefully we can fix this. Any ideas are welcome, of course.
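One untested idea along those lines (not from this thread, and run_ax is a hypothetical helper): call TIGRE from a spawned subprocess, so that whatever context or global GPU state it touches lives and dies in a process separate from the torch context. This assumes the geometry object and arrays pickle cleanly.

    import multiprocessing as mp

    import numpy as np
    import tigre
    from tigre.utilities import sample_loader

    def run_ax(queue, head, geo, angles):
        # tigre.Ax runs in the child, in a CUDA context the parent never sees
        queue.put(tigre.Ax(head, geo, angles))

    if __name__ == "__main__":
        geo = tigre.geometry_default(high_resolution=False)
        angles = np.linspace(0, 2 * np.pi, 100)
        head = sample_loader.load_head_phantom(geo.nVoxel)

        ctx = mp.get_context("spawn")  # fresh interpreter, no inherited CUDA state
        queue = ctx.Queue()
        p = ctx.Process(target=run_ax, args=(queue, head, geo, angles))
        p.start()
        projections = queue.get()  # read before join() to avoid a queue deadlock
        p.join()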