TIGRE can't run with pytorch on the same GPU
huscael opened this issue · 10 comments
Expected Behavior
I am using PyTorch and TIGRE together to do inverse projection, but I found that when I put PyTorch and TIGRE on the same GPU, an error is raised. If I put them on different GPUs, there is no error. Could you explain why that is? Is there a way to run TIGRE and PyTorch on the same GPU? Thanks!
Actual Behavior
When TIGRE and PyTorch work on the same GPU, I get the following error:
Traceback (most recent call last):
File "inverse_problem_solver_tigre_AAPM_3d_total.py", line 231, in <module>
x = pc_radon(score_model, scaler(img), measurement=sinogram)
File "/data/xyl/DiffusionMBIR_for_CBCT/controllable_generation_TV_for_tigre.py", line 376, in pc_radon
x_batch_sing, _ = predictor_denoise_update_fn(model, data, x_batch_sing, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/controllable_generation_TV_for_tigre.py", line 324, in radon_update_fn
x, x_mean = update_fn(x, vec_t, model=model)
File "/data/xyl/DiffusionMBIR_for_CBCT/sampling.py", line 384, in shared_predictor_update_fn
return predictor_obj.update_fn(x, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/sampling.py", line 197, in update_fn
f, G = self.rsde.discretize(x, t)
File "/data/xyl/DiffusionMBIR_for_CBCT/sde_lib.py", line 105, in discretize
rev_f = f - G[:, None, None, None] ** 2 * score_fn(x, t) * (0.5 if self.probability_flow else 1.)
File "/data/xyl/DiffusionMBIR_for_CBCT/models/utils.py", line 177, in score_fn
score = model_fn(x, labels)
File "/data/xyl/DiffusionMBIR_for_CBCT/models/utils.py", line 126, in model_fn
return model(x, labels)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
res = scatter_map(inputs)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
File "/home/xyl/anaconda3/envs/diffusion-mbir-cbct-copy/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Code to reproduce the problem (If applicable)
The following code is only for reproducing the problem; it differs from my actual code but is enough to demonstrate the issue.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import tigre
from tigre.utilities import sample_loader
from tigre.utilities import CTnoise
import tigre.algorithms as algs
from tigre.utilities import gpu

# Define a custom dataset class
class TigreDataset(Dataset):
    def __init__(self, gpuids):
        # keep geometry and angles around: fdk needs them again later
        self.geo = tigre.geometry_default(high_resolution=False)
        #%% Load data and generate projections
        # define angles
        self.angles = np.linspace(0, 2 * np.pi, 100)
        # Load head phantom data
        head = sample_loader.load_head_phantom(self.geo.nVoxel)
        # generate projections
        projections = tigre.Ax(head, self.geo, self.angles, gpuids=gpuids)
        # add noise
        self.noise_projections = CTnoise.add(projections, Poisson=1e5, Gaussian=np.array([0, 10]))
        self.data = torch.from_numpy(np.array(self.noise_projections))

    def __len__(self):
        return self.noise_projections.shape[0]

    def __getitem__(self, index):
        return {'input': self.data[index]}

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(128 * 128, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)

if __name__ == "__main__":
    # TIGRE and DataParallel share GPUs 0 and 1
    gpuids = gpu.GpuIds()
    gpuids.devices = [0, 1]

    custom_dataset = TigreDataset(gpuids)
    data_loader = DataLoader(dataset=custom_dataset, batch_size=10, shuffle=True, num_workers=4)

    net = Net().to(torch.device('cuda:0'))
    gpus = [0, 1]
    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=0.001)

    for epoch in range(10):
        data_agg = list()
        for batch_idx, batch in enumerate(data_loader):
            data = batch['input'].to(torch.device('cuda:0'))
            data = data.view(-1, 128 * 128)
            print(data.shape)
            optimizer.zero_grad()
            net_out = net(data).to(torch.device('cuda:0'))
            # dummy all-zero class targets, one per sample in the batch
            target = torch.zeros(net_out.shape[0], dtype=torch.long).to(torch.device('cuda:0'))
            loss = criterion(net_out, target)
            loss.backward()
            optimizer.step()
            data_agg.append(data)
        data_agg_all = torch.cat(data_agg, dim=0)
        data_agg_all = data_agg_all.detach().cpu().numpy()
        data_agg_all = data_agg_all.reshape(-1, 128, 128)
        imgFDK_agg_all = algs.fdk(data_agg_all, custom_dataset.geo, custom_dataset.angles, gpuids=gpuids)
Specifications
- python version: 3.8.17
- OS: Ubuntu 20.04.5 LTS
- GPU: NVIDIA GeForce RTX 3090
- CUDA version: 11.4
- TIGRE version: 2.5
Hi! Thanks for the bug report.
The first time we ourselves tried PyTorch and TIGRE was a few days ago in #508, so the answer is that we don't know yet, but we are working on making this functional soon.
Hopefully when that PR is merged this issue will be fixed as well.
Hi! Could you please forward the error message you get when running your script with CUDA_LAUNCH_BLOCKING=1? You can do that by running CUDA_LAUNCH_BLOCKING=1 python my_script.py.
Also, these lines:
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])
might cause problems. From experience, using lines like these with libraries that interact intensively with the GPU causes problems (I had the same with ASTRA+ODL). In the error trace I see that something in torch/nn/parallel is not happy, so if you could remove these lines and tell us whether anything changes, that would be great :)
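For example, a minimal sketch of the test being suggested here (names taken from the reproduction script above): keep the model on a single device and disable the wrap.

    net = Net().to(torch.device('cuda:0'))
    # disabled for the test: the wrap that scatters inputs across GPUs
    # if torch.cuda.device_count() > 1:
    #     net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])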
I've added CUDA_LAUNCH_BLOCKING=1 on the command line, and it makes no difference to the error message. Also, torch/nn/parallel is a must in my real code to accelerate training; otherwise it would run for days :-(
@huscael thanks for the test! Indeed there could be issues with DataParallel. I don't fully understand how it works internally, but TIGRE requires its inputs to be CPU numpy arrays. If DataParallel puts things on the GPU, then the input to TIGRE can be a GPU array, which would trigger an "invalid argument" error inside it.
I'm not saying this is 100% the cause, as I don't know what DataParallel does internally, but it could be.
Perhaps it's related to how you pass the gpuids to TIGRE. Can you somehow grab the current GPU from the DataParallel instance and pass that one in? TIGRE will split the operation between all the GPUs in gpuids, while DataParallel expects that a given data point (a given instance of head) runs on a given GPU.
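For example, a sketch of that idea (assuming gpus[0] is the output_device of the DataParallel instance, as in the reproduction script):

    from tigre.utilities import gpu

    # restrict TIGRE to the one GPU that DataParallel gathers outputs on,
    # instead of letting TIGRE split the work across all devices
    tigre_gpuids = gpu.GpuIds()
    tigre_gpuids.devices = [gpus[0]]
    projections = tigre.Ax(head, geo, angles, gpuids=tigre_gpuids)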
Regarding the CPU inputs: the line data_agg_all = data_agg_all.detach().cpu().numpy() ensures that the input to TIGRE is a CPU numpy array.
Regarding gpuids: I have tried setting gpuids to [0] for both TIGRE and DataParallel, and the invalid argument error still occurs. My current remedy is to use more GPUs than I really need, e.g. device ids [0, 1, 2] for DataParallel and gpuids [3] for TIGRE. Under those circumstances the error finally disappears, but TIGRE uses very little of that GPU, so colleagues unintentionally run their programs on it and cause the same invalid argument error in my program. In addition, this approach requires more GPUs, and GPU resources are not plentiful in my lab. So I'm asking for a way to run PyTorch, TIGRE, and DataParallel on the same GPU, thanks a lot!
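Roughly, the workaround looks like this (a sketch of the device split described above):

    gpus = [0, 1, 2]  # DataParallel gets three GPUs of its own
    net = nn.DataParallel(net, device_ids=gpus, output_device=gpus[0])

    gpuids = gpu.GpuIds()
    gpuids.devices = [3]  # TIGRE gets a separate GPU that torch never touches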
The .detach().cpu().numpy() line covers the fdk call, but I suspect the error is caused by DataParallel inside the data loader/dataset, i.e. the call to Ax, not the call to fdk. DataParallel's whole point is to put the data on GPUs.
In any case, I don't know exactly how DataParallel parallelizes the loader (but it does put things on GPUs, which may cause the problems, as I said). As we are working on making TIGRE a bit more PyTorch compatible we may find the issue, but for now the only thing I can say is that I don't know, and it's not technically a supported feature, so technically not a bug.
Hopefully I can give you a better answer at some point. I'll ping you if I find one.
I also get a similar error when I run TIGRE and torch on the same GPU. After I use tigre.Ax() to generate projections, I can no longer push any data to the GPU with torch; it always raises CUDA error: an illegal memory access was encountered.
I think the TIGRE toolbox may change some global parameters or environment state of the GPU, which causes torch to fail to connect to the GPU driver.
@ldy1995 still unsure what the issue is, but in theory TIGRE should create a CUDA context and destroy it every time it is called; i.e. as opposed to PyTorch, which holds GPU memory the entire time, each TIGRE Ax() or Atb() call should be a new, independent call to the GPU that opens and closes its own session. Clearly this is not happening, but I'm not entirely sure why.
I will be working on making TIGRE torch compatible soon, so hopefully we can fix this. Any ideas are welcome, of course.
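One untested idea along those lines (not from this thread, and run_ax is a hypothetical helper): call TIGRE from a spawned subprocess, so that whatever context or global GPU state it touches lives and dies in a process separate from the torch context. This assumes the geometry object and arrays pickle cleanly.

    import multiprocessing as mp

    import numpy as np
    import tigre
    from tigre.utilities import sample_loader

    def run_ax(queue, head, geo, angles):
        # tigre.Ax runs in the child, in a CUDA context the parent never sees
        queue.put(tigre.Ax(head, geo, angles))

    if __name__ == "__main__":
        geo = tigre.geometry_default(high_resolution=False)
        angles = np.linspace(0, 2 * np.pi, 100)
        head = sample_loader.load_head_phantom(geo.nVoxel)

        ctx = mp.get_context("spawn")  # fresh interpreter, no inherited CUDA state
        queue = ctx.Queue()
        p = ctx.Process(target=run_ax, args=(queue, head, geo, angles))
        p.start()
        projections = queue.get()  # read before join() to avoid a queue deadlock
        p.join()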