RuntimeError: CUDA error: the launch timed out and was terminated
mriffle opened this issue · 3 comments
I am running Casanovo (4.1.0) on an older GPU (NVIDIA 960) under Windows 10 Pro (via WSL2 and Docker) and encountered this error: RuntimeError: CUDA error: the launch timed out and was terminated
I will post the entire stack trace below. I honestly don't know whether it's related to using an old GPU or not.
I was able to fix this by setting this environment variable before running casanovo: CUDA_LAUNCH_BLOCKING=1
I was reading the PyTorch forums, and they had this to say about it:
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably
So while this does work, I thought I would mention it here, since apparently it shouldn't be relied on as a way to make production software run reliably.
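Concretely, the workaround just amounts to exporting the variable in the shell before invoking casanovo; the input and output file names below are placeholders, not my actual paths:

```bash
# Force synchronous CUDA kernel launches (a debugging aid, not a production setting).
export CUDA_LAUNCH_BLOCKING=1

# Then run casanovo as usual (file names here are placeholders).
casanovo sequence spectra.mzML -o results.mztab
```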
Full stack trace(s):
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 903, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
return self.predict_loop.run()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 124, in run
self._predict_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 253, in _predict_step
predictions = call._call_strategy_hook(trainer, "predict_step", *step_args)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 438, in predict_step
return self.lightning_module.predict_step(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/casanovo/denovo/model.py", line 820, in predict_step
self.forward(batch[0], batch[1]),
File "/usr/local/lib/python3.10/site-packages/casanovo/denovo/model.py", line 200, in forward
return self.beam_search_decode(
File "/usr/local/lib/python3.10/site-packages/casanovo/denovo/model.py", line 230, in beam_search_decode
memories, mem_masks = self.encoder(spectra)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/depthcharge/components/transformers.py", line 105, in forward
return self.transformer_encoder(peaks, src_key_padding_mask=mask), mask
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 391, in forward
output = mod(output, src_mask=mask, is_causal=is_causal, src_key_padding_mask=src_key_padding_mask_for_layers)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 685, in forward
return torch._transformer_encoder_layer_fwd(
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/casanovo", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/casanovo/casanovo.py", line 143, in sequence
runner.predict(peak_path, output)
File "/usr/local/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 164, in predict
self.trainer.predict(self.model, self.loaders.test_dataloader())
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 864, in predict
return call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
trainer._teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1010, in _teardown
self.strategy.teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 537, in teardown
self.lightning_module.cpu()
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 82, in cpu
return super().cpu()
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 960, in cpu
return self._apply(lambda t: t.cpu())
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 849, in _apply
self._buffers[key] = fn(buf)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 960, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb9b0781d87 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb9b073275f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb9b0ba58a8 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xfa5045 (0x7fb9660f1045 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x540210 (0x7fb9af0dd210 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x649bf (0x7fb9b07669bf in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x21b (0x7fb9b075fc8b in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb9b075fe39 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x802b98 (0x7fb9af39fb98 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x7fb9af39ff16 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #28: <unknown function> + 0x2724a (0x7fb9b168f24a in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #29: __libc_start_main + 0x85 (0x7fb9b168f305 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #30: _start + 0x21 (0x559af8c3c071 in /usr/local/bin/python)
Which PyTorch and CUDA version are you using? I assume that the GPU is detected as functional, otherwise it shouldn't be trying to run CUDA-related stuff, but can you post the full log to be sure?
I strongly suspect that the problem is indeed the rather old GPU, and this is not something we can or need to fix on the Casanovo side.
PyTorch: 2.2.2+cu121
CUDA: 12.2.79
The GPU is detected as functional and I can see it being used. I agree the problem is likely the old GPU. I figured it was worth posting in case someone else ran into this issue and needed a solution.
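For reference, a quick check along these lines (run inside the container) shows what PyTorch sees: the PyTorch build, its bundled CUDA version, whether a GPU is usable, and the detected device name:

```bash
# Print the PyTorch version, its CUDA version, whether CUDA is available,
# and the name of device 0.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```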
Here is the standard error and log from the failed run (it got to about 6% done):
Standard Err:
Seed set to 454
INFO: Casanovo version 4.1.0
INFO: Sequencing peptides from:
INFO: Ecl_2022_1214_neo_pepsep_150um_30cm_1-9_TRX-DDA-Xray-Hs27-T1-FR0_28.mzML
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
Ecl_2022_1214_neo_pepsep_150um_30cm_1-9_TRX-DDA-Xray-Hs27-T1-FR0_28.mzML: 0%|
Here is the log:
2024-07-16 23:55:53,792 INFO [casanovo/MainProcess] casanovo.setup_model : Casanovo version 4.1.0
2024-07-16 23:55:53,792 DEBUG [casanovo/MainProcess] casanovo.setup_model : model = casanovo_v4_2_0.ckpt
2024-07-16 23:55:53,792 DEBUG [casanovo/MainProcess] casanovo.setup_model : config = casanovo.yaml
2024-07-16 23:55:53,792 DEBUG [casanovo/MainProcess] casanovo.setup_model : output = /home/mriffle/casanovo-tests/work/a7/5f52321e71d1d1f2e16f321fd1e69a/results.mztab
2024-07-16 23:55:53,792 DEBUG [casanovo/MainProcess] casanovo.setup_model : precursor_mass_tol = 50.0
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : isotope_error_range = (0, 1)
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : min_peptide_len = 6
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : predict_batch_size = 1024
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_beams = 1
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : top_match = 1
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : accelerator = auto
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : devices = None
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : random_seed = 454
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_log = 1
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : tb_summarywriter = None
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : save_top_k = 5
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : model_save_folder_path =
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : val_check_interval = 50000
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_peaks = 150
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : min_mz = 50.0
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : max_mz = 2500.0
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : min_intensity = 0.01
2024-07-16 23:55:53,793 DEBUG [casanovo/MainProcess] casanovo.setup_model : remove_precursor_tol = 2.0
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : max_charge = 10
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : dim_model = 512
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_head = 8
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : dim_feedforward = 1024
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_layers = 9
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : dropout = 0.0
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : dim_intensity = None
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : max_length = 100
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : warmup_iters = 100000
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : max_iters = 600000
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : learning_rate = 0.0005
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : weight_decay = 1e-05
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : train_label_smoothing = 0.01
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : train_batch_size = 32
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : max_epochs = 30
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : num_sanity_val_steps = 0
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : calculate_precision = False
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : residues = {'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P': 97.052764, 'V': 99.068414, 'T': 101.04767, 'C+57.021': 160.030649, 'L': 113.084064, 'I': 113.084064, 'N': 114.042927, 'D': 115.026943, 'Q': 128.058578, 'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912, 'F': 147.068414, 'R': 156.101111, 'Y': 163.063329, 'W': 186.079313, 'M+15.995': 147.0354, 'N+0.984': 115.026943, 'Q+0.984': 129.042594, '+42.011': 42.010565, '+43.006': 43.005814, '-17.027': -17.026549, '+43.006-17.027': 25.980265}
2024-07-16 23:55:53,794 DEBUG [casanovo/MainProcess] casanovo.setup_model : n_workers = 24
2024-07-16 23:55:53,848 INFO [casanovo/MainProcess] casanovo.sequence : Sequencing peptides from:
2024-07-16 23:55:53,848 INFO [casanovo/MainProcess] casanovo.sequence : Ecl_2022_1214_neo_pepsep_150um_30cm_1-9_TRX-DDA-Xray-Hs27-T1-FR0_28.mzML
2024-07-16 23:55:55,418 INFO [depthcharge.data.hdf5/MainProcess] hdf5.__init__ : Reading 1 files...
As discussed, this is due to the older GPU and not something we can fix on the Casanovo side. Users who run into the same problem can try setting the CUDA_LAUNCH_BLOCKING=1 environment variable, which might work around the issue at the expense of some performance.
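For a Docker-based setup like the one in this report, the variable can be passed to the container at startup; the image name and bind mount below are placeholders:

```bash
# Pass the workaround into the container; <casanovo-image> and the mount paths
# are placeholders for your own setup.
docker run --gpus all \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -v /path/to/data:/data \
    <casanovo-image> \
    casanovo sequence /data/spectra.mzML -o /data/results.mztab
```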