cudaMemcpyAsync throws exception in GPUDataTransfer

Question

laxnpander opened this issue 8 months ago · 5 comments

Describe the issue

Hey all,

I have an issue running the following model:
https://github.com/fabio-sim/LightGlue-ONNX
More specifically, this ONNX file:
https://github.com/fabio-sim/LightGlue-ONNX/releases/download/v1.0.0/superpoint_lightglue_end2end_fused.onnx

Verbose log:
verbose_log.txt

CUDA throws an exception while asynchronously copying data. According to the verbose log, it always seems to happen at the kernel with index 2478.
The stack trace is as follows:

<unknown> 0x00007fffd9970935
<unknown> 0x00007fffd9a5d86a
<unknown> 0x00007fffd9b914cb
<unknown> 0x00007fffd9b91d61
<unknown> 0x00007fffd9cb9130
<unknown> 0x00007fffd9931a33
<unknown> 0x00007fffd9931f41
<unknown> 0x00007fffd9932ea8
<unknown> 0x00007fffd9b000d1
<unknown> 0x00007fffdb644459
<unknown> 0x00007fffdb6176fd
cudaMemcpyAsync 0x00007fffdb6696a5
onnxruntime::GPUDataTransfer::CopyTensorAsync(onnxruntime::Tensor const&, onnxruntime::Tensor&, onnxruntime::Stream&) const 0x00007fff9fd1b0dd
onnxruntime::IDataTransfer::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007ffff6dbbe63
onnxruntime::ProviderHostImpl::IDataTransfer__CopyTensors(onnxruntime::IDataTransfer const*, std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) 0x00007ffff66406a8
onnxruntime::IDataTransfer::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007fff9ff35bc7
onnxruntime::DataTransferManager::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007ffff6dbf95d
onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection*, bool, onnxruntime::Stream*) 0x00007ffff6e65802
onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollectionHolder&, bool, onnxruntime::Stream*) 0x00007ffff6e66e8b
onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, OrtRunOptions const&, onnxruntime::DeviceStreamCollectionHolder&, onnxruntime::logging::Logger const&) 0x00007ffff6e671f3
onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) [clone .localalias] 0x00007ffff668ac8a
onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) 0x00007ffff668bab2
OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) 0x00007ffff6613fff
Ort::detail::SessionImpl::Run onnxruntime_cxx_inline.h:967
spear::ort::Inference::run Inference.h:314
main superpoint_lightglue_main.cpp:67
__libc_start_call_main 0x00007ffff5c29d90
__libc_start_main_impl 0x00007ffff5c29e40
_start 0x0000555555558d55

The test report also shows some failures. I am not sure whether they are related:

[----------] Global test environment tear-down
[==========] 3957 tests from 279 test suites ran. (260820 ms total)
[  PASSED  ] 3935 tests.
[  SKIPPED ] 2 tests, listed below:
[  SKIPPED ] MatMulFpQ4.MatMul2DSym
[  SKIPPED ] MatMulFpQ4.MatMul2DBlkZp
[  FAILED  ] 20 tests, listed below:
[  FAILED  ] QOrderedTest.Attention_WithData_ROW_ORDER
[  FAILED  ] QOrderedTest.LongformerAttention_1x128x2x16_window_32
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_addC_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_bias_addC_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32_b3_1
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_broadcastC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_bias_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_bias_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_addC_broadcastC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_bias_addC_broadcastC_COL_16x64x32_b2_1_perchannel

To reproduce

Load the model into ONNX Runtime, set two images as input, and run inference through the C++ API.
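
For readers who want to try this, here is a minimal sketch of those steps. It is not the code from the report. The helper names (`to_float_image`, `image_shape`, `run_pair`) are hypothetical, and the [0, 1] normalization, the single-channel NCHW layout, and the expectation of exactly two image inputs are assumptions that should be checked against the model's documentation. The ONNX Runtime part is only compiled when the header is found (this uses C++17 `__has_include`), so the preprocessing helpers can be built on their own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert an 8-bit grayscale image (row-major, height * width bytes) to floats.
// Assumption: the model expects values scaled to [0, 1].
std::vector<float> to_float_image(const std::vector<std::uint8_t>& pixels) {
    std::vector<float> out;
    out.reserve(pixels.size());
    for (std::uint8_t p : pixels) {
        out.push_back(static_cast<float>(p) / 255.0f);
    }
    return out;
}

// Shape of one image tensor.
// Assumption: NCHW layout with batch size 1 and a single channel.
std::array<std::int64_t, 4> image_shape(std::int64_t height, std::int64_t width) {
    return {1, 1, height, width};
}

#if __has_include(<onnxruntime_cxx_api.h>)
#include <onnxruntime_cxx_api.h>

// Load the model on the CUDA execution provider, feed two images, and run once.
// Returns the number of output tensors produced by the run.
std::size_t run_pair(const char* model_path,
                     std::vector<float>& img0, std::vector<float>& img1,
                     std::int64_t height, std::int64_t width) {
    assert(img0.size() == static_cast<std::size_t>(height * width));
    assert(img1.size() == static_cast<std::size_t>(height * width));

    Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "repro");
    Ort::SessionOptions options;
    OrtCUDAProviderOptions cuda_options{};  // library defaults, device 0
    options.AppendExecutionProvider_CUDA(cuda_options);
    Ort::Session session(env, model_path, options);

    // Query input and output names from the model instead of hard-coding them.
    Ort::AllocatorWithDefaultOptions allocator;
    std::vector<Ort::AllocatedStringPtr> holders;  // keeps the name strings alive
    std::vector<const char*> input_names, output_names;
    for (std::size_t i = 0; i < session.GetInputCount(); ++i) {
        holders.push_back(session.GetInputNameAllocated(i, allocator));
        input_names.push_back(holders.back().get());
    }
    for (std::size_t i = 0; i < session.GetOutputCount(); ++i) {
        holders.push_back(session.GetOutputNameAllocated(i, allocator));
        output_names.push_back(holders.back().get());
    }
    assert(input_names.size() == 2);  // assumption: exactly two image inputs

    // Wrap the caller-owned CPU buffers; ONNX Runtime copies them to the GPU.
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    const auto shape = image_shape(height, width);
    std::vector<Ort::Value> inputs;
    inputs.push_back(Ort::Value::CreateTensor<float>(
        mem, img0.data(), img0.size(), shape.data(), shape.size()));
    inputs.push_back(Ort::Value::CreateTensor<float>(
        mem, img1.data(), img1.size(), shape.data(), shape.size()));

    std::vector<Ort::Value> outputs = session.Run(
        Ort::RunOptions{nullptr}, input_names.data(), inputs.data(), inputs.size(),
        output_names.data(), output_names.size());
    return outputs.size();
}
#endif
```

A caller would decode two grayscale images with a library of its choice, pass each through `to_float_image`, and then call `run_pair` with the path to the `.onnx` file and the image dimensions. The output count is returned rather than the tensors themselves so that no `Ort::Value` outlives the session that allocated it.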

Urgency

No

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

16.3

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8