Out of memory in camb device when calling conv2d backward

Question

yewentao256 opened this issue 2 years ago · 2 comments

When DIPU is compiled in release mode, running the code below causes an out-of-memory (OOM) error on the camb device.

import torch
import torch.nn as nn
import torch_dipu  # enables the DIPU backend used below

# Small input batch: N=2, C=3, H=12, W=12
input_data = torch.randn(2, 3, 12, 12).cuda()

conv2d_layer = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), dilation=(1, 1), groups=1).cuda()

# Replace the layer's parameters with explicitly created tensors
weight = torch.randn(4, 3, 7, 7).cuda()
bias = torch.zeros(4).cuda()

conv2d_layer.weight.data = weight
conv2d_layer.bias.data = bias

# Forward pass, then backward; the error is raised inside backward()
output = conv2d_layer(input_data)
loss = output.sum()
loss.backward()

The error is:

2023-07-21 15:14:04.069147: [cnrtError] [50200] [Card: 0] Error occurred during calling 'cnMalloc' in CNDrv interface.
2023-07-21 15:14:04.069180: [cnrtError] [50200] [Card: 0] Return value is 100100, CN_MEMORY_ERROR_OUT_OF_MEMORY.
2023-07-21 15:14:04.069188: [cnrtError] [50200] [Card: 0] cnrtMalloc: Malloc MLU device memory failed.
Traceback (most recent call last):
  File "test_conv2d.py", line 26, in <module>
    loss.backward()
  File "/mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: call cnrt error, expr = ::cnrtGetLastError(), ret = 100100
Exception raised from checkLastError at /mnt/lustre/yewentao/dipu/torch_dipu/csrc_dipu/vendor/camb/cnrt_6.5.0/deviceimpl.cpp:54 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x58 (0x7f18f0c305c8 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xa0 (0x7f18f0c2b830 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: dipu::devapis::checkLastError() + 0x38e (0x7f1849e88ede in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #3: dipu::devapis::mallocDevice(void**, unsigned long, bool) + 0x43 (0x7f1849e96953 in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #4: dipu::native::DIPUATenFunctions::empty(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x25b (0x7f1849e709fb in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #5: <unknown function> + 0x1390f8 (0x7f1849d660f8 in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #6: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x150 (0x7f18f2a3db50 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x1ebc8df (0x7f18f2d378df in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: at::_ops::empty_memory_format::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x14c (0x7f18f2a733fc in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0x117 (0x7f1849d4af37 in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #10: dipu::native::dipu_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, std::array<bool, 3ul>) + 0x322 (0x7f1849db3862 in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #11: dipu::native::DipuConv2dFunction::backward(torch::autograd::AutogradContext*, std::vector<at::Tensor, std::allocator<at::Tensor> >) + 0x4aa (0x7f1849e5530a in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #12: torch::autograd::CppNode<dipu::native::DipuConv2dFunction>::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x233 (0x7f1849e65943 in /mnt/lustre/yewentao/dipu/torch_dipu/libtorch_dipu.so)
frame #13: <unknown function> + 0x3c5dbd7 (0x7f18f4ad8bd7 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1145 (0x7f18f4ad4085 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x5a9 (0x7f18f4ad4f39 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f18f4acb509 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x53 (0x7f18fba03d93 in /mnt/lustre/share/parrotsci/github/cibuild/pytorchbase/c263bd43e8e8502d4726643bc6fd046f0130ac0e/install_path/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0xb9f4f (0x7f1903bf2f4f in /mnt/lustre/share/platform/dep/gcc-7.5/lib64/libstdc++.so.6)
frame #19: <unknown function> + 0x7dd5 (0x7f1903924dd5 in /usr/lib64/libpthread.so.0)
frame #20: clone + 0x6d (0x7f1902f44ead in /usr/lib64/libc.so.6)

Answer 1 · 2023-07-21T10:06:50.000Z

This PR has already fixed the issue: https://github.com/DeepLink-org/DIPU/pull/209/files

Answer 2 · 2023-07-24T06:43:25.000Z

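The root cause has been identified:

In PyTorch, IntArrayRef does not own its data. It is only a reference and relies on the underlying data outliving it, so it must not be created with c10::IntArrayRef bias_size = { grad_output.size(1) };. The braced list is a temporary that is destroyed at the end of that statement, which leaves bias_size pointing at dead storage. This is consistent with the stack trace above: dipu_convolution_backward calls at::empty, most likely with a garbage size read from the dangling reference, and the resulting huge allocation fails in cnrtMalloc. Thanks for the fix.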
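
The lifetime problem described above can be sketched without PyTorch. The following is a minimal illustration, not the code from the PR: IntView is a hypothetical stand-in for a non-owning view such as c10::IntArrayRef, and numel, bias_numel_owned and bias_numel_inline are names made up for this example. It shows the dangling pattern as a comment only (running it would be undefined behavior) and two ways to keep the data alive for as long as the view is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical non-owning view, modeled loosely on c10::IntArrayRef.
// It stores only a pointer and a length; it never copies the elements.
struct IntView {
    const int64_t* data = nullptr;
    std::size_t size = 0;

    IntView() = default;
    // View over a vector: valid only while the vector is alive and unchanged.
    IntView(const std::vector<int64_t>& v) : data(v.data()), size(v.size()) {}
    // View over a braced list: the backing array is a temporary that dies at
    // the end of the full expression in which the list was created.
    IntView(std::initializer_list<int64_t> il) : data(il.begin()), size(il.size()) {}

    int64_t operator[](std::size_t i) const { return data[i]; }
};

// Number of elements implied by a shape.
int64_t numel(IntView shape) {
    assert(shape.data != nullptr || shape.size == 0);
    int64_t n = 1;
    for (std::size_t i = 0; i < shape.size; ++i) n *= shape[i];
    return n;
}

// Dangling pattern (do not do this):
//   IntView bias_size = { channels };  // temporary array is gone after this line
//   numel(bias_size);                  // reads dead storage: undefined behavior

// Safe pattern 1: keep the sizes in an owning container that outlives the view.
int64_t bias_numel_owned(int64_t channels) {
    std::vector<int64_t> bias_size = {channels};
    return numel(bias_size);  // view is created and used while the vector is alive
}

// Safe pattern 2: pass the braced list directly as a call argument, so the
// temporary array lives until the call has returned.
int64_t bias_numel_inline(int64_t channels) {
    return numel({channels});
}
```

Both safe variants return the channel count they were given. The dangling variant may appear to work in a debug build because the stack slot has not been reused yet, which would explain why the error only showed up in release mode.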