Error "illegal memory access was encountered" during U-Net training

Question

davidkh1 opened this issue 8 years ago · 15 comments

Thanks so much for sharing your code! I'm trying to run it from the start, but I have a problem during the training phase. I'd appreciate your support in finding the root cause.

The command I run to train the U-Net (paths are adjusted to the defaults):
$ th main.lua
produces this error log:

Setting up data loader using data/train.h5  
Data loader setup done! 
...
Epoch : 1, Learning Rate : 1.00000  
THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCStorage.c:147

Environment: Ubuntu 14.04, Titan X, CUDA 7.5, cuDNN v.5

Possible root causes:

I tried temporarily removing the SpatialMaxPooling module, following this discussion: https://groups.google.com/forum/m/#!msg/torch7/Ru-I6vP2ql0/s2vOsKoVBgAJ
Finally, I simplified the network to include no modules at all, but the problem persists, so SpatialMaxPooling is not the cause.
I suspect the dataset I created in HDF5 format has problems. I'll try to check its correctness. If you know how to check it, please advise.
I recently switched to cuDNN v.5. Could this version be problematic?
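
On checking the HDF5 file: one quick first test, before inspecting the datasets themselves, is whether the file begins with the 8-byte HDF5 format signature. The sketch below is illustrative only. The function name `has_hdf5_signature` is made up here, and it assumes the superblock sits at offset 0 (the common case, although the format also allows a user block before it). Passing this check shows only that the file header is intact, not that the stored data is valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file starts with the HDF5 signature
# bytes 89 48 44 46 0d 0a 1a 0a ("\x89HDF\r\n\x1a\n").
# Assumption: no user block precedes the superblock.
has_hdf5_signature() {
    local sig
    # Dump the first 8 bytes as hex and strip od's spacing.
    sig=$(head -c 8 "$1" | od -An -tx1 | tr -d ' \t\n')
    [ "$sig" = "894844460d0a1a0a" ]
}

# Usage with the path from the training log:
#   has_hdf5_signature data/train.h5 && echo "signature OK"
```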

Thanks!

Answer 1 · 2016-08-01T13:41:07.000Z

"I recently switched to cuDNN v.5. Could this version be problematic?"

Did you also update the cudnn Torch package after upgrading?

Answer 2 · 2016-08-01T13:52:06.000Z

Yes, I even reinstalled Torch from scratch.
I'm going to verify the integrity of the PNG image files, and then of the HDF5 file I created. If I find no problems, I'll downgrade to cuDNN v.4 and reinstall Torch with cudnn.

Answer 3 · 2016-08-01T14:40:13.000Z

I've verified the integrity of the PNG image files with pngcheck; both the training and mask images were OK.
To verify the HDF5 input file I used hdfview and h5stat. Does this look correct?

$ h5stat train.h5

Filename: train.h5
File information
    # of unique groups: 1
    # of unique datasets: 94
    # of unique named datatypes: 0
    # of unique links: 0
    # of unique other: 0
    Max. # of links to object: 1
    Max. # of objects in group: 94
...
Summary of file space information:
  File metadata: 35888 bytes
  Raw data: 10981488000 bytes
  Unaccounted space: 3888 bytes
Total space: 10981527776 bytes
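
One simple consistency check on this report is arithmetic: the three components in the summary should add up to the total, and here they do (35888 + 10981488000 + 3888 = 10981527776). A small sketch of that check follows. The function names are invented for illustration, and it assumes the summary labels appear exactly as in the output above. It says nothing about whether the stored tensor contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking an h5stat report read from stdin.

# Print the byte count for one label inside the summary section.
# $1 = report text, $2 = label, e.g. "Raw data:"
summary_bytes() {
    printf '%s\n' "$1" | awk -v label="$2" '
        /Summary of file space information/ { in_summary = 1 }
        in_summary && index($0, label) { print $(NF - 1); exit }'
}

# Print "OK <total>" when metadata + raw data + unaccounted space == total.
check_h5stat_summary() {
    local report meta raw unacc total
    report=$(cat)
    meta=$(summary_bytes "$report" "File metadata:")
    raw=$(summary_bytes "$report" "Raw data:")
    unacc=$(summary_bytes "$report" "Unaccounted space:")
    total=$(summary_bytes "$report" "Total space:")
    if [ -z "$meta" ] || [ -z "$raw" ] || [ -z "$unacc" ] || [ -z "$total" ]; then
        echo "INCOMPLETE"
        return 2
    fi
    if [ "$((meta + raw + unacc))" -eq "$total" ]; then
        echo "OK $total"
    else
        echo "MISMATCH $((meta + raw + unacc)) != $total"
        return 1
    fi
}

# Usage: h5stat train.h5 | check_h5stat_summary
```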

How much memory does U-Net training use?

Answer 4 · 2016-08-01T15:08:25.000Z

With the default configuration (batch size 64), it takes around 6 GB of GPU memory and around 8 GB of system memory (RAM).

Answer 5 · 2016-08-01T15:24:01.000Z

My GPU has enough memory (12 GB). I tried a batch size of 1 but got the same problem.

Answer 6 · 2016-08-01T15:57:59.000Z

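require 'nn'
require 'cunn'

-- Minimal check: a single SpatialSoftMax layer moved to the GPU
softOutCalc = nn.Sequential():add(nn.SpatialSoftMax())
softOutCalc = softOutCalc:cuda()
-- Random batch of 8 two-channel 80x80 inputs, copied to the GPU
ips = torch.rand(8,2,80,80)
ips = ips:cuda()
softOutCalc:forward(ips)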

Can you check if this piece of code works?

Answer 7 · 2016-08-01T16:26:02.000Z

The same error!
th> softOutCalc:forward(ips)

THCudaCheck FAIL file=/home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c line=81 error=77 : an illegal memory access was encountered
/home/david/torch/install/bin/luajit: /home/david/torch/install/share/lua/5.1/torch/Tensor.lua:201: cuda runtime error (77) : an illegal memory access was encountered at /home/david/torch/extra/cutorch/lib/THC/generic/THCTensorCopy.c:81
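
A note on reading this trace: CUDA kernel launches are asynchronous, so an illegal memory access is often reported by a later call (here a tensor copy) rather than by the kernel that caused it. Setting the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` makes launches synchronous, which usually moves the reported location closer to the real fault. A minimal sketch, where the wrapper name `run_cuda_blocking` is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run any command with synchronous CUDA kernel launches.
# CUDA_LAUNCH_BLOCKING is a standard CUDA runtime environment variable.
run_cuda_blocking() {
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Usage with the training command from this thread:
#   run_cuda_blocking th main.lua
# With the CUDA toolkit installed, cuda-memcheck may also name the faulting kernel:
#   cuda-memcheck th main.lua
```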

Answer 8 · 2016-08-01T16:34:36.000Z

I ran test.sh from the Torch installation, and some tests failed:

 33/154 frac2 ........................................................... [PASS]
 34/154 trace ........................................................... [WAIT]{
  input : FloatTensor - size: 487x402
}
 34/154 trace ........................................................... [FAIL]

....

 96/152 SpatialDilatedConvolution_backward_single ....................... [FAIL]

Googling for this.

Answer 9 · 2016-08-01T16:42:28.000Z

These 2 failed tests seem unimportant.

Answer 10 · 2016-08-01T18:00:03.000Z

See torch/cunn#292: I had a similar problem. The issue is with the installation of Torch, cunn, and their libraries. It was solved when we upgraded all the drivers. You can reopen that issue and see whether we get any support.

Answer 11 · 2016-08-01T18:25:26.000Z

Could you tell me which driver version you use, please?

Answer 12 · 2016-08-02T06:30:42.000Z

Our setup: Ubuntu 14.04, TITAN X, CUDA 7.5, cuDNN v5, and NVIDIA driver version 361.45.11.

Answer 13 · 2016-08-02T06:41:37.000Z

I use exactly the same setup, except for an older driver, v. 352.93. I'm going to update. Thanks for your significant help in finding the root cause of the problem. Looking forward to running U-Net training.
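
To compare the two driver versions mentioned here (352.93 installed, 361.45.11 on the other setup), a version-aware comparison is needed, since the strings have different numbers of components. A minimal sketch, assuming GNU `sort` with the `-V` option is available; `version_lt` is a name made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage on a machine with the NVIDIA driver installed:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_lt "$installed" 361.45.11 && echo "driver older than 361.45.11"
```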

Answer 14 · 2016-08-02T08:03:47.000Z

I got the same error too, even with the latest driver. We have reopened the cunn issue.

I have made a quick fix/hack to get it working. Can you check if it works now?

Answer 15 · 2016-08-02T08:43:28.000Z

Your fix helped, even without updating the NVIDIA driver. Thanks!