quanpn90/NMTGMinor

translator leads to "CUDA error: device-side assert triggered" when using cnn downsampling for ASR

Opened this issue · 2 comments

Hi Quan,
I am trying to train an ASR system on WSJ with your toolkit, using a setup similar to the one in your ICASSP paper on the swbd corpus. With plain fbank features, training and decoding both work fine. However, when I apply CNN downsampling to the fbank features before the decoder, training still works, but decoding triggers many CUDA assertion failures, for example:

data1/tools/pytorch/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = false]: block: [64,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.

The stack trace is below:

  File "translate.py", line 364, in <module>
    main()
  File "translate.py", line 227, in main
    predBatch, predScore, predLength, goldScore, numGoldWords,allGoldScores = translator.translate_asr(srcBatch, tgtBatch)
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in translate_asr
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 464, in <listcomp>
    for n in range(self.opt.n_best)]
  File "/data1/NMTGMinor/onmt/EnsembleTranslator.py", line 273, in build_target_tokens
    tokens = self.tgt_dict.convertToLabels(pred, onmt.Constants.EOS)
  File "/data1/NMTGMinor/onmt/Dict.py", line 166, in convertToLabels
    print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', idx)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 71, in __repr__
    return torch._tensor_str._str(self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 286, in _str
    tensor_str = _tensor_str(self, indent)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 201, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/_tensor_str.py", line 83, in __init__
    value_str = '{}'.format(value)
  File "/data1/tools/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 387, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered

Do you have any idea what causes this error? The model converges to a loss similar to the one obtained with plain fbank features, so I think a problem with the model itself is unlikely. Thanks.
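A note on reading the assertion quoted above: `srcIndex < srcSelectDimSize` fails when an index passed to an index-select style lookup is at least as large as the dimension being indexed. CUDA kernels are launched asynchronously, so the Python trace shows where the error surfaced (the `print` that forces a device-to-host copy), not necessarily where the bad index was produced. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually moves the trace closer to the failing call. The following is a minimal, plain-Python sketch of the same range check done on the host; the function name, vocabulary size, and token ids are hypothetical and are not taken from NMTGMinor.

```python
def out_of_range_positions(indices, dim_size):
    """Return (position, index) pairs that a lookup into a dimension
    of length dim_size would reject (negative or >= dim_size)."""
    return [(pos, idx) for pos, idx in enumerate(indices)
            if idx < 0 or idx >= dim_size]


# Hypothetical values: a 10-entry vocabulary and one invalid token id.
vocab_size = 10
token_ids = [2, 7, 10, 3]

bad = out_of_range_positions(token_ids, vocab_size)
print(bad)  # [(2, 10)]
```

Applying a check like this to the indices just before the failing lookup (after copying them to the CPU) can show which value is out of range.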

Thank you for the question.

The CNN downsampling was added only recently, and I have not been able to test it for lack of time.

Possibly the mask creation step during decoding is not done correctly. You can try decoding with batch size 1 to see whether that works.
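To illustrate the kind of mask bookkeeping involved: when convolutions shorten the time axis, the padding mask has to be built from the downsampled lengths rather than the original frame counts. The sketch below uses the standard convolution output-length formula (dilation 1). The layer configuration, function names, and utterance lengths are assumptions chosen for illustration; this is not the toolkit's actual code.

```python
def conv_out_len(length, kernel, stride, padding=0):
    """Output length of a 1-D convolution along time (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def downsampled_pad_mask(frame_lengths, layers):
    """Push each utterance length through the conv stack, then build a
    padding mask where True marks a padded (invalid) time step."""
    out_lengths = []
    for n in frame_lengths:
        for kernel, stride, padding in layers:
            n = conv_out_len(n, kernel, stride, padding)
        out_lengths.append(n)
    max_len = max(out_lengths)
    mask = [[t >= n for t in range(max_len)] for n in out_lengths]
    return out_lengths, mask


# Hypothetical stack: two layers, kernel 3, stride 2, padding 1.
layers = [(3, 2, 1), (3, 2, 1)]
lengths, mask = downsampled_pad_mask([100, 37], layers)
print(lengths)  # [25, 10]
```

If the mask were instead built from the original lengths (100 and 37 here), it would not match the 25-step encoder output, which is one way a mask mismatch of this kind could arise.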

Hi Quan,
Thanks for the reply. I am currently using batch_size=1 and have tracked the problem down: when I print the decoder_output of the first step, only the first beam has normal log_posteriors; the log_prob values for all other beams at this step are NaN.

Thanks for the hint and I will look into that.
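For anyone hitting the same symptom: one common way NaN log-probabilities appear is a softmax over a row in which every position is masked to negative infinity, which can happen when a mask covers an entire beam. This has not been confirmed as the cause in this thread. The plain-Python sketch below reproduces the effect and reports which beams contain NaN; the helper names and scores are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one row of scores."""
    m = max(scores)
    shifted = [s - m for s in scores]  # (-inf) - (-inf) is NaN
    log_z = math.log(sum(math.exp(v) for v in shifted))
    return [v - log_z for v in shifted]


def nan_beams(rows):
    """Indices of rows (beams) that contain at least one NaN."""
    return [i for i, row in enumerate(rows)
            if any(math.isnan(v) for v in row)]


inf = float("inf")
# Hypothetical scores: beam 0 has valid positions, beam 1 is fully masked.
beam_scores = [
    [0.5, -inf, 1.0],
    [-inf, -inf, -inf],
]
log_probs = [log_softmax(row) for row in beam_scores]
print(nan_beams(log_probs))  # [1]
```

A partially masked row stays finite at its valid positions, so a beam that is NaN everywhere points to a mask (or an input) that is invalid for the whole row.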