Error: Cublas Error: 13

Question

jie-tu914 opened this issue 3 years ago · 0 comments

Bug description

When I tried to train an RNN model with marian-dev, training aborted with the error below. I had not made any changes to the code.

Error: Cublas Error: 13 - /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm)

My Marian version is v1.11.0 and my CUDA version is 10.1.
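For reference, the number in "Cublas Error: 13" is a raw `cublasStatus_t` value. The snippet below is a minimal sketch for turning it into a readable name. The table is reproduced from the cuBLAS documentation as I remember it, so it is worth checking against the `cublas_api.h` shipped with your CUDA install. The helper name `cublas_status_name` is made up for illustration.

```python
# cublasStatus_t values as listed in the cuBLAS documentation.
# Assumption: copied from memory; verify against your local cublas_api.h.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def cublas_status_name(code: int) -> str:
    """Map a numeric cuBLAS status (hypothetical helper) to its enum name."""
    return CUBLAS_STATUS.get(code, f"unknown cuBLAS status {code}")


print(cublas_status_name(13))
```

Under that table, code 13 is `CUBLAS_STATUS_EXECUTION_FAILED`: the GPU program failed to execute. The code alone does not say why the `cublasGemmEx` call failed.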

The log shows the following error:

[2022-02-27 14:16:46] [marian] Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800
[2022-02-27 14:16:46] [marian] Running on moses-Precision-Tower-7910 as process 7754 with command line:
[2022-02-27 14:16:46] [marian] /home/moses/moses/tj/marian/marian/build/marian --sync-sgd --model model/model.npz -T . --devices 0 --train-sets data/train.bpe.de data/train.bpe.en --vocabs data/train.bpe.de.json data/train.bpe.en.json --mini-batch-fit -w 3000 --dim-vocabs 50000 50000 --layer-normalization --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 --learn-rate 0.0001 --after-epochs 0 --early-stopping 5 --valid-freq 20000 --save-freq 20000 --disp-freq 2000 --valid-mini-batch 8 --valid-sets data/dev.bpe.de data/dev.bpe.en --valid-metrics cross-entropy perplexity translation --valid-translation-output model/dev.out --valid-script-path ./score-dev.sh --seed 1111 --exponential-smoothing --normalize=1 --beam-size=12 --quiet-translation --log model/train.log --valid-log model/valid.log
[2022-02-27 14:16:46] [config] after: 0e
[2022-02-27 14:16:46] [config] after-batches: 0
[2022-02-27 14:16:46] [config] after-epochs: 0
[2022-02-27 14:16:46] [config] all-caps-every: 0
[2022-02-27 14:16:46] [config] allow-unk: false
[2022-02-27 14:16:46] [config] authors: false
[2022-02-27 14:16:46] [config] beam-size: 12
[2022-02-27 14:16:46] [config] bert-class-symbol: "[CLS]"
[2022-02-27 14:16:46] [config] bert-mask-symbol: "[MASK]"
[2022-02-27 14:16:46] [config] bert-masking-fraction: 0.15
[2022-02-27 14:16:46] [config] bert-sep-symbol: "[SEP]"
[2022-02-27 14:16:46] [config] bert-train-type-embeddings: true
[2022-02-27 14:16:46] [config] bert-type-vocab-size: 2
[2022-02-27 14:16:46] [config] build-info: ""
[2022-02-27 14:16:46] [config] check-gradient-nan: false
[2022-02-27 14:16:46] [config] check-nan: false
[2022-02-27 14:16:46] [config] cite: false
[2022-02-27 14:16:46] [config] clip-norm: 1
[2022-02-27 14:16:46] [config] cost-scaling:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] cost-type: ce-sum
[2022-02-27 14:16:46] [config] cpu-threads: 0
[2022-02-27 14:16:46] [config] data-threads: 8
[2022-02-27 14:16:46] [config] data-weighting: ""
[2022-02-27 14:16:46] [config] data-weighting-type: sentence
[2022-02-27 14:16:46] [config] dec-cell: gru
[2022-02-27 14:16:46] [config] dec-cell-base-depth: 2
[2022-02-27 14:16:46] [config] dec-cell-high-depth: 1
[2022-02-27 14:16:46] [config] dec-depth: 1
[2022-02-27 14:16:46] [config] devices:
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] dim-emb: 512
[2022-02-27 14:16:46] [config] dim-rnn: 1024
[2022-02-27 14:16:46] [config] dim-vocabs:
[2022-02-27 14:16:46] [config] - 50000
[2022-02-27 14:16:46] [config] - 50000
[2022-02-27 14:16:46] [config] disp-first: 0
[2022-02-27 14:16:46] [config] disp-freq: 2000
[2022-02-27 14:16:46] [config] disp-label-counts: true
[2022-02-27 14:16:46] [config] dropout-rnn: 0.2
[2022-02-27 14:16:46] [config] dropout-src: 0.1
[2022-02-27 14:16:46] [config] dropout-trg: 0.1
[2022-02-27 14:16:46] [config] dump-config: ""
[2022-02-27 14:16:46] [config] dynamic-gradient-scaling:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] early-stopping: 5
[2022-02-27 14:16:46] [config] early-stopping-on: first
[2022-02-27 14:16:46] [config] embedding-fix-src: false
[2022-02-27 14:16:46] [config] embedding-fix-trg: false
[2022-02-27 14:16:46] [config] embedding-normalization: false
[2022-02-27 14:16:46] [config] embedding-vectors:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] enc-cell: gru
[2022-02-27 14:16:46] [config] enc-cell-depth: 1
[2022-02-27 14:16:46] [config] enc-depth: 1
[2022-02-27 14:16:46] [config] enc-type: bidirectional
[2022-02-27 14:16:46] [config] english-title-case-every: 0
[2022-02-27 14:16:46] [config] exponential-smoothing: 0.0001
[2022-02-27 14:16:46] [config] factor-weight: 1
[2022-02-27 14:16:46] [config] factors-combine: sum
[2022-02-27 14:16:46] [config] factors-dim-emb: 0
[2022-02-27 14:16:46] [config] gradient-checkpointing: false
[2022-02-27 14:16:46] [config] gradient-norm-average-window: 100
[2022-02-27 14:16:46] [config] guided-alignment: none
[2022-02-27 14:16:46] [config] guided-alignment-cost: mse
[2022-02-27 14:16:46] [config] guided-alignment-weight: 0.1
[2022-02-27 14:16:46] [config] ignore-model-config: false
[2022-02-27 14:16:46] [config] input-types:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] interpolate-env-vars: false
[2022-02-27 14:16:46] [config] keep-best: false
[2022-02-27 14:16:46] [config] label-smoothing: 0
[2022-02-27 14:16:46] [config] layer-normalization: true
[2022-02-27 14:16:46] [config] learn-rate: 0.0001
[2022-02-27 14:16:46] [config] lemma-dependency: ""
[2022-02-27 14:16:46] [config] lemma-dim-emb: 0
[2022-02-27 14:16:46] [config] log: model/train.log
[2022-02-27 14:16:46] [config] log-level: info
[2022-02-27 14:16:46] [config] log-time-zone: ""
[2022-02-27 14:16:46] [config] logical-epoch:
[2022-02-27 14:16:46] [config] - 1e
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] lr-decay: 0
[2022-02-27 14:16:46] [config] lr-decay-freq: 50000
[2022-02-27 14:16:46] [config] lr-decay-inv-sqrt:
[2022-02-27 14:16:46] [config] - 0
[2022-02-27 14:16:46] [config] lr-decay-repeat-warmup: false
[2022-02-27 14:16:46] [config] lr-decay-reset-optimizer: false
[2022-02-27 14:16:46] [config] lr-decay-start:
[2022-02-27 14:16:46] [config] - 10
[2022-02-27 14:16:46] [config] - 1
[2022-02-27 14:16:46] [config] lr-decay-strategy: epoch+stalled
[2022-02-27 14:16:46] [config] lr-report: false
[2022-02-27 14:16:46] [config] lr-warmup: 0
[2022-02-27 14:16:46] [config] lr-warmup-at-reload: false
[2022-02-27 14:16:46] [config] lr-warmup-cycle: false
[2022-02-27 14:16:46] [config] lr-warmup-start-rate: 0
[2022-02-27 14:16:46] [config] max-length: 50
[2022-02-27 14:16:46] [config] max-length-crop: false
[2022-02-27 14:16:46] [config] max-length-factor: 3
[2022-02-27 14:16:46] [config] maxi-batch: 100
[2022-02-27 14:16:46] [config] maxi-batch-sort: trg
[2022-02-27 14:16:46] [config] mini-batch: 64
[2022-02-27 14:16:46] [config] mini-batch-fit: true
[2022-02-27 14:16:46] [config] mini-batch-fit-step: 10
[2022-02-27 14:16:46] [config] mini-batch-round-up: true
[2022-02-27 14:16:46] [config] mini-batch-track-lr: false
[2022-02-27 14:16:46] [config] mini-batch-warmup: 0
[2022-02-27 14:16:46] [config] mini-batch-words: 0
[2022-02-27 14:16:46] [config] mini-batch-words-ref: 0
[2022-02-27 14:16:46] [config] model: model/model.npz
[2022-02-27 14:16:46] [config] multi-loss-type: sum
[2022-02-27 14:16:46] [config] n-best: false
[2022-02-27 14:16:46] [config] no-nccl: false
[2022-02-27 14:16:46] [config] no-reload: false
[2022-02-27 14:16:46] [config] no-restore-corpus: false
[2022-02-27 14:16:46] [config] normalize: 1
[2022-02-27 14:16:46] [config] normalize-gradient: false
[2022-02-27 14:16:46] [config] num-devices: 0
[2022-02-27 14:16:46] [config] optimizer: adam
[2022-02-27 14:16:46] [config] optimizer-delay: 1
[2022-02-27 14:16:46] [config] optimizer-params:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] output-omit-bias: false
[2022-02-27 14:16:46] [config] overwrite: false
[2022-02-27 14:16:46] [config] precision:
[2022-02-27 14:16:46] [config] - float32
[2022-02-27 14:16:46] [config] - float32
[2022-02-27 14:16:46] [config] pretrained-model: ""
[2022-02-27 14:16:46] [config] quantize-biases: false
[2022-02-27 14:16:46] [config] quantize-bits: 0
[2022-02-27 14:16:46] [config] quantize-log-based: false
[2022-02-27 14:16:46] [config] quantize-optimization-steps: 0
[2022-02-27 14:16:46] [config] quiet: false
[2022-02-27 14:16:46] [config] quiet-translation: true
[2022-02-27 14:16:46] [config] relative-paths: false
[2022-02-27 14:16:46] [config] right-left: false
[2022-02-27 14:16:46] [config] save-freq: 20000
[2022-02-27 14:16:46] [config] seed: 1111
[2022-02-27 14:16:46] [config] sentencepiece-alphas:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] sentencepiece-max-lines: 2000000
[2022-02-27 14:16:46] [config] sentencepiece-options: ""
[2022-02-27 14:16:46] [config] sharding: global
[2022-02-27 14:16:46] [config] shuffle: data
[2022-02-27 14:16:46] [config] shuffle-in-ram: false
[2022-02-27 14:16:46] [config] sigterm: save-and-exit
[2022-02-27 14:16:46] [config] skip: false
[2022-02-27 14:16:46] [config] sqlite: ""
[2022-02-27 14:16:46] [config] sqlite-drop: false
[2022-02-27 14:16:46] [config] sync-freq: 200u
[2022-02-27 14:16:46] [config] sync-sgd: true
[2022-02-27 14:16:46] [config] tempdir: .
[2022-02-27 14:16:46] [config] tied-embeddings: false
[2022-02-27 14:16:46] [config] tied-embeddings-all: false
[2022-02-27 14:16:46] [config] tied-embeddings-src: false
[2022-02-27 14:16:46] [config] train-embedder-rank:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] train-sets:
[2022-02-27 14:16:46] [config] - data/train.bpe.de
[2022-02-27 14:16:46] [config] - data/train.bpe.en
[2022-02-27 14:16:46] [config] transformer-aan-activation: swish
[2022-02-27 14:16:46] [config] transformer-aan-depth: 2
[2022-02-27 14:16:46] [config] transformer-aan-nogate: false
[2022-02-27 14:16:46] [config] transformer-decoder-autoreg: self-attention
[2022-02-27 14:16:46] [config] transformer-decoder-dim-ffn: 0
[2022-02-27 14:16:46] [config] transformer-decoder-ffn-depth: 0
[2022-02-27 14:16:46] [config] transformer-depth-scaling: false
[2022-02-27 14:16:46] [config] transformer-dim-aan: 2048
[2022-02-27 14:16:46] [config] transformer-dim-ffn: 2048
[2022-02-27 14:16:46] [config] transformer-dropout: 0
[2022-02-27 14:16:46] [config] transformer-dropout-attention: 0
[2022-02-27 14:16:46] [config] transformer-dropout-ffn: 0
[2022-02-27 14:16:46] [config] transformer-ffn-activation: swish
[2022-02-27 14:16:46] [config] transformer-ffn-depth: 2
[2022-02-27 14:16:46] [config] transformer-guided-alignment-layer: last
[2022-02-27 14:16:46] [config] transformer-heads: 8
[2022-02-27 14:16:46] [config] transformer-no-projection: false
[2022-02-27 14:16:46] [config] transformer-pool: false
[2022-02-27 14:16:46] [config] transformer-postprocess: dan
[2022-02-27 14:16:46] [config] transformer-postprocess-emb: d
[2022-02-27 14:16:46] [config] transformer-postprocess-top: ""
[2022-02-27 14:16:46] [config] transformer-preprocess: ""
[2022-02-27 14:16:46] [config] transformer-tied-layers:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] transformer-train-position-embeddings: false
[2022-02-27 14:16:46] [config] tsv: false
[2022-02-27 14:16:46] [config] tsv-fields: 0
[2022-02-27 14:16:46] [config] type: amun
[2022-02-27 14:16:46] [config] ulr: false
[2022-02-27 14:16:46] [config] ulr-dim-emb: 0
[2022-02-27 14:16:46] [config] ulr-dropout: 0
[2022-02-27 14:16:46] [config] ulr-keys-vectors: ""
[2022-02-27 14:16:46] [config] ulr-query-vectors: ""
[2022-02-27 14:16:46] [config] ulr-softmax-temperature: 1
[2022-02-27 14:16:46] [config] ulr-trainable-transformation: false
[2022-02-27 14:16:46] [config] unlikelihood-loss: false
[2022-02-27 14:16:46] [config] valid-freq: 20000
[2022-02-27 14:16:46] [config] valid-log: model/valid.log
[2022-02-27 14:16:46] [config] valid-max-length: 1000
[2022-02-27 14:16:46] [config] valid-metrics:
[2022-02-27 14:16:46] [config] - cross-entropy
[2022-02-27 14:16:46] [config] - perplexity
[2022-02-27 14:16:46] [config] - translation
[2022-02-27 14:16:46] [config] valid-mini-batch: 8
[2022-02-27 14:16:46] [config] valid-reset-stalled: false
[2022-02-27 14:16:46] [config] valid-script-args:
[2022-02-27 14:16:46] [config] []
[2022-02-27 14:16:46] [config] valid-script-path: ./score-dev.sh
[2022-02-27 14:16:46] [config] valid-sets:
[2022-02-27 14:16:46] [config] - data/dev.bpe.de
[2022-02-27 14:16:46] [config] - data/dev.bpe.en
[2022-02-27 14:16:46] [config] valid-translation-output: model/dev.out
[2022-02-27 14:16:46] [config] vocabs:
[2022-02-27 14:16:46] [config] - data/train.bpe.de.json
[2022-02-27 14:16:46] [config] - data/train.bpe.en.json
[2022-02-27 14:16:46] [config] word-penalty: 0
[2022-02-27 14:16:46] [config] word-scores: false
[2022-02-27 14:16:46] [config] workspace: 3000
[2022-02-27 14:16:46] [config] Model is being created with Marian v1.11.0 f00d062 2022-02-08 08:39:24 -0800
[2022-02-27 14:16:46] Using synchronous SGD
[2022-02-27 14:16:46] [comm] Compiled without MPI support. Running as a single process on moses-Precision-Tower-7910
[2022-02-27 14:16:46] Synced seed 1111
[2022-02-27 14:16:46] [data] Loading vocabulary from JSON/Yaml file data/train.bpe.de.json
[2022-02-27 14:16:46] [data] Using unused word id eos for 0
[2022-02-27 14:16:46] [data] Using unused word id UNK for 1
[2022-02-27 14:16:46] [data] Setting vocabulary size for input 0 to 50,000
[2022-02-27 14:16:46] [data] Loading vocabulary from JSON/Yaml file data/train.bpe.en.json
[2022-02-27 14:16:47] [data] Using unused word id eos for 0
[2022-02-27 14:16:47] [data] Using unused word id UNK for 1
[2022-02-27 14:16:47] [data] Setting vocabulary size for input 1 to 50,000
[2022-02-27 14:16:47] [batching] Collecting statistics for batch fitting with step size 10
[2022-02-27 14:16:47] [memory] Extending reserved space to 3072 MB (device gpu0)
[2022-02-27 14:16:47] [comm] Using NCCL 2.8.3 for GPU communication
[2022-02-27 14:16:47] [comm] Using global sharding
[2022-02-27 14:16:47] [comm] NCCLCommunicators constructed successfully
[2022-02-27 14:16:47] [training] Using 1 GPUs
[2022-02-27 14:16:47] [logits] Applying loss function for 1 factor(s)
[2022-02-27 14:16:47] [memory] Reserving 422 MB, device gpu0
[2022-02-27 14:16:47] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2022-02-27 14:16:47] Error: Cublas Error: 13 - /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118: cublasGemmEx(handle, transa, transb, m, n, k, alpha, A, CUDA_R_32F, lda, B, CUDA_R_32F, ldb, beta, C, CUDA_R_32F, ldc, CUDA_R_32F, algorithm)
[2022-02-27 14:16:47] Error: Aborted from static void marian::gpu::TypedGemm<float, float>::gemm(cublasHandle_t, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) in /home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118

[CALL STACK]
[0x56408b3f5fd6] marian::gpu::TypedGemm<float,float>:: gemm (cublasContext*, marian::gpu::CudaCompute, cublasOperation_t, cublasOperation_t, int, int, int, float const*, float const*, int, float const*, int, float const*, float*, int) + 0x5a6
[0x56408b3f6e75] void marian::gpu:: ProdTyped <float,float>(IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase> const&, IntrusivePtr<marian::TensorBase> const&, bool, bool, float, float) + 0x8a5
[0x56408b3f1283] marian::gpu:: Prod (IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase> const&, IntrusivePtr<marian::TensorBase> const&, bool, bool, float, float, marian::Type) + 0x493
[0x56408b3f16de] marian::gpu:: Prod (IntrusivePtr<marian::TensorBase>, IntrusivePtr<marian::TensorBase> const&, IntrusivePtr<marian::TensorBase> const&, bool, bool, float, float) + 0x4e
[0x56408aeeede2] std::_Function_handler<void (),marian::DotNodeOp::forwardOps()::{lambda()#1}>:: _M_invoke (std::_Any_data const&) + 0x1f2
[0x56408af99e61] marian::Node:: forward () + 0x211
[0x56408ae8ebcb] marian::ExpressionGraph:: forward (std::__cxx11::list<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>,std::allocator<IntrusivePtr<marian::Chainable<IntrusivePtr<marian::TensorBase>>>>>&, bool) + 0x22b
[0x56408ae9074c] marian::ExpressionGraph:: forwardNext () + 0x23c
[0x56408b133388] marian::GraphGroup:: collectStats (std::shared_ptr<marian::ExpressionGraph>, std::shared_ptr<marian::models::ICriterionFunction>, std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&, double) + 0xcc8
[0x56408b119f38] marian::SyncGraphGroup:: collectStats (std::vector<std::shared_ptr<marian::Vocab>,std::allocator<std::shared_ptr<marian::Vocab>>> const&) + 0x138
[0x56408acbf0a7] marian::Train<marian::SyncGraphGroup>:: run () + 0x5c7
[0x56408ac08146] mainTrainer (int, char**) + 0x136
[0x56408abbd8c5] main + 0x35
[0x7f6d39389bf7] __libc_start_main + 0xe7
[0x56408ac066fa] _start + 0x2a

Aborted (core dumped)
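
Since the log is long, here is a small sketch for pulling just the cuBLAS failure out of it. It assumes the error line has the same shape as the one in this log (`Error: Cublas Error: <code> - <file>:<line>: <call>(...)`). The function name `find_cublas_errors` is hypothetical, and the sample line has its argument list shortened.

```python
import re

# Pattern based on the error line format seen in this log (an assumption
# about other Marian builds; adjust if your log looks different).
CUBLAS_ERROR = re.compile(
    r"Error: Cublas Error: (?P<code>\d+) - "
    r"(?P<file>\S+):(?P<line>\d+): (?P<call>\w+)\("
)


def find_cublas_errors(log_text):
    """Return (status code, source file, line number, failing call) tuples."""
    return [
        (int(m["code"]), m["file"], int(m["line"]), m["call"])
        for m in CUBLAS_ERROR.finditer(log_text)
    ]


# Error line from the log above, with the argument list shortened.
sample = (
    "[2022-02-27 14:16:47] Error: Cublas Error: 13 - "
    "/home/moses/moses/tj/marian/marian/src/tensors/gpu/prod.cpp:118: "
    "cublasGemmEx(handle, transa, transb, m, n, k, ...)"
)

for code, path, line, call in find_cublas_errors(sample):
    print(f"{call} failed with cuBLAS status {code} at {path}:{line}")
```

To run it on the attached log, something like `find_cublas_errors(open("model/train.log").read())` should work, given the `--log model/train.log` path from the command line above.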

Context

Marian version: v1.11.0
CMake command: cmake .. (attach the output of --build-info all)
Log file: Attach your training/decoding logs

train.log