NVIDIA/FasterTransformer

gptneox_example error


Branch/Tag/Commit

main

Docker Image Version

/docker/Dockerfile.torch

GPU name

A100

CUDA Driver

470.129.06

Reproduced Steps

./bin/gptneox_example
model: GPT-NeoX-20B
batch size = 8, input sequence length = 256, output sequence length = 512, fp16

Total ranks: 1.
Device NVIDIA A100-SXM4-80GB
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
start_id : /home/zjf/FasterTransformer/examples/cpp/gptneox/start_ids_8.csv
[WARNING] gemm_config.in is not found; using default GEMM algo
after allocation    : free: 40.15 GB, total: 79.35 GB, used: 39.20 GB
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /home/zjf/workspace/FasterTransformer/src/fastertransformer/utils/cublasMMWrapper.cc:115 

[ml-a100-ser160:3763638] *** Process received signal ***
[ml-a100-ser160:3763638] Signal: Aborted (6)
[ml-a100-ser160:3763638] Signal code:  (-6)
[ml-a100-ser160:3763638] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f6b32834420]
[ml-a100-ser160:3763638] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f6b3231d00b]
[ml-a100-ser160:3763638] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f6b322fc859]
[ml-a100-ser160:3763638] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f6b326d4911]
[ml-a100-ser160:3763638] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f6b326e038c]
[ml-a100-ser160:3763638] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f6b326e03f7]
[ml-a100-ser160:3763638] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f6b326e06a9]
[ml-a100-ser160:3763638] [ 7] ./bin/gptneox_example(+0x1f738)[0x55bcac460738]
[ml-a100-ser160:3763638] [ 8] ./bin/gptneox_example(+0x23d9a7)[0x55bcac67e9a7]
[ml-a100-ser160:3763638] [ 9] ./bin/gptneox_example(+0x88a95)[0x55bcac4c9a95]
[ml-a100-ser160:3763638] [10] ./bin/gptneox_example(+0x6431c)[0x55bcac4a531c]
[ml-a100-ser160:3763638] [11] ./bin/gptneox_example(+0x2b15f)[0x55bcac46c15f]
[ml-a100-ser160:3763638] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f6b322fe083]
[ml-a100-ser160:3763638] [13] ./bin/gptneox_example(+0x4cf8e)[0x55bcac48df8e]
[ml-a100-ser160:3763638] *** End of error message ***
Aborted

fusedQKV_masked_attention_dispatch generates NaN values when run in fp16.
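To confirm where NaN values first appear, one option is to copy an fp16 buffer back to the host, write it out as raw bytes, and scan it offline. The sketch below is only an illustration under assumptions: the helper name `count_non_finite_fp16` is hypothetical, it is not part of FasterTransformer, and it assumes the dump is a flat array of little-endian IEEE 754 half-precision values.

```python
import math
import struct
from typing import Tuple


def count_non_finite_fp16(raw: bytes) -> Tuple[int, int]:
    """Return (nan_count, inf_count) for a raw buffer of little-endian fp16 values.

    Hypothetical debugging helper; assumes the buffer was dumped from the
    device without any header or padding.
    """
    if len(raw) % 2 != 0:
        raise ValueError("fp16 buffer length must be a multiple of 2 bytes")

    nan_count = 0
    inf_count = 0
    # The "e" format code decodes one IEEE 754 binary16 value per 2 bytes.
    for (value,) in struct.iter_unpack("<e", raw):
        if math.isnan(value):
            nan_count += 1
        elif math.isinf(value):
            inf_count += 1
    return nan_count, inf_count


# Example buffer: 1.0 (0x3C00), NaN (0x7E00), +inf (0x7C00), little-endian.
sample = bytes.fromhex("003c" "007e" "007c")
nan_count, inf_count = count_non_finite_fp16(sample)
print(f"NaN: {nan_count}, inf: {inf_count}")
```

Scanning the attention output before and after the suspected kernel would show whether the NaN values originate there or are already present in its inputs.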