EleutherAI/gpt-neox

ImportError: /media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE

Drzhivago264 opened this issue · 1 comment

Describe the bug
Cannot train with flash attention; training works when global attention is used instead.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the host setup in this repository.
  2. Run the training script with the config below (a sketch of the launch command follows the config):
{
  "pipe_parallel_size": 0,
  "model_parallel_size": 4,

  "num_layers": 32,
  "hidden_size": 2560,
  "num_attention_heads": 32,
  "seq_length": 2048,
  "max_position_embeddings": 2048,
  "pos_emb": "rotary",
  "rotary_pct": 0.25,
  "no_weight_tying": true,
  "gpt_j_residual": true,
  "output_layer_parallelism": "column",

  "attention_config": [[["flash"], 32]],

  "scaled_upper_triang_masked_softmax_fusion": true,
  "bias_gelu_fusion": true,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "CPU_Adam",
    "params": {
      "lr": 0.00016,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 1.6e-05,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true
  },

  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 2,
  "data_impl": "mmap",
  "num_workers": 1,

  "checkpoint_activations": true,
  "checkpoint_num_layers": 1,
  "partition_activations": true,
  "synchronize_each_layer": true,

  "gradient_clipping": 1.0,
  "weight_decay": 0.1,
  "hidden_dropout": 0,
  "attention_dropout": 0,

  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train_iters": 143000,
  "lr_decay_iters": 143000,
  "distributed_backend": "nccl",
  "lr_decay_style": "cosine",
  "warmup": 0.01,
  "checkpoint_factor": 1000,
  "extra_save_iters": [64, 128, 256, 512],
  "eval_interval": 40000,
  "eval_iters": 10,

  "log_grad_norm": true,

  "log_interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,

  "tokenizer_type": "HFTokenizer"
}
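
For reference, the run was presumably launched the standard gpt-neox way, along the lines of the command below; the config path is hypothetical, only the config contents above are from the actual run.

python ./deepy.py train.py /path/to/config_above.yml  # config path is hypothetical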

Expected behavior
Training starts and proceeds with flash attention enabled.

Actual behavior (launcher log and traceback)
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:09:59,189] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-11-14 16:09:59,189] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-11-14 16:09:59,189] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-11-14 16:09:59,189] [INFO] [launch.py:163:main] dist_world_size=4
[2023-11-14 16:09:59,189] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
[2023-11-14 16:10:05,177] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,177] [INFO] [comm.py:594:init_distributed] cdb=None
NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 4

building HFTokenizer tokenizer ...
[2023-11-14 16:10:05,209] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,209] [INFO] [comm.py:594:init_distributed] cdb=None
padded vocab (size: 50277) with 411 dummy tokens (new size: 50688)
setting tensorboard ...
initializing torch distributed ...
[2023-11-14 16:10:05,376] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,376] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-11-14 16:10:05,376] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-11-14 16:10:05,405] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-11-14 16:10:05,405] [INFO] [comm.py:594:init_distributed] cdb=None
initializing model parallel with size 4
MPU DP: [0]
MPU DP: [1]
MPU DP: [2]
MPU DP: [3]
MPU PP: [0]
MPU PP: [1]
MPU PP: [2]
MPU PP: [3]
MPU MP: [0, 1, 2, 3]
setting random seeds to 1234 ...
[2023-11-14 16:10:06,287] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/media/h/nvme/gpt-neox/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/media/h/nvme/gpt-neox/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=0, model=1): 1, ProcessCoord(pipe=0, data=0, model=2): 2, ProcessCoord(pipe=0, data=0, model=3): 3}
[2023-11-14 16:10:06,697] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=37
0: EmbeddingPipe
1: _pre_transformer_block
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: ParallelTransformerLayerPipe
34: _post_transformer_block
35: NormPipe
36: ParallelLinearPipe
loss: partial
Traceback (most recent call last):
File "train.py", line 27, in <module>
pretrain(neox_args=neox_args)
File "/media/h/nvme/gpt-neox/megatron/training.py", line 192, in pretrain
model, optimizer, lr_scheduler = setup_model_and_optimizer(
File "/media/h/nvme/gpt-neox/megatron/training.py", line 633, in setup_model_and_optimizer
model = get_model(neox_args=neox_args, use_cache=use_cache)
File "/media/h/nvme/gpt-neox/megatron/training.py", line 407, in get_model
model = GPT2ModelPipe(
File "/media/h/nvme/gpt-neox/megatron/model/gpt2_model.py", line 127, in __init__
super().__init__(
File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 199, in __init__
self._build()
File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 246, in _build
module = layer.build()
File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 73, in build
return self.typename(*self.module_args, **self.module_kwargs)
File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 759, in __init__
self.attention = ParallelSelfAttention(
File "/media/h/nvme/gpt-neox/megatron/model/transformer.py", line 351, in __init__
from megatron.model.flash_attention import (
File "/media/h/nvme/gpt-neox/megatron/model/flash_attention.py", line 7, in <module>
from flash_attn import flash_attn_triton
File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/__init__.py", line 3, in <module>
from flash_attn.flash_attn_interface import (
File "/media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn/flash_attn_interface.py", line 8, in <module>
import flash_attn_2_cuda as flash_attn_cuda
ImportError: /media/h/nvme/gpt-neox/.venv/lib/python3.8/site-packages/flash_attn_2_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
(The same traceback and ImportError are printed by each of the other three ranks.)
[2023-11-14 16:10:09,253] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345637
[2023-11-14 16:10:09,275] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345638
[2023-11-14 16:10:09,276] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345639
[2023-11-14 16:10:09,296] [INFO] [launch.py:314:sigkill_handler] Killing subprocess 345640

Proposed solution
I don't know what to do. I suspect a mismatch between the flash-attention version and the PyTorch version, or between flash-attention and the system CUDA toolkit.
I have tested gpt-neox v1 and v2 in combination with CUDA 11.8 and 12.3.
It has been suggested that flash-attention v2 does not support Turing GPUs, but I hit the same problem with flash-attention v1.
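
For context, the missing symbol _ZN3c104cuda20CUDACachingAllocator9allocatorE demangles to c10::cuda::CUDACachingAllocator::allocator, which supports the mismatch theory: the flash_attn_2_cuda extension appears to have been built against a newer libtorch than the torch 1.13.1 in this venv (for example because pip's isolated build environment pulled in a different torch, or because a prebuilt wheel targeting torch 2.x was installed). A minimal diagnostic sketch, independent of gpt-neox, to confirm this before rebuilding:

import torch

# Versions the compiled flash-attn extension must agree with.
print("torch:", torch.__version__)              # 1.13.1 in the environment below
print("torch built for CUDA:", torch.version.cuda)

try:
    # The compiled extension whose import fails in the traceback above.
    import flash_attn_2_cuda  # noqa: F401
    print("flash_attn_2_cuda imports cleanly")
except ImportError as err:
    # An "undefined symbol" error here means the extension was compiled against
    # a different libtorch than the one currently installed (an ABI mismatch),
    # not that the GPU or the CUDA toolkit itself is broken.
    print("flash-attn / torch mismatch:", err)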

Environment (please complete the following information):

  • GPUs: 3 × RTX 2080 Ti and 1 × RTX 3060.
  • Configs: attached above
  • Package versions:

absl-py 2.0.0
aiohttp 3.8.6
aiosignal 1.3.1
anyio 3.7.1
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
autopep8 2.0.4
best-download 0.0.9
boto3 1.28.84
botocore 1.31.84
cachetools 5.3.2
certifi 2023.7.22
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
clang-format 17.0.4
click 8.1.7
cmake 3.27.7
colorama 0.4.6
coverage 7.3.2
cupy-cuda111 12.2.0
DataProperty 1.0.1
datasets 2.14.6
deepspeed 0.9.3+a48c649
dill 0.3.7
distlib 0.3.7
distro 1.8.0
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.1.3
execnet 2.0.2
fastrlock 0.8.2
filelock 3.13.1
flash-attn 2.2.1
frozenlist 1.4.0
fsspec 2023.10.0
ftfy 6.1.1
fused-kernels 0.0.1
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.4
google-auth-oauthlib 1.0.0
grpcio 1.59.2
h11 0.14.0
hf_transfer 0.1.4
hjson 3.1.0
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.19.0
identify 2.5.31
idna 3.4
importlib-metadata 6.8.0
iniconfig 2.0.0
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonlines 4.0.0
lm-dataformat 0.0.20
lm-eval 0.3.0
Markdown 3.5.1
MarkupSafe 2.1.3
mbstrdecoder 1.1.3
mpi4py 3.1.5
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
networkx 3.1
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
numexpr 2.8.6
numpy 1.24.4
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
openai 1.2.3
packaging 23.2
pandas 2.0.3
pathvalidate 3.2.0
pip 23.3.1
platformdirs 3.11.0
pluggy 1.3.0
portalocker 2.8.2
pre-commit 3.5.0
protobuf 4.25.0
psutil 5.9.6
py 1.11.0
py-cpuinfo 9.0.0
pyarrow 14.0.1
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycodestyle 2.11.1
pycountry 22.3.5
pydantic 1.10.13
pytablewriter 1.2.0
pytest 7.4.3
pytest-cov 4.1.0
pytest-forked 1.6.0
pytest-xdist 3.4.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
regex 2023.10.3
rehash 1.0.1
requests 2.31.0
requests-oauthlib 1.3.1
rouge-score 0.1.2
rsa 4.9
s3transfer 0.7.0
sacrebleu 1.5.0
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.10.1
sentencepiece 0.1.99
sentry-sdk 1.34.0
setproctitle 1.3.3
setuptools 56.0.0
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sqlitedict 2.1.0
sympy 1.12
tabledata 1.3.3
tcolorpy 0.1.4
tensorboard 2.13.0
tensorboard-data-server 0.7.2
threadpoolctl 3.2.0
tiktoken 0.5.1
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1
tqdm 4.66.1
tqdm-multiprocess 0.0.11
transformers 4.30.2
triton 2.0.0.dev20221202
typepy 1.3.2
typing_extensions 4.8.0
tzdata 2023.3
ujson 5.8.0
urllib3 1.26.18
virtualenv 20.24.6
wandb 0.16.0
wcwidth 0.2.9
Werkzeug 3.0.1
wheel 0.41.3
xxhash 3.4.1
yarl 1.9.2
zipp 3.17.0
zstandard 0.22.0

  • Cuda: nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0

You can try flash-attn 2.3.0.
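
If upgrading, it is probably also worth forcing the extension to be rebuilt against the torch already installed in the venv rather than whatever pip's isolated build environment provides. A hedged sketch of the check to run afterwards (the pip flags in the comments are the usual flash-attn install options, nothing gpt-neox-specific):

# Rebuild flash-attn against the torch already installed in this venv, e.g.:
#   pip uninstall -y flash-attn
#   pip install flash-attn==2.3.0 --no-build-isolation
# and then re-run a quick check like this:
import torch

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)

# This import pulls in the compiled flash_attn_2_cuda extension, i.e. exactly
# the step that raised the undefined-symbol ImportError above; if the rebuild
# matched the installed libtorch, it should now succeed.
import flash_attn

print("flash-attn:", flash_attn.__version__)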