Ucas-HaoranWei/Vary-tiny-600k

inference error

white2018 opened this issue · 6 comments

@Ucas-HaoranWei Nice work! I ran into an inference error as follows. How can I fix it? Thanks.

Running script:
python tests/models/test_varytiny.py --image-file ../demo.jpg

The output shows this error:
OCR:
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [5,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1237: indexSelectSmallIndex: block: [2,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
File "/root/Vary/LAVIS-main/tests/models/test_varytiny.py", line 49, in
test_vary_opt125m(args.image_file)
File "/root/Vary/LAVIS-main/tests/models/test_varytiny.py", line 37, in test_vary_opt125m
captions = model.generate({"image": image, "prompt": question}, num_captions=1)
File "/usr/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/Vary/LAVIS-main/lavis/models/varytiny_models/vary_opt.py", line 231, in generate
outputs = self.opt_model.generate(
File "/usr/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/transformers/generation/utils.py", line 1675, in generate
return self.beam_search(
File "/usr/lib/python3.10/site-packages/transformers/generation/utils.py", line 3014, in beam_search
outputs = self(
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
outputs = self.model.decoder(
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 710, in forward
layer_outputs = decoder_layer(
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 330, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 225, in forward
attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
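
The message itself points at two useful handles: the indexSelectSmallIndex assertion means some index, such as a token id or a position id, is not smaller than the table it selects from, and CUDA_LAUNCH_BLOCKING=1 makes the traceback stop at the real failing call. A minimal sketch of how to use both (the model and input names are illustrative, not taken from the Vary code):

```python
# Sketch only: CUDA_LAUNCH_BLOCKING must be set before torch initializes CUDA,
# so kernels run synchronously and the traceback points at the failing op.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the environment variable


def check_ids(model, input_ids):
    # Illustrative sanity checks: every token id must be < the embedding table
    # size, and the sequence must fit within the position-embedding table.
    emb = model.get_input_embeddings()
    print("max token id:", int(input_ids.max()), "/ vocab size:", emb.num_embeddings)
    print("sequence length:", input_ids.shape[-1],
          "/ max positions:", model.config.max_position_embeddings)
```

Exporting the variable in the shell before launching the test script works the same way.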

  1. Have you changed the max-length of opt-125m?
  2. What is your transformers version and environment?

  1. Have you changed the max-length of opt-125m?
    Not yet.
  2. What is your transformers version and environment?
    Package Version

accelerate 0.24.1
albumentations 1.4.0
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
anyio 4.3.0
appdirs 1.4.4
arxiv 2.1.0
bitsandbytes 0.41.0
blinker 1.7.0
braceexpand 0.1.7
certifi 2024.2.2
charset-normalizer 3.3.2
click 8.1.7
contourpy 1.2.1
cycler 0.12.1
decord 0.6.0
deepspeed 0.12.3
docker-pycreds 0.4.0
e 1.4.5
easydict 1.13
einops 0.6.1
einops-exts 0.0.4
et-xmlfile 1.1.0
exceptiongroup 1.2.0
feedparser 6.0.10
filelock 3.13.3
flash-attn 2.5.6
Flask 3.0.3
fonttools 4.51.0
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.42
gradio_client 0.2.9
h11 0.14.0
hjson 3.1.0
httpcore 0.17.3
httpx 0.24.0
huggingface-hub 0.22.1
idna 3.6
imageio 2.34.0
iopath 0.1.10
itsdangerous 2.1.2
Jinja2 3.1.3
joblib 1.3.2
kiwisolver 1.4.5
lazy_loader 0.3
lxml 5.2.1
markdown2 2.4.13
MarkupSafe 2.1.5
matplotlib 3.8.4
mpmath 1.3.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
opencv-python 4.9.0.80
opencv-python-headless 4.9.0.80
openpyxl 3.1.2
packaging 24.0
pandas 2.2.2
pdf2image 1.17.0
peft 0.4.0
pillow 10.2.0
pip 24.0
portalocker 2.8.2
prettytable 3.10.0
progressbar 2.5
protobuf 4.25.3
psutil 5.9.8
py-cpuinfo 9.0.0
pycocoevalcap 1.2
pycocotools 2.0.7
pydantic 2.6.4
pydantic_core 2.16.3
Pygments 2.17.2
PyMuPDF 1.24.1
PyMuPDFb 1.24.1
pynvml 11.5.0
pyparsing 3.1.2
PyPDF2 3.0.1
python-dateutil 2.9.0.post0
python-docx 1.1.0
pytz 2024.1
PyYAML 6.0.1
qudida 0.0.4
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
scikit-image 0.22.0
scikit-learn 1.2.2
scipy 1.12.0
sentencepiece 0.1.99
sentry-sdk 1.44.0
setproctitle 1.3.3
setuptools 69.2.0
sgmllib3k 1.0.0
shortuuid 1.0.13
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
svgwrite 1.4.3
sympy 1.12
threadpoolctl 3.4.0
tifffile 2024.2.12
tiktoken 0.6.0
timm 0.6.13
tokenizers 0.13.3
torch 2.2.2
torchvision 0.17.2
tqdm 4.66.2
transformers 4.32.1
triton 2.2.0
typing_extensions 4.10.0
tzdata 2024.1
urllib3 2.2.1
wandb 0.16.5
wavedrom 2.0.3.post3
wcwidth 0.2.13
webdataset 0.2.86
websockets 12.0
Werkzeug 3.0.2
wheel 0.43.0

Change the max_length at line 236 of vary_opt.py to 1600. Your test sample may contain too much text.
Can you share your demo.jpg?
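
For reference, here is a minimal self-contained sketch using the plain Hugging Face API, not the Vary wrapper, just to show where the max_length knob plugs in; in vary_opt.py the same keyword presumably goes to the self.opt_model.generate(...) call seen in the traceback.

```python
# Sketch only: stock opt-125m via transformers, illustrating the max_length setting.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tok("OCR:", return_tensors="pt")
out = model.generate(**inputs, num_beams=3, max_length=1600)  # room for long OCR transcriptions
print(tok.decode(out[0], skip_special_tokens=True))
```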

Setting it to 1600 fixes the error; it runs fine now. However, the result is not as good as your previous Vary project. Please check it out.
[screenshot attached: layout1]

The result is not as good because Vary-tiny-600k.pth was trained on only a 600k-sample dataset.

Thanks for your reply