urchade/GLiNER

Error while fine-tuning model using example from "examples/finetune.ipynb": `RuntimeError: shape '[8, 129, 12, 512]' is invalid for input of size 8306688`

Hi! I've been experimenting with this model for a few things, and so far I like where it's going.

I want to attempt some fine-tuning, so I followed the same notebook found here: https://github.com/urchade/GLiNER/blob/main/examples/finetune.ipynb

However, when it comes to the training step, I am seeing the following error:

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/gliner/model.py", line 103, in forward
    output = self.model(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/base.py", line 232, in forward
    span_rep = self.span_rep_layer(words_embedding, span_idx)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/span_rep.py", line 356, in forward
    return self.span_rep_layer(x, *args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/span_rep.py", line 286, in forward
    return self.out_project(cat).view(B, L, self.max_width, D)
RuntimeError: shape '[8, 129, 12, 512]' is invalid for input of size 8306688

I tried to debug it myself, but I am rather confused as to why this is happening. All of the preceding steps were copied verbatim from the example finetune notebook.

Maybe some useful info:

torch 2.4.0+cu121
gliner 0.2.10
transformers 4.45.0.dev0
huggingface_hub 0.23.2
accelerate 0.34.0
Python 3.10.12

Any ideas? I can provide more information where needed.

Thanks!

Just realized that the same error happens when trying to use train.py included in the root of the project. Different dimensions/input size, but the same error.

Can you send the whole config.yaml file you are using?

@Ingvarstep

I did not create a new configuration file or pass a configuration path with --config when I ran train.py, so it defaulted to configs/config.yaml as per https://github.com/urchade/GLiNER/blob/7495d0a9be807504a8da33b059747e6fd66c331e/train.py#L19C57-L19C76

After I downloaded the training data (data.json) from https://huggingface.co/datasets/urchade/pile-mistral-v0.1, I put the file into the GLiNER checkout directory. Then I ran python3 train.py with everything left at its defaults.
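
For anyone trying to reproduce this, the whole run boils down to roughly the following (the exact data.json download URL is my assumption based on the dataset page; everything else is as described above):

git clone https://github.com/urchade/GLiNER.git
cd GLiNER
# download URL assumed; the file is the data.json from the urchade/pile-mistral-v0.1 dataset
wget https://huggingface.co/datasets/urchade/pile-mistral-v0.1/resolve/main/data.json
# no --config passed, so train.py falls back to configs/config.yaml
python3 train.py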

If you need anything else let me know!

I had a similar error when exploring the model on my own data. The error indicates a mismatch between sequence lengths: 129 vs. 169. Check whether the labels are preprocessed (padded) correctly to match the input sequence length. Basically, batch_size * sequence_length * max_width * hidden_size ([8, 129, 12, 512]) should match the final number, 8306688.
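
For example, plugging the numbers from this traceback into that formula:

$ python3 -c "print(8*129*12*512, 8306688 // (8*12*512))"
6340608 169

So the flattened tensor was built for a sequence length of 169, while the .view() expects 129.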

@joywang233 good to know, thanks :) That will come in handy if I need to start training on my own data. It's just confusing to me because I'm using the exact same training data as gliner v2.1 -- https://huggingface.co/datasets/urchade/pile-mistral-v0.1 -- and still seeing the issue. I'd assume the v2.1 model published to Hugging Face was trained on that data using this same Python script (or the same Jupyter notebook).

Have you tried the latest code base? I don't see your issue.
[screenshot]

@xingchaozh I have not. I will try again with the latest on Monday. Thanks for the heads up.

@xingchaozh So I just tried it using the updated codebase, and I am still having the same issue. Can I ask what dataset you are using, and if it differs from what gliner was originally trained on? Some output:

$ git log -1
commit 65d58a0ae170e8eb31c13d6fedea186e32ef5b96 (HEAD -> main, origin/main, origin/HEAD)
Author: Urchade Zaratiana <38214774+urchade@users.noreply.github.com>
Date:   Sun Sep 8 05:38:18 2024 +0200

    Update README.md
$ cat gliner/__init__.py
__version__ = "0.2.11"

from .model import GLiNER
from .config import GLiNERConfig

__all__ = ["GLiNER"]
$ python3 ./train.py
Dataset size: 19724
Dataset is shuffled...
Dataset is splitted...
/home/user/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
/home/user/.local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:551: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
  warnings.warn(
Initializing cross fuser...
Post fusion layer: l2l-l2t-t2t
Number of post fusion layers: 3
/home/user/.local/lib/python3.10/site-packages/transformers/training_args.py:1539: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
  0%|                                                                                                                                                                                                                                                                                                                                                           | 0/100000 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):

Original Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/projects/GLiNER/gliner/model.py", line 103, in forward
    output = self.model(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/projects/GLiNER/gliner/modeling/base.py", line 238, in forward
    span_rep = self.span_rep_layer(words_embedding, span_idx)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/projects/GLiNER/gliner/modeling/span_rep.py", line 356, in forward
    return self.span_rep_layer(x, *args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/projects/GLiNER/gliner/modeling/span_rep.py", line 286, in forward
    return self.out_project(cat).view(B, L, self.max_width, D)
RuntimeError: shape '[8, 264, 12, 768]' is invalid for input of size 19759104
$ sha256sum data.json
00d05d679a63d83b72042b599f59ff22e547884c18b4405bc347ba65accd6e15  data.json

[screenshot]

@BradyBonnette I changed nothing, neither the code base (latest, as you showed) nor the dataset (pile-mistral-v0.1).
I was training the model on an M1 MacBook with the versions below:
transformers 4.40.2
torch 2.3.0

Well now I am even more confused 🙃 What hardware are you running on?

MacBook Pro with M1 Max

Well now I am even more confused 🙃 What hardware are you running on?

@xingchaozh ah geez, sorry, you mentioned that. My fault!

Did you run train.py in the same manner I did? I saw that you didn't change any of the codebase, but how did you run train.py?

Hello, I encountered exactly the same issue as the OP. I am running the script from examples/finetune.ipynb without any modifications, and I am getting the following error:
[error screenshot]

Besides this issue, there seems to be another problem:
[screenshot]

Strangely, when I tried to run it on another machine, it succeeded. The only difference between the two machines is that on the failing one I used the mirror hf-mirror.com, because I couldn't access Hugging Face directly.

@Ask-sola which dataset are you using?

Which dataset are you using?
I am using the dataset provided in finetune.ipynb:
! wget https://hf-mirror.com/datasets/urchade/synthetic-pii-ner-mistral-v1/resolve/main/data.json
What's even stranger is that even when I load the model as follows, with the checkpoint coming from a run that completed successfully, I still cannot train successfully on the new machine:

model = GLiNER.from_pretrained("models/checkpoint-100", load_tokenizer=True)

The same dataset, the same code, and the same Python version yield different results when executed on different machines. The only difference might be that the machine where it fails is using a mirror for Hugging Face, but theoretically, that shouldn't have any impact.

@Ask-sola

yield different results when executed on different machines

What are the hardware differences between the two machines? Are they also using the same exact versions of torch/transformers/etc?

The machine that cannot run it is a 4-card 4090 server, while the one that runs it successfully is a single-card 4090 server. I'm not sure whether the hardware differences between them are causing the failure. The Python and Torch versions are exactly the same; in fact, I configured both servers from scratch in the same way. Strangely, although the 4-card machine cannot perform training, it can run testing, which makes me lean towards there being an issue with the training code.

@Ask-sola

I think you might be onto something. The machine I was trying to run this on is dual-GPU (two NVIDIA RTX A6000s), and I never considered that the problem could be coming from a multi-GPU setup.

I put together an extremely simple Dockerfile to test with; I went with Docker because it lets me isolate and use exactly one GPU in the container at runtime. It's a crude Dockerfile, but it gets the job done:

FROM nvidia/cuda:12.5.0-runtime-ubuntu22.04

SHELL ["bash", "-l", "-c"]

RUN apt update && apt install -y curl

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

COPY . /src

WORKDIR /src

RUN uv sync

When I build the container, run it with docker run --shm-size=512m --gpus=all -it --rm gliner, and then run uv run train.py inside the container, I see:

   [...omitted...]
    raise ValueError(f"Target size ({target.size()}) must be the same as input size ({input.size()})")
ValueError: Target size (torch.Size([61664, 21])) must be the same as input size (torch.Size([31584, 21]))
   [...omitted...]

However, when I run the exact same container with docker run --gpus='"device=0"' -it --rm gliner and then run uv run train.py, I see:

[...omitted...]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  0%|▍   | 165/100000 [00:18<3:12:43,  8.63it/s]

[screenshot]

Note that I had to do some things to get uv to work properly, but in the end I think it demonstrates that there's something weird going on with multi-GPU setups.
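
If anyone wants to sanity-check this without Docker, restricting the process to a single GPU from the shell should, I think, exercise the same code path as the --gpus='"device=0"' run (this is just the standard CUDA environment variable, nothing GLiNER-specific):

CUDA_VISIBLE_DEVICES=0 python3 train.py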

EDIT: For what it's worth, I let the training cycle go for as long as it could. It completed 100%, and the checkpoint model was valid.