Error while fine-tuning model using example from "examples/finetune.ipynb": `RuntimeError: shape '[8, 129, 12, 512]' is invalid for input of size 8306688`
Hi! I've been experimenting with this model for a few things, and so far I like where it's going.
I want to attempt some fine-tuning, so I followed the same notebook found here: https://github.com/urchade/GLiNER/blob/main/examples/finetune.ipynb
However, when it comes to the training step, I am seeing the following error:
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/gliner/model.py", line 103, in forward
output = self.model(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/base.py", line 232, in forward
span_rep = self.span_rep_layer(words_embedding, span_idx)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/span_rep.py", line 356, in forward
return self.span_rep_layer(x, *args)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/gliner/modeling/span_rep.py", line 286, in forward
return self.out_project(cat).view(B, L, self.max_width, D)
RuntimeError: shape '[8, 129, 12, 512]' is invalid for input of size 8306688
I tried to debug it myself, but I am rather confused as to why this is happening. All previous steps are copied verbatim from the example finetune notebook.
Maybe some useful info:
torch 2.4.0+cu121
gliner 0.2.10
transformers 4.45.0.dev0
huggingface_hub 0.23.2
accelerate 0.34.0
Python 3.10.12
Any ideas? I can provide more information where needed.
Thanks!
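For context, the failing line is a plain tensor reshape, and `.view` raises exactly this error whenever the requested shape doesn't multiply out to the tensor's element count. A minimal sketch of just the failure mode (not GLiNER code):

```python
import torch

# .view only rearranges; the requested shape must multiply out to the
# tensor's element count. 8 * 129 * 12 * 512 = 6,340,608, but the
# tensor holds 8,306,688 elements, hence the error.
t = torch.zeros(8306688)
t.view(8, 129, 12, 512)
# RuntimeError: shape '[8, 129, 12, 512]' is invalid for input of size 8306688
```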
Just realized that the same error happens when trying to use the `train.py` included in the root of the project. Different dimensions/input size, but the same error.
Can you send the whole config.yaml file you are using?
I did not create a new configuration file or set a configuration path using `--config` when I attempted it via `train.py`, so it defaulted to `configs/config.yaml` as per https://github.com/urchade/GLiNER/blob/7495d0a9be807504a8da33b059747e6fd66c331e/train.py#L19C57-L19C76
After I downloaded the training data (data.json) from https://huggingface.co/datasets/urchade/pile-mistral-v0.1, I put the file into the GLiNER checkout directory. Then I ran `python3 train.py`, keeping everything defaulted.
If you need anything else, let me know!
I had a similar error when exploring the model on my own data. The error indicates a mismatch between sequence lengths: the `.view` expects 129, but the tensor was built for 169. Check that the labels are preprocessed (padded) correctly to match the input sequence length. Basically, batch_size × sequence_length × max_width × hidden_size, i.e. 8 × 129 × 12 × 512 from the shape `[8, 129, 12, 512]`, should equal the final number 8306688.
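To make the arithmetic concrete, you can back out the sequence length that's actually in the tensor (assuming the batch, width, and hidden dimensions are correct):

```python
B, L_expected, W, D = 8, 129, 12, 512  # batch, seq len, max_width, hidden
numel = 8306688                        # element count from the error

print(B * L_expected * W * D)  # 6340608 -> what .view asks for
print(numel // (B * W * D))    # 169     -> sequence length actually present
```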
@joywang233 good to know, thanks :) That will come in handy if I need to start training on my own data. It's just confusing to me because I'm using the exact same training data as gliner 2.1 -- https://huggingface.co/datasets/urchade/pile-mistral-v0.1 -- and still seeing the issue. I'd assume the v2.1 model that was published to HuggingFace was trained on that data using this same Python script (or the same Jupyter notebook).
@xingchaozh I have not. I will try again with the latest on Monday. Thanks for the heads up.
@xingchaozh So I just tried it using the updated codebase, and I am still having the same issue. Can I ask what dataset you are using, and if it differs from what gliner was originally trained on? Some output:
$ git log -1
commit 65d58a0ae170e8eb31c13d6fedea186e32ef5b96 (HEAD -> main, origin/main, origin/HEAD)
Author: Urchade Zaratiana <38214774+urchade@users.noreply.github.com>
Date: Sun Sep 8 05:38:18 2024 +0200
Update README.md
$ cat gliner/__init__.py
__version__ = "0.2.11"
from .model import GLiNER
from .config import GLiNERConfig
__all__ = ["GLiNER"]
$ python3 ./train.py
Dataset size: 19724
Dataset is shuffled...
Dataset is splitted...
/home/user/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
/home/user/.local/lib/python3.10/site-packages/transformers/convert_slow_tokenizer.py:551: UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.
warnings.warn(
Initializing cross fuser...
Post fusion layer: l2l-l2t-t2t
Number of post fusion layers: 3
/home/user/.local/lib/python3.10/site-packages/transformers/training_args.py:1539: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
0%| | 0/100000 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[...the same truncation warning repeated 7 more times...]
/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
Original Traceback (most recent call last):
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/projects/GLiNER/gliner/model.py", line 103, in forward
output = self.model(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/projects/GLiNER/gliner/modeling/base.py", line 238, in forward
span_rep = self.span_rep_layer(words_embedding, span_idx)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/projects/GLiNER/gliner/modeling/span_rep.py", line 356, in forward
return self.span_rep_layer(x, *args)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/projects/GLiNER/gliner/modeling/span_rep.py", line 286, in forward
return self.out_project(cat).view(B, L, self.max_width, D)
RuntimeError: shape '[8, 264, 12, 768]' is invalid for input of size 19759104
$ sha256sum data.json
00d05d679a63d83b72042b599f59ff22e547884c18b4405bc347ba65accd6e15 data.json
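(Applying the same arithmetic as above to this run: 19759104 / (8 × 12 × 768) = 268, so the tensor again holds a longer sequence, 268, than the 264 the `.view` expects.)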
@BradyBonnette I changed nothing: neither the codebase (the latest, as you show) nor the dataset (pile-mistral-v0.1).
I was training the model on an M1 MacBook with the versions below:
transformers 4.40.2
torch 2.3.0
Well now I am even more confused 🙃 What hardware are you running on?
MacBook Pro with M1 Max
@xingchaozh ah geez, sorry, you mentioned that. My fault!
Did you run `train.py` in the same manner I did? I saw that you didn't change any of the codebase, but how did you run `train.py`?
Strangely, when I tried running it on another machine, it succeeded. The only difference between the two machines is that on the failing machine I used the mirror hf-mirror.com, because I couldn't directly access Hugging Face.
@Ask-sola which dataset are you using?
I am using the dataset provided in `finetune.ipynb`:
! wget https://hf-mirror.com/datasets/urchade/synthetic-pii-ner-mistral-v1/resolve/main/data.json
What's even stranger is that even when I load the model with the following call, using a checkpoint from a run that executed successfully, I still cannot train on the new machine:
model = GLiNER.from_pretrained("models/checkpoint-100", load_tokenizer=True)
The same dataset, the same code, and the same Python version yield different results when executed on different machines. The only difference might be that the machine where it fails is using a mirror for Hugging Face, but theoretically, that shouldn't have any impact.
> yield different results when executed on different machines
What are the hardware differences between the two machines? Are they also using the same exact versions of torch/transformers/etc?
The machine that fails is a server with four RTX 4090s, while the one that succeeds has a single 4090. I'm not sure if the hardware difference between them is causing the failure. The Python and Torch versions are exactly the same; in fact, I configured both servers from scratch in the same way. Strangely, although the four-GPU machine cannot run training, it can run inference, which makes me lean towards there being an issue with the training code.
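One plausible mechanism (a guess on my part, not verified against GLiNER's internals): every traceback above goes through `torch/nn/parallel/parallel_apply.py`, i.e. `nn.DataParallel`, which the Hugging Face `Trainer` wraps the model in whenever more than one GPU is visible. `DataParallel` scatters tensor arguments along dim 0 but passes non-tensor arguments through unchanged, so any size that was computed from the full batch goes stale inside each replica. A toy illustration (hypothetical `Toy` module, requires at least 2 CUDA devices):

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, x, full_batch_len):
        # DataParallel scatters x along dim 0 (4 rows per replica on
        # 2 GPUs), but the plain Python int is replicated as-is and
        # still describes the full batch of 8.
        return x.view(full_batch_len, 16)

model = nn.DataParallel(Toy().cuda())
x = torch.randn(8, 16).cuda()
model(x, full_batch_len=8)
# RuntimeError: shape '[8, 16]' is invalid for input of size 64
# (wrapped as "Caught RuntimeError in replica 0 on device 0",
#  just like the error at the top of this issue)
```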
I think you might be onto something. The machine I was trying to run this on has two GPUs (two NVIDIA RTX A6000s), and I never considered that the problem could stem from a multi-GPU setup.
I put together an extremely simple Dockerfile to test with; I went with Docker because it lets me isolate and use exactly one GPU in the container at runtime. It's crude, but it gets the job done:
FROM nvidia/cuda:12.5.0-runtime-ubuntu22.04
SHELL ["bash", "-l", "-c"]
RUN apt update && apt install -y curl
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
COPY . /src
WORKDIR /src
RUN uv sync
When I build the container and run it with `docker run --shm-size=512m --gpus=all -it --rm gliner` followed by `uv run train.py` in the container, I see:
[...omitted...]
raise ValueError(f"Target size ({target.size()}) must be the same as input size ({input.size()})")
ValueError: Target size (torch.Size([61664, 21])) must be the same as input size (torch.Size([31584, 21]))
[...omitted...]
However, when I run the same exact container with `docker run --gpus='"device=0"' -it --rm gliner` followed by `uv run train.py`, I see:
[...omitted...]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
0%|▍ | 165/100000 [00:18<3:12:43, 8.63it/s]
Note that I had to do some things to get `uv` to work properly, but in the end I think this demonstrates that something weird is going on with multi-GPU setups.
EDIT: For what it's worth, I let the training cycle go for as long as it could. It completed 100%, and the checkpoint model was valid.
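For anyone else who lands here on a multi-GPU box: a lighter-weight check than Docker should be to hide all but one GPU via the environment, e.g. `CUDA_VISIBLE_DEVICES=0 python3 train.py` (set before the process starts, so torch only ever sees one device). That exercises the same single-GPU path as the `--gpus='"device=0"'` container run above.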