LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb fails at training

Question


Opened this issue 3 months ago · 0 comments

ValueError Traceback (most recent call last)
in <cell line: 19>()
17 )
18
---> 19 trainer.fit(model_module)

24 frames
/usr/local/lib/python3.10/dist-packages/transformers/models/llava_next/modeling_llava_next.py in _merge_input_ids_with_image_features(self, image_features, feature_lens, inputs_embeds, input_ids, attention_mask, position_ids, labels, image_token_index, ignore_index)
541 total_num_special_image_tokens = torch.sum(special_image_token_mask)
542 if total_num_special_image_tokens != num_images:
--> 543 raise ValueError(
544 f"Number of image tokens in input_ids ({total_num_special_image_tokens}) different from num_images ({num_images})."
545 )

ValueError: Number of image tokens in input_ids (0) different from num_images (1).
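
The check that raises here compares the number of image-token ids in `input_ids` with the number of images in the batch. A minimal sketch of running the same comparison on a collated batch before calling `trainer.fit` might look like the following. The helper names are hypothetical, the batch is assumed to be available as plain lists of token ids, and the id `32000` is only a placeholder; the real value should be read from `model.config.image_token_index`.

```python
def count_image_tokens(input_ids, image_token_index):
    """Count how often the image token id occurs in a batch of token id lists."""
    return sum(row.count(image_token_index) for row in input_ids)


def check_batch(input_ids, num_images, image_token_index):
    """Mirror the comparison from the traceback (hypothetical helper, not a library API)."""
    found = count_image_tokens(input_ids, image_token_index)
    if found != num_images:
        raise ValueError(
            f"Number of image tokens in input_ids ({found}) "
            f"different from num_images ({num_images})."
        )
    return found


# Placeholder id for illustration only; read the real one from the model config.
IMAGE_TOKEN_INDEX = 32000

# One sequence that contains the image token, one that does not.
good_batch = [[1, IMAGE_TOKEN_INDEX, 733, 16289]]
bad_batch = [[1, 733, 16289]]

check_batch(good_batch, num_images=1, image_token_index=IMAGE_TOKEN_INDEX)
try:
    check_batch(bad_batch, num_images=1, image_token_index=IMAGE_TOKEN_INDEX)
except ValueError as err:
    print(err)
```

With tensors, calling `.tolist()` on the batch's `input_ids` first would give the list form this sketch expects.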

This error appears only after fixing an earlier error concerning the chat_template. The fix was to define the following template in the collate functions and pass it to apply_chat_template:
chat_template = (
"{% if messages[0]['role'] == 'instruction' %}"
"Instruction: {{- messages[0]['content'] }}\n"
"{% set messages = messages[1:] %}"
"{% endif %}"
"{% for message in messages %}"
"Question:"
"{% for line in message['query'] %}"
"{% if line['type'] == 'text' %}"
"{{- line['text'] }}"
"{% elif line['type'] == 'image' %}"
"{{ '' }}"
"{% endif %}"
"{% endfor %}"
"<end_of_utterance>\n"
"{% if 'answer' in message %}"
"Short answer: "
"{% for line in message['answer'] %}"
"{% if line['type'] == 'text' %}"
"{{- line['text'] }}"
"{% elif line['type'] == 'image' %}"
"{{ '' }}"
"{% endif %}"
"{% endfor %}"
"<end_of_utterance>\n"
"{% endif %}"
"\n"
"{% endfor %}"
"{% if add_generation_prompt %}"
"Short answer: "
"{% endif %}"
)

text_prompt = processor.tokenizer.apply_chat_template(conversation, chat_template=chat_template, add_generation_prompt=True)
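
One thing worth checking, assuming the template above is exactly what ran: its image branch emits `{{ '' }}`, so the rendered prompt would contain no image placeholder at all, which would match the `(0)` in the error. The sketch below is a plain-Python approximation of the template (whitespace handling is simplified, and `render_prompt` is a hypothetical helper, not part of transformers). It compares the prompt produced with an empty string against one produced with `<image>`, which is assumed here to be the string the tokenizer maps to the image token.

```python
IMAGE_PLACEHOLDER = "<image>"  # assumption: the tokenizer's image token string


def _render_lines(lines, image_text):
    """Render a list of {'type': 'text'|'image', ...} entries."""
    out = []
    for line in lines:
        if line["type"] == "text":
            out.append(line["text"])
        elif line["type"] == "image":
            out.append(image_text)
    return out


def render_prompt(messages, image_text, add_generation_prompt=True):
    """Approximate the chat_template; image_text is emitted for image entries."""
    parts = []
    if messages and messages[0].get("role") == "instruction":
        parts.append("Instruction: " + messages[0]["content"] + "\n")
        messages = messages[1:]
    for message in messages:
        parts.append("Question:")
        parts.extend(_render_lines(message["query"], image_text))
        parts.append("<end_of_utterance>\n")
        if "answer" in message:
            parts.append("Short answer: ")
            parts.extend(_render_lines(message["answer"], image_text))
            parts.append("<end_of_utterance>\n")
        parts.append("\n")
    if add_generation_prompt:
        parts.append("Short answer: ")
    return "".join(parts)


# Hypothetical conversation in the shape the template expects.
conversation = [
    {"query": [{"type": "image"}, {"type": "text", "text": "What is shown?"}]}
]

as_posted = render_prompt(conversation, image_text="")
with_placeholder = render_prompt(conversation, image_text=IMAGE_PLACEHOLDER)

print(as_posted.count(IMAGE_PLACEHOLDER))         # placeholders with the empty branch
print(with_placeholder.count(IMAGE_PLACEHOLDER))  # placeholders with '<image>' emitted
```

If the first count is zero on the real prompt as well, the processor has no position at which to insert image features, regardless of how many images are passed alongside the text.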

huggingface/transformers#32303