How to run inference with the fine-tuned weights / model
thisurawz1 opened this issue · 12 comments
I have already fine-tuned VideoLLaMA2 on a custom dataset using QLoRA. After fine-tuning, I got the above files. Now, how can I run inference with those weights? How can I use the fine-tuned model with the inference script you provided?
Looking forward to a solution as soon as possible. thank you.
`
import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

def inference():
    disable_torch_init()
    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.
    # Image Inference (these assignments override the video settings above)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
    print(output)

if __name__ == "__main__":
    inference()
`
Yes, you can. The latest commit supports loading the LoRA model directly.
Can you share the script for it, please? Do we just have to change the current model path to the LoRA path? I tried that, but it didn't work at all.
Can you please share the exact script for running inference with the LoRA weights?
Can you share a script showing how to load the LoRA model directly? I have already finished fine-tuning and got those files, but I don't know how to run inference with them.
Hello! I have the same problem. Have you solved it?
@thisurawz1 I successfully loaded the LoRA fine-tuned model for inference with the following code. Hope this helps you.
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init
disable_torch_init()
modal = 'video'
modal_path = 'VideoLLaMA2/videollama2/serve/examples/sample_demo_1.mp4'
instruct = 'What is the baby wearing and what is he doing?'
model_path = 'VideoLLaMA2/work_dirs/videollama2/finetune_downstream_sft_settings_qlora_MESC' # your model dir
model, processor, tokenizer = model_init(model_path)
output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
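Before pointing `model_path` at a fine-tuned directory, it can help to confirm that the directory actually contains LoRA adapter files. The sketch below is a minimal pre-flight check using only the standard library. The file names are assumptions: `adapter_config.json` and `adapter_model.safetensors` / `adapter_model.bin` are the usual PEFT output names, and `non_lora_trainables.bin` is the LLaVA-style name for extra trained weights; your training run may have produced different files. The helper names `describe_lora_dir` and `looks_like_lora_dir` are hypothetical and not part of videollama2.

```python
from pathlib import Path

# Assumed file names (not confirmed in this thread): PEFT usually writes
# adapter_config.json plus adapter_model.safetensors or adapter_model.bin.
ADAPTER_CONFIG = "adapter_config.json"
ADAPTER_WEIGHTS = ("adapter_model.safetensors", "adapter_model.bin")
# Assumed optional extras, following the LLaVA-style naming convention.
OPTIONAL_FILES = ("non_lora_trainables.bin", "config.json")


def describe_lora_dir(model_dir):
    """Report which of the expected LoRA output files exist in model_dir."""
    root = Path(model_dir)
    present = {p.name for p in root.iterdir()} if root.is_dir() else set()
    return {
        "is_dir": root.is_dir(),
        "has_adapter_config": ADAPTER_CONFIG in present,
        "has_adapter_weights": any(name in present for name in ADAPTER_WEIGHTS),
        "optional_present": sorted(n for n in OPTIONAL_FILES if n in present),
    }


def looks_like_lora_dir(model_dir):
    """Return True only if both the adapter config and adapter weights exist."""
    info = describe_lora_dir(model_dir)
    return info["has_adapter_config"] and info["has_adapter_weights"]
```

Calling `describe_lora_dir('work_dirs/videollama2/<your run>')` before `model_init` shows at a glance whether the path points at the adapter output or at some other folder.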
Yes. You have to update the videollama2 repository to the latest commit, then use the following script. You only have to change the model path in the original inference script. That's all.
import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

def inference():
    disable_torch_init()
    # Video Inference
    modal = 'video'
    modal_path = 'assets/cat_and_chicken.mp4'
    instruct = 'What animals are in the video, what are they doing, and how does the video feel?'
    # Reply:
    # The video features a kitten and a baby chick playing together. The kitten is seen laying on the floor while the baby chick hops around. The two animals interact playfully with each other, and the video has a cute and heartwarming feel to it.
    # Image Inference (these assignments override the video settings above)
    modal = 'image'
    modal_path = 'assets/sora.png'
    instruct = 'What is the woman wearing, what is she doing, and how does the image feel?'
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B'
    # Fine-tuned model inference (only need to replace model_path)
    # model_path = 'work_dirs/videollama2/finetune_downstream_sft_settings_qlora'  # your fine-tuned weights directory
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)
    print(output)

if __name__ == "__main__":
    inference()
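Since the only change between base and fine-tuned inference is `model_path`, one option is to pass it on the command line instead of editing the script each time. The sketch below assumes the `model_init` / `mm_infer` calls behave as shown in this thread; the flag names (`--model-path`, `--modal`, `--modal-path`, `--instruct`) and the helpers `parse_args` and `run` are hypothetical, not part of videollama2.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (flag names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(description="Run VideoLLaMA2 inference")
    parser.add_argument("--model-path", required=True,
                        help="Hub ID or local directory, e.g. your LoRA output dir")
    parser.add_argument("--modal", choices=["video", "image"], default="video")
    parser.add_argument("--modal-path", required=True,
                        help="Path to the video or image file")
    parser.add_argument("--instruct", required=True, help="Question to ask")
    return parser.parse_args(argv)


def run(args):
    """Load the model at args.model_path and answer args.instruct."""
    # Imported lazily so that argument parsing works even when the
    # videollama2 repository is not on the Python path.
    from videollama2 import model_init, mm_infer
    from videollama2.utils import disable_torch_init

    disable_torch_init()
    model, processor, tokenizer = model_init(args.model_path)
    # Same call as in the scripts in this thread.
    return mm_infer(processor[args.modal](args.modal_path), args.instruct,
                    model=model, tokenizer=tokenizer, do_sample=False,
                    modal=args.modal)
```

To use it as a script, call `print(run(parse_args()))` under the usual `if __name__ == "__main__":` guard and run it from the repository root, passing your fine-tuned directory as `--model-path`.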
Thank you so much
Thank you, I will try this.
Dear author, I applied your LoRA checkpoint folder structure and loading example code (#36) to my finetune_qlora inference code, running on my own experiment video data, but it still produces errors. The old inference code from the README works. I just put your code into my script. Please help me!
1: My finetune_qlora inference code:
import torch
import transformers
import sys
sys.path.append('./')
from videollama2.conversation import conv_templates
from videollama2.constants import DEFAULT_MMODAL_TOKEN, MMODAL_TOKEN_INDEX
from videollama2.mm_utils import get_model_name_from_path, tokenizer_MMODAL_token, process_video, process_image
from videollama2.model.builder import load_pretrained_model

def inference():
    # Video Inference
    paths = ['./datasets/test_data/videos/video_202.mp4']
    questions = ['hidden****']
    # Reply:
    modal_list = ['video']
    # Image Inference
    #paths = ['assets/sora.png']
    #questions = ['What is the woman wearing, what is she doing, and how does the image feel?']
    # Reply:
    # The woman in the image is wearing a black coat and sunglasses, and she is walking down a rain-soaked city street. The image feels vibrant and lively, with the bright city lights reflecting off the wet pavement, creating a visually appealing atmosphere. The woman's presence adds a sense of style and confidence to the scene, as she navigates the bustling urban environment.
    #modal_list = ['image']
    # 1. Initialize the model.
    model_path = './checkpoints/VideoLLaMA2-7B-qlora' #./checkpoints/VideoLLaMA2-7B
    # Base model inference (only need to replace model_path)
    # model_path = 'DAMO-NLP-SG/VideoLLaMA2-7B-Base'
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name) # None
    model = model.to('cuda:0')
    conv_mode = 'llama2'
    # 2. Visual preprocess (load & transform image or video).
    if modal_list[0] == 'video':
        tensor = process_video(paths[0], processor, model.config.image_aspect_ratio).to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["VIDEO"]
        modal_token_index = MMODAL_TOKEN_INDEX["VIDEO"]
    else:
        tensor = process_image(paths[0], processor, model.config.image_aspect_ratio)[0].to(dtype=torch.float16, device='cuda', non_blocking=True)
        default_mm_token = DEFAULT_MMODAL_TOKEN["IMAGE"]
        modal_token_index = MMODAL_TOKEN_INDEX["IMAGE"]
    tensor = [tensor]
    # 3. Text preprocess (tag process & generate prompt).
    question = default_mm_token + "\n" + questions[0]
    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_MMODAL_token(prompt, tokenizer, modal_token_index, return_tensors='pt').unsqueeze(0).to('cuda:0')
    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images_or_videos=tensor,
            modal_list=modal_list,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
        )
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(outputs[0])

if __name__ == "__main__":
    inference()
2: Terminal errors:
(videollama2) lm@SR6430G23:~/videollama2/VideoLLaMA2$ /home/lm/anaconda3/envs/videollama2/bin/python inference.py
200
Loading VideoLLaMA from base model...
Loading checkpoint shards: 100%|██████████| 3/3 [00:13<00:00, 4.36s/it]
Some weights of Videollama2MistralForCausalLM were not initialized from the model checkpoint at ./checkpoints/Mistral-7B-Instruct-v0.2 and are newly initialized: ['model.mm_projector.readout.0.bias', 'model.mm_projector.readout.0.weight', 'model.mm_projector.readout.2.bias', 'model.mm_projector.readout.2.weight', 'model.mm_projector.s1.b1.conv1.bn.bias', 'model.mm_projector.s1.b1.conv1.bn.weight', 'model.mm_projector.s1.b1.conv1.conv.weight', 'model.mm_projector.s1.b1.conv2.bn.bias', 'model.mm_projector.s1.b1.conv2.bn.weight', 'model.mm_projector.s1.b1.conv2.conv.weight', 'model.mm_projector.s1.b1.conv3.bn.bias', 'model.mm_projector.s1.b1.conv3.bn.weight', 'model.mm_projector.s1.b1.conv3.conv.weight', 'model.mm_projector.s1.b1.downsample.bn.bias', 'model.mm_projector.s1.b1.downsample.bn.weight', 'model.mm_projector.s1.b1.downsample.conv.weight', 'model.mm_projector.s1.b1.se.fc1.bias', 'model.mm_projector.s1.b1.se.fc1.weight', 'model.mm_projector.s1.b1.se.fc2.bias', 'model.mm_projector.s1.b1.se.fc2.weight', 'model.mm_projector.s1.b2.conv1.bn.bias', 'model.mm_projector.s1.b2.conv1.bn.weight', 'model.mm_projector.s1.b2.conv1.conv.weight', 'model.mm_projector.s1.b2.conv2.bn.bias', 'model.mm_projector.s1.b2.conv2.bn.weight', 'model.mm_projector.s1.b2.conv2.conv.weight', 'model.mm_projector.s1.b2.conv3.bn.bias', 'model.mm_projector.s1.b2.conv3.bn.weight', 'model.mm_projector.s1.b2.conv3.conv.weight', 'model.mm_projector.s1.b2.se.fc1.bias', 'model.mm_projector.s1.b2.se.fc1.weight', 'model.mm_projector.s1.b2.se.fc2.bias', 'model.mm_projector.s1.b2.se.fc2.weight', 'model.mm_projector.s1.b3.conv1.bn.bias', 'model.mm_projector.s1.b3.conv1.bn.weight', 'model.mm_projector.s1.b3.conv1.conv.weight', 'model.mm_projector.s1.b3.conv2.bn.bias', 'model.mm_projector.s1.b3.conv2.bn.weight', 'model.mm_projector.s1.b3.conv2.conv.weight', 'model.mm_projector.s1.b3.conv3.bn.bias', 'model.mm_projector.s1.b3.conv3.bn.weight', 'model.mm_projector.s1.b3.conv3.conv.weight', 
'model.mm_projector.s1.b3.se.fc1.bias', 'model.mm_projector.s1.b3.se.fc1.weight', 'model.mm_projector.s1.b3.se.fc2.bias', 'model.mm_projector.s1.b3.se.fc2.weight', 'model.mm_projector.s1.b4.conv1.bn.bias', 'model.mm_projector.s1.b4.conv1.bn.weight', 'model.mm_projector.s1.b4.conv1.conv.weight', 'model.mm_projector.s1.b4.conv2.bn.bias', 'model.mm_projector.s1.b4.conv2.bn.weight', 'model.mm_projector.s1.b4.conv2.conv.weight', 'model.mm_projector.s1.b4.conv3.bn.bias', 'model.mm_projector.s1.b4.conv3.bn.weight', 'model.mm_projector.s1.b4.conv3.conv.weight', 'model.mm_projector.s1.b4.se.fc1.bias', 'model.mm_projector.s1.b4.se.fc1.weight', 'model.mm_projector.s1.b4.se.fc2.bias', 'model.mm_projector.s1.b4.se.fc2.weight', 'model.mm_projector.s2.b1.conv1.bn.bias', 'model.mm_projector.s2.b1.conv1.bn.weight', 'model.mm_projector.s2.b1.conv1.conv.weight', 'model.mm_projector.s2.b1.conv2.bn.bias', 'model.mm_projector.s2.b1.conv2.bn.weight', 'model.mm_projector.s2.b1.conv2.conv.weight', 'model.mm_projector.s2.b1.conv3.bn.bias', 'model.mm_projector.s2.b1.conv3.bn.weight', 'model.mm_projector.s2.b1.conv3.conv.weight', 'model.mm_projector.s2.b1.se.fc1.bias', 'model.mm_projector.s2.b1.se.fc1.weight', 'model.mm_projector.s2.b1.se.fc2.bias', 'model.mm_projector.s2.b1.se.fc2.weight', 'model.mm_projector.s2.b2.conv1.bn.bias', 'model.mm_projector.s2.b2.conv1.bn.weight', 'model.mm_projector.s2.b2.conv1.conv.weight', 'model.mm_projector.s2.b2.conv2.bn.bias', 'model.mm_projector.s2.b2.conv2.bn.weight', 'model.mm_projector.s2.b2.conv2.conv.weight', 'model.mm_projector.s2.b2.conv3.bn.bias', 'model.mm_projector.s2.b2.conv3.bn.weight', 'model.mm_projector.s2.b2.conv3.conv.weight', 'model.mm_projector.s2.b2.se.fc1.bias', 'model.mm_projector.s2.b2.se.fc1.weight', 'model.mm_projector.s2.b2.se.fc2.bias', 'model.mm_projector.s2.b2.se.fc2.weight', 'model.mm_projector.s2.b3.conv1.bn.bias', 'model.mm_projector.s2.b3.conv1.bn.weight', 'model.mm_projector.s2.b3.conv1.conv.weight', 
'model.mm_projector.s2.b3.conv2.bn.bias', 'model.mm_projector.s2.b3.conv2.bn.weight', 'model.mm_projector.s2.b3.conv2.conv.weight', 'model.mm_projector.s2.b3.conv3.bn.bias', 'model.mm_projector.s2.b3.conv3.bn.weight', 'model.mm_projector.s2.b3.conv3.conv.weight', 'model.mm_projector.s2.b3.se.fc1.bias', 'model.mm_projector.s2.b3.se.fc1.weight', 'model.mm_projector.s2.b3.se.fc2.bias', 'model.mm_projector.s2.b3.se.fc2.weight', 'model.mm_projector.s2.b4.conv1.bn.bias', 'model.mm_projector.s2.b4.conv1.bn.weight', 'model.mm_projector.s2.b4.conv1.conv.weight', 'model.mm_projector.s2.b4.conv2.bn.bias', 'model.mm_projector.s2.b4.conv2.bn.weight', 'model.mm_projector.s2.b4.conv2.conv.weight', 'model.mm_projector.s2.b4.conv3.bn.bias', 'model.mm_projector.s2.b4.conv3.bn.weight', 'model.mm_projector.s2.b4.conv3.conv.weight', 'model.mm_projector.s2.b4.se.fc1.bias', 'model.mm_projector.s2.b4.se.fc1.weight', 'model.mm_projector.s2.b4.se.fc2.bias', 'model.mm_projector.s2.b4.se.fc2.weight', 'model.mm_projector.sampler.0.bias', 'model.mm_projector.sampler.0.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading additional VideoLLaMA weights...
Loading LoRA weights...
Merging LoRA weights...
Model is loaded...
Loading VideoLLaMA 2 from base model...
You are using a model of type mistral to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
File "inference.py", line 166, in <module>
inference()
File "inference.py", line 127, in inference
tokenizer, model, processor, context_len = load_pretrained_model(model_path, './checkpoints/Mistral-7B-Instruct-v0.2', model_name) # None
File "/home/lm/videollama2/VideoLLaMA2/videollama2/model/builder.py", line 140, in load_pretrained_model
model = Videollama2MistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
) = cls._load_pretrained_model(
File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 889, in _load_state_dict_into_meta_model
hf_quantizer.create_quantized_param(model, param, param_name, param_device, state_dict, unexpected_keys)
File "/home/lm/anaconda3/envs/videollama2/lib/python3.8/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 190, in create_quantized_param
raise ValueError(
ValueError: Supplied state dict for model.layers.0.mlp.down_proj.weight does not contain `bitsandbytes__*` and possibly other `quantized_stats` components.
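One possible reading of this traceback (an assumption, not something confirmed in this thread): the `config.json` saved in the QLoRA output directory carries a `quantization_config` entry, so `from_pretrained` expects 4-bit bitsandbytes weights while the Mistral base checkpoint holds ordinary full-precision weights. The "Unused kwargs" line mentioning `BitsAndBytesConfig` is consistent with that. If that is the case, removing the entry from the fine-tuned directory's `config.json` may let the base weights load normally. The sketch below does this with the standard library only and keeps a backup; the helper name `strip_quantization_config` is hypothetical.

```python
import json
import shutil
from pathlib import Path


def strip_quantization_config(model_dir, key="quantization_config"):
    """Remove `key` from model_dir/config.json, keeping a .bak copy.

    Returns True if the key was present and removed, False otherwise.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if key not in config:
        # Nothing to change; leave the file untouched.
        return False
    # Keep the original so the change can be reverted.
    backup_path = config_path.parent / (config_path.name + ".bak")
    shutil.copy2(config_path, backup_path)
    del config[key]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

For the script above this would be `strip_quantization_config('./checkpoints/VideoLLaMA2-7B-qlora')`, run once before loading. If the error persists, restore `config.json.bak` and look elsewhere.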
Hello, I'm a PhD student from ZJU. I also use VideoLLaMA2 for my own research. We have created a WeChat group to discuss VideoLLaMA2 issues and help each other. Would you like to join us? Please contact me: WeChat number == LiangMeng19357260600, phone number == +86 19357260600, e-mail == liangmeng89@zju.edu.cn.