X-PLUG/mPLUG-Owl

Does mPLUG-Owl2 support video training?

YuzhouPeng opened this issue · 3 comments

Does mPLUG-Owl2 support video training?

You can decode video as multiple images for training.
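The reply above suggests decoding a video into multiple images. A minimal sketch of the frame-selection step, assuming uniform sampling (the function name `sample_frame_indices` is illustrative and not part of mPLUG-Owl):

```python
# Hypothetical sketch: uniformly sample `num_samples` frame indices from a
# video with `num_frames` total frames. The selected frames would then be
# decoded (e.g. with a video reader) into the image list fed to training.
def sample_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` evenly spaced frame indices out of `num_frames`."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal segments so the
    # samples cover the whole clip rather than clustering at the start.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

For example, a 100-frame clip sampled down to 4 frames yields indices spread evenly across the video.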

Hi,
I tried to use image frames from a video as a sequence of images and ran inference on multiple images as below:

image_tensor = process_images([image1, image2], image_processor)
query = "Summarize the images"
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        streamer=streamer,
        use_cache=True,
        stopping_criteria=[stopping_criteria])

outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
print(outputs)

Even though the first line in the code above gives me a tensor of shape [2, 3, 448, 448], the summary generated by the model focuses solely on the content of image1. Is this the right way to do it?
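One likely cause: in LLaVA-style pipelines, which mPLUG-Owl2 follows, the prompt usually needs one image placeholder token per image tensor passed to `generate`; if the query carries only a single placeholder, the model may attend to only the first image. A hedged sketch of prompt construction, where the exact token string and the helper name are assumptions, not confirmed mPLUG-Owl2 API:

```python
# Assumption: "<|image|>" is the placeholder token the tokenizer expands
# into image features; check the repo's constants for the real value.
IMAGE_TOKEN = "<|image|>"

def build_multi_image_query(question: str, num_images: int) -> str:
    """Prefix the question with one placeholder per image tensor."""
    return IMAGE_TOKEN * num_images + question

# With two frames, the query would carry two placeholders:
query = build_multi_image_query("Summarize the images", 2)
```

The `query` string (rather than one with a single placeholder) would then be tokenized into `input_ids` before calling `model.generate`.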

I am also curious how to use an image sequence to understand an entire video. How should the context be built? @MAGAer13