DAMO-NLP-SG/VideoLLaMA2

Can we do text-only, image-text, and video-text finetuning with LoRA in one run?

thisurawz1 opened this issue · 4 comments

Can we do text-only, image-text, and video-text finetuning with LoRA in one run? I mean, can we put text-only, image-text, and video-text samples in the same custom.json file and do the fine-tuning?

Yes, you can. The LazySupervisedDataset in train.py unifies the processing of pure-text, image-text, and video-text data samples.
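
For intuition, here is a minimal, hypothetical sketch of how such a unified dataset can dispatch on the keys present in each sample. This is not the actual LazySupervisedDataset code; the class name and the process_image/process_video helpers are placeholders.

    import json

    def process_image(path):
        # Placeholder: a real implementation would load the image and run the visual preprocessor.
        return path

    def process_video(path):
        # Placeholder: a real implementation would sample frames and run the visual preprocessor.
        return path

    class UnifiedSFTDataset:
        # Illustrative only: branch on which key each sample carries.
        def __init__(self, json_path):
            with open(json_path) as f:
                self.samples = json.load(f)

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            sample = self.samples[idx]
            if "image" in sample:
                visual = process_image(sample["image"])   # image-text sample
            elif "video" in sample:
                visual = process_video(sample["video"])   # video-text sample
            else:
                visual = None                             # pure-text sample
            return {"visual": visual, "conversations": sample["conversations"]}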

@thisurawz1, can you share a sample JSON in the format you are using for finetuning, containing text-only, image-text, and video-text samples?

image-text - id 0
video-text - id 1
text-only - id 3

 [
    {
        "id": 0,
        "image": "images/xxx.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            }
        ]
    },
    {
        "id": 1,
        "video": "videos/xxx.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat are the main activities that take place in the video?"
            },
            {
                "from": "gpt",
                "value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
            }
        ]
    },
    {
        "id": 3,
        "model": "",
        "conversations": [
            {
                "from": "human",
                "value": "What are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            }
        ]
    }
]
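
As a quick sanity check before training, a short script like the one below (illustrative, not part of the repo) can confirm that custom.json parses and report how many samples of each modality it contains. The file path is an assumption based on the folder structure shown below.

    import json

    with open("datasets/custom_sft/custom.json") as f:  # assumed path, see folder structure below
        samples = json.load(f)

    counts = {"image-text": 0, "video-text": 0, "text-only": 0}
    for sample in samples:
        if "image" in sample:
            counts["image-text"] += 1
        elif "video" in sample:
            counts["video-text"] += 1
        else:
            counts["text-only"] += 1
    print(counts)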

folder structure
VideoLLaMA2
├── datasets
│   ├── custom_sft
│   │   ├── images
│   │   ├── videos
│   │   └── custom.json
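
It can also help to verify that every image or video path referenced in custom.json actually exists on disk. A minimal check, assuming the paths in the JSON are relative to the custom_sft folder, might look like this:

    import json
    import os

    root = "datasets/custom_sft"  # assumed dataset root
    with open(os.path.join(root, "custom.json")) as f:
        samples = json.load(f)

    missing = [sample[key]
               for sample in samples
               for key in ("image", "video")
               if key in sample and not os.path.exists(os.path.join(root, sample[key]))]
    print("missing files:", missing if missing else "none")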
