baaivision/CapsFusion

Trained VLM (Emu) weights and training code


Hi, thank you for making this wonderful dataset publicly available!

According to the paper, Emu is used as the VLM, and the model is trained on the proposed CapsFusion dataset.

Would it be possible to release the trained model and the training code? I checked the Emu repository but could not find the training code.
I would like to reproduce the results of Table 1 in the CapsFusion paper.

Thank you in advance!

Best regards,
Ryo

Thank you for your interest in our work!

We do not plan to release the training code, but we believe that training LMMs on our dataset can easily reach the performance reported in Table 1. For LMM training, you can refer to well-established training frameworks such as LLaVA, with the dataset replaced by ours; a sketch of such a data conversion is given below.
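For reference, here is a rough conversion sketch. It assumes CapsFusion is stored as parquet shards with "image_path" and "caption" columns (the actual release schema may differ, so adjust the names accordingly) and targets the single-turn image-caption JSON layout used by LLaVA's pretraining stage:

import json
import pandas as pd

# Assumed shard name and column names; adjust to the actual CapsFusion release.
df = pd.read_parquet("capsfusion_shard_000.parquet")
records = []
for i, row in df.iterrows():
    records.append({
        "id": str(i),
        "image": row["image_path"],  # path relative to the image folder
        "conversations": [
            # One image-caption turn, as in LLaVA's pretraining data.
            {"from": "human", "value": "<image>\nDescribe the image concisely."},
            {"from": "gpt", "value": row["caption"]},
        ],
    })
with open("capsfusion_llava_pretrain.json", "w") as f:
    json.dump(records, f)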

Besides, the training on CapsFusion was set up purely as an apples-to-apples comparison between datasets, so the model itself is not expected to achieve strong performance and would not be meaningful to release. The training is short (just 1 epoch) and no instruction-tuning datasets are used, which leads to weak instruction-following ability.

Another noteworthy point: because the output format of the pretrained model is hard to control, we apply some post-processing to model outputs before evaluation. You can apply the same processing when comparing the performance of pretrained models:

# Keep only the first line of the model output.
caption = caption.split('\n')[0]
# Keep only the first sentence.
caption = caption.split('. ')[0]
# Strip a trailing period, if any.
caption = caption if len(caption) == 0 or caption[-1] != '.' else caption[:-1]
# Lowercase for evaluation.
caption = caption.lower()
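In case it helps, a small sketch wrapping these steps into a reusable helper, with a toy input to show the effect (the function name and example string are ours, not part of the repo):

def postprocess_caption(caption: str) -> str:
    caption = caption.split('\n')[0]   # first line only
    caption = caption.split('. ')[0]   # first sentence only
    if caption.endswith('.'):
        caption = caption[:-1]         # drop a trailing period
    return caption.lower()

print(postprocess_caption("A dog runs on the beach. It looks happy.\nExtra text."))
# -> "a dog runs on the beach"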

Thank you.