
BLIP2-Japanese

This project builds upon the LAVIS library's BLIP2 model.

The main idea is to replace the tokenizer and the underlying BERT model in BLIP2's Q-Former with ones pretrained on Japanese datasets, and then retrain the updated model on Japanese captioning datasets.
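As a rough sketch of that swap, using the Hugging Face transformers API (the model name below is the standard Tohoku Japanese BERT; the actual wiring into the Q-Former lives in this repository's model code):

    from transformers import BertConfig, BertJapaneseTokenizer, BertModel

    # Japanese BERT replacing the original English bert-base-uncased backbone.
    # BertJapaneseTokenizer additionally requires the fugashi and ipadic packages.
    name = "cl-tohoku/bert-base-japanese-whole-word-masking"
    tokenizer = BertJapaneseTokenizer.from_pretrained(name)
    config = BertConfig.from_pretrained(name)
    bert = BertModel.from_pretrained(name)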

The model has been trained on the COCO dataset with STAIR Captions.

Quick Start

The weights of Blip2_Japanese_qformer trained on STAIR can be obtained from this link.

Copy the whole folder into the lavis directory and make sure the folder is named pretrained.

Also download the bert-base-japanese-whole-word-masking weights and config from the Hugging Face link.

You should now be able to run the example.ipynb notebook.
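Loading the model inside the notebook should then follow the usual LAVIS pattern, roughly like the sketch below (the name and model_type strings registered by this fork are assumptions; use the values shown in example.ipynb):

    import torch
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # name/model_type are placeholders; check example.ipynb for the exact values.
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip2", model_type="pretrain", is_eval=True, device=device
    )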

For directory naming conventions, you can also refer to the .gitignore file.

Use Case: Generate Japanese Captions for Captioning Datasets

Captions generated for the flickr30k dataset can be found in flickr30k_caption.json; the generation script is in flickr30k_caption_generate.ipynb.

These captions are generated using top-k sampling instead of nucleus sampling.
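For reference, this is roughly what the two sampling strategies do for a single next-token distribution (an illustrative sketch, not the notebook's generation code):

    import torch
    import torch.nn.functional as F

    def sample_top_k(logits: torch.Tensor, k: int = 50) -> int:
        """Sample a token id from the k most probable entries of `logits`."""
        topk_logits, topk_ids = logits.topk(k)
        probs = F.softmax(topk_logits, dim=-1)
        return topk_ids[torch.multinomial(probs, 1)].item()

    def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> int:
        """Nucleus sampling: keep the smallest set of tokens whose mass exceeds p."""
        sorted_logits, sorted_ids = logits.sort(descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        keep = probs.cumsum(dim=-1) <= p
        keep[0] = True                      # always keep the most probable token
        probs = probs * keep
        probs = probs / probs.sum()
        return sorted_ids[torch.multinomial(probs, 1)].item()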

Captions generated by the pretrained and finetuned models are shown below:

Image 1001773457.jpg:

pretrained: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で フリスビー を し て いる']} # "Two dogs are playing frisbee on the road" (no frisbee in the image)

finetuned: {'image': '1001773457.jpg', 'caption': ['二 匹 の 犬 が 道路 で 喧嘩 を し て いる']} # "Two dogs are fighting on the road"

Image 1001573224.jpg:

pretrained: {'image': '1001573224.jpg', 'caption': ['6 人 の 女性 が 屋内 で 飛び跳ね て いる']} # "Six women are jumping indoors" (wrong head count)

finetuned: {'image': '1001573224.jpg', 'caption': ['黒い 服 を 着 た 女性 たち が 飛び跳ね て いる']} # "Women wearing black clothes are jumping"

In general, captions generated by the finetuned model are more accurate.

Use Case: Image Retrieval

Refer to the example.ipynb notebook for more details. The idea is to compute the average cosine similarity of the query tokens between the image embeddings and the multimodal (image + text) embeddings.
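A minimal sketch of that scoring step, assuming both the image-side and the multimodal-side Q-Former outputs have shape (num_query_tokens, dim) (function and variable names are illustrative, not the notebook's):

    import torch
    import torch.nn.functional as F

    def retrieval_score(image_query_embeds: torch.Tensor,
                        multimodal_query_embeds: torch.Tensor) -> float:
        """Average cosine similarity between corresponding query tokens."""
        sims = F.cosine_similarity(image_query_embeds, multimodal_query_embeds, dim=-1)
        return sims.mean().item()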

Model training

The model was trained on a single RTX 4080 laptop GPU, so the training config was modified as follows:

In blip2_pretrain.yaml: vit_precision = 'fp16'

In pretrain_stage1.yaml: batch_size = 25

During evaluation you have to change vit_precision back to fp32.
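If you want to double-check these values programmatically, LAVIS reads them from OmegaConf YAML files; a small sketch (the file paths and field names follow upstream LAVIS and may differ slightly in this fork):

    from omegaconf import OmegaConf

    model_cfg = OmegaConf.load("lavis/configs/models/blip2/blip2_pretrain.yaml")
    print(model_cfg.model.vit_precision)   # 'fp16' for training, 'fp32' for evaluation

    run_cfg = OmegaConf.load("lavis/projects/blip2/train/pretrain_stage1.yaml")
    print(run_cfg.run.batch_size_train)    # reduced to 25 to fit on the laptop GPU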

The pretrained and finetuned weights may be updated without prior notice, so if you cannot reproduce the results in the example notebook, please re-download the weights and try again.

User Interface for Japanese Caption Generator

A simple interface for demo purposes can be found in generator-ui.py. To run the UI:

   python generator-ui.py

(Screenshot of the caption generator UI)