EchoSight: Advancing Visual-Language Models with Wiki Knowledge (EMNLP 2024 Findings)

This is the official PyTorch implementation of EchoSight: Advancing Visual-Language Models with Wiki Knowledge.

[Project Page] [Paper]


Requirements

  1. (Optional) Create a conda environment
conda create -n echosight python=3.10
conda activate echosight
  2. Install the required packages
pip install -r requirements.txt

Knowledge Base

We provide the knowledge bases used in EchoSight. The knowledge base file follows the same format as that of the Encyclopedic-VQA dataset. Apart from the original 2M knowledge base for Encyclopedic-VQA, we also provide a 100K knowledge base for InfoSeek, which is a filtered subset of the 2M knowledge base. The knowledge base files can be downloaded from the following links:

Encyclopedic-VQA

InfoSeek

VQA Questions

Encyclopedic-VQA

The VQA questions can be downloaded in .csv format here (provided by Encyclopedic-VQA):

To download the images in Encyclopedic-VQA:

InfoSeek

The VQA questions of InfoSeek are transformed into E-VQA format from the original InfoSeek dataset. The questions can be downloaded in .csv format here:

To download the images in InfoSeek:

Training

The multimodal reranker of EchoSight is trained on the Encyclopedic-VQA dataset and the corresponding 2M knowledge base. If you want to enable hard negative sampling when training the reranker, we provide our hard negative results sampled with Eva-CLIP here:

To train the multimodal reranker, run the bash script after changing the necessary configurations.

bash scripts/train_reranker.sh

Script Details

The train_reranker.sh script fine-tunes the reranker module with the following parameters (an illustrative invocation with placeholder values follows the list):

--blip-model-name: Name of the BLIP model to be used for reranking.

--num-epochs: Number of epochs for training. In this case, the model will be trained for 20 epochs.

--num-workers: Number of worker threads for data loading.

--learning-rate: Learning rate for the optimizer.

--batch-size: Number of samples per batch during training.

--transform: Transformation applied to the data. targetpad ensures the data is padded to a target size.

--target-ratio: Target aspect ratio for the padding transformation.

--save_frequency: Frequency (in steps) to save the model checkpoints.

--train_file: Path to the training data file. The training file should be in the same format as that provided by Encyclopedic-VQA.

--knowledge_base_file: Path to the knowledge base file in JSON format. The format should be the same as that of Encyclopedic-VQA.

--negative_db_file: Path to the hard negative sampled database file used for training.

--inat_id2name: Path to the iNaturalist ID to name mapping file.

--save-training: Flag to save the training progress.
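
For reference, a minimal sketch of such an invocation is shown below. It is illustrative only: the Python entry point name and all values are placeholders (except the flags listed above and the 20-epoch setting), so substitute the actual entry point, paths, and hyperparameters from scripts/train_reranker.sh and your local setup.

# Illustrative sketch only: entry point and values are placeholders.
python train_reranker.py \
    --blip-model-name blip2 \
    --num-epochs 20 \
    --num-workers 8 \
    --learning-rate 1e-5 \
    --batch-size 16 \
    --transform targetpad \
    --target-ratio 1.25 \
    --save_frequency 5000 \
    --train_file /path/to/evqa_train.csv \
    --knowledge_base_file /path/to/knowledge_base.json \
    --negative_db_file /path/to/hard_negative_db.json \
    --inat_id2name /path/to/inat_id2name.json \
    --save-training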

Inference

  1. Our reranker module weights can be downloaded at [Checkpoint].

  2. To perform inference with the trained model, run the provided test_reranker.sh script after adjusting the necessary parameters.

bash scripts/test_reranker.sh

Script Details

The test_reranker.sh script uses the following parameters for inference (an illustrative invocation with placeholder values follows the list):

--test_file: Path to the test file.

--knowledge_base: Path to the knowledge base JSON file.

--faiss_index: Path to the FAISS index file for efficient similarity search.

--retriever_vit: Name of the visual transformer model used for initial retrieval. In the example script, eva-clip is used.

--top_ks: Comma-separated list of top-k values at which retrieval recall is reported (e.g., 1,5,10,20).

--retrieval_top_k: The top-k value used for retrieval.

--perform_qformer_reranker: Flag to perform reranking using QFormer.

--qformer_ckpt_path: Path to the QFormer checkpoint file.

--perform_qformer_reranker: Flag to perform the final VQA answer generation.

--save_result: Flag to save the inference result.

--save_result_path: Path where the result JSON file will be saved.

--resume_from: Path to the retrieval result. If this parameter is used, the inference process loads the saved retrieval result instead of running the retriever on the fly.
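
For reference, a minimal sketch of a retrieval-and-reranking invocation is shown below. It is illustrative only: the Python entry point name and all paths are placeholders, so consult scripts/test_reranker.sh for the actual values.

# Illustrative sketch only: entry point and paths are placeholders.
python test_reranker.py \
    --test_file /path/to/evqa_test.csv \
    --knowledge_base /path/to/knowledge_base.json \
    --faiss_index /path/to/knowledge_base.index \
    --retriever_vit eva-clip \
    --top_ks 1,5,10,20 \
    --retrieval_top_k 20 \
    --perform_qformer_reranker \
    --qformer_ckpt_path /path/to/reranker_checkpoint.pth \
    --save_result \
    --save_result_path /path/to/retrieval_result.json

Passing --resume_from with a previously saved retrieval result would skip the on-the-fly retrieval step, as described above.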

  3. (Optional) With the saved retrieval or reranked results, answer generation can be performed standalone.
bash scripts/test_vqa.sh
  4. (Optional) Run the batch inference VQA script (releasing soon).

Script Details

The test_vqa.sh script uses the following parameters for inference (an illustrative invocation with placeholder values follows the list):

--test_file: Path to the test file.

--retrieval_results: Path to the retrieval result file.

--answer_generator: Name of the answer generator model to be used. Choose from [Mistral, LLaMA3, GPT4, PaLM].

--llm_checkpoint: Path to the Mistral or LLaMA3 checkpoint file. If using GPT4 or PaLM, this parameter is not needed. Instead, change api_key in model/anwser_generator.py.

--output_file: Path to the output file. Default is ./answer.json.
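
As above, a minimal sketch of a standalone answer-generation invocation follows. It is illustrative only: the Python entry point name and paths are placeholders, so consult scripts/test_vqa.sh for the actual values.

# Illustrative sketch only: entry point and paths are placeholders.
python test_vqa.py \
    --test_file /path/to/evqa_test.csv \
    --retrieval_results /path/to/retrieval_result.json \
    --answer_generator Mistral \
    --llm_checkpoint /path/to/mistral-7b-instruct \
    --output_file ./answer.json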

Demo

Run the demo of EchoSight.

python app.py

Demo Showcase

(Screenshots of the EchoSight demo.)

Citation

@inproceedings{yan-xie-2024-echosight,
    title = "{E}cho{S}ight: Advancing Visual-Language Models with {W}iki Knowledge",
    author = "Yan, Yibin  and
      Xie, Weidi",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.83",
    pages = "1538--1551",
    abstract = "Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowledge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge. In this paper, we introduce **EchoSight**, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To strive for high-performing retrieval, EchoSight first searches wiki articles by using visual-only information, subsequently, these candidate articles are further reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the E-VQA and InfoSeek datasets demonstrate that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8{\%} on E-VQA and 31.3{\%} on InfoSeek.",
}

Acknowledgements

Thanks to LAVIS for the codebase, and to Encyclopedic-VQA and InfoSeek for the data.