Z-LaVI

Official code for the EMNLP 2022 paper "Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination".

Set up the environment

We run our experiments using Python 3.9.12. You can install the required packages using:

pip install -r requirements.txt

Directories

  • /data stores the processed data of all datasets, including AG News, ARC, CoarseWSD-20, QASC, SciQ, Situation, ViComTe.
  • /model stores the code to run the models.
  • /output stores the models' predictions.
  • /scripts contains some useful scripts.

Get single model performance

First, you need to get the performance of all language models and CLIP. We have three categories of language models (prompt-based, latent-embedding-based, and NLI-based) and evaluate on 8 datasets.

  • Models:
    • Prompt-based: gpt-neo-1.3B, gpt-neo-2.7B, gpt-j-6B, opt-30b
    • Latent-based: simcse, sbert
    • NLI-based: roberta, bart
  • Datasets: ag_news, situation, arc_easy, arc_challenge, qasc, sciq, coarse_wsd, vicomte.

Navigate to the /model directory, then:

Prompt-based language models

/model/prompt.py is the code for the prompt-based models. To get the predictions, run:

python prompt.py dataset model 

Replace dataset and model with the exact names listed above.
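
For example, to get gpt-neo-1.3B's predictions on ag_news:

python prompt.py ag_news gpt-neo-1.3B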

Latent-embedding-based language models

/model/latent.py is the code for the latent-embedding-based models. To get the predictions, run:

python latent.py dataset model 

Replace dataset and model with the exact names listed above.
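
For example, to get simcse's predictions on qasc:

python latent.py qasc simcse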

NLI-based language models

/model/nli.py is the code for the NLI-based models. To get the predictions, run:

python nli.py dataset model 

Replace dataset and model with the exact names listed above.
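
For example, to get roberta's predictions on sciq:

python nli.py sciq roberta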

CLIP

To run the CLIP model, you first need to download the image embeddings here. After unzipping them in the /Z-LaVI folder, you can run:

python clip_zs.py dataset imagine_type

We provide three imagine_type options: synthesis (generated images), recall (web images), and combine (both types of images).
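
For example, to run CLIP on arc_easy with both generated and web images:

python clip_zs.py arc_easy combine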

Evaluate

You can evaluate a single model's performance using /model/evaluate.py. Run:

python evaluate.py dataset model

Replace dataset and model with the exact names. You will see the output printed in the terminal in the following format:

Dataset: dataset
Model: model
Performance:
Metric: number
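
For example, to evaluate gpt-j-6B's predictions on situation:

python evaluate.py situation gpt-j-6B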

Ensemble

You can ensemble CLIP with any language model using /model/ensemble.py:

python ensemble.py dataset clip_{imagine_type} language_model weight

You will see the performance of CLIP, the language model, and their ensemble.
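
For example, to ensemble the combined-imagination CLIP predictions with gpt-j-6B on qasc (the weight value below is only illustrative):

python ensemble.py qasc clip_combine gpt-j-6B 0.5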

Useful Scripts

scripts/bing-vis-search.py is the code to download images from Bing. You first need a paid Azure account and must replace the subscription key in line 26. Then prepare your queries in a txt file, one query per line (separated by \n), and run:

python bing-vis-search.py -f your_queries.txt --threads 200 --limit 300 -o output_dir

--limit is the number of images you want to download; since some images are not downloadable, you may want to set this number higher than the number of images you actually need.
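
For example, a minimal queries file could look like this (the queries are only illustrative):

an orange
a violin
a basketball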

scripts/dalle_generation is the code to generate images with DALLE-mini. You first need to install the required packages following the official DALLE-mini repo. Then you can use the main function in the code to generate images:

# Generate images for a text prompt using the main function defined in the script
prompt = "A photo of an apple."
number_of_images = 8
images = main(prompt, number_of_images)

# Save each generated image to disk
for i, image in enumerate(images):
    image.save("apple_{}.jpg".format(i))

Download the generated images

We have released all the images generated by DALLE-mini; you can download them from here (21.43 GB).

Citation

Please cite our work if you find it useful:

@inproceedings{yang-etal-2022-z,
    title = "{Z}-{L}a{VI}: Zero-Shot Language Solver Fueled by Visual Imagination",
    author = "Yang, Yue  and
      Yao, Wenlin  and
      Zhang, Hongming  and
      Wang, Xiaoyang  and
      Yu, Dong  and
      Chen, Jianshu",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.78",
    pages = "1186--1203",
    abstract = "Large-scale pretrained language models have made significant advances in solving downstream language understanding tasks. However, they  generally suffer from reporting bias, the phenomenon describing the lack of explicit commonsense knowledge in written text, e.g., {''}an orange is orange{''}. To overcome this limitation, we develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities. Specifically, we leverage two complementary types of {''}imaginations{''}: (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation. Jointly exploiting the language inputs and the imagination, a pretrained vision-language model (e.g., CLIP) eventually composes a zero-shot solution to the original language tasks. Notably, fueling language models with imagination can effectively leverage visual knowledge to solve plain language tasks. In consequence, Z-LaVI consistently improves the zero-shot performance of existing language models across a diverse set of language tasks.",
}