


Code for the paper VisToT: Vision-Augmented Table-to-Text Generation (EMNLP 2022).

Project Page | Paper


Set up the environment

# create a new conda environment
conda create -n vt3 python=3.8.13

# activate environment
conda activate vt3

# install pytorch
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch

# install other dependencies
pip install -r requirements.txt
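
After installation, it can help to confirm that the active environment actually picked up the pinned versions. The following is a minimal sketch and not part of the repository: the `find_mismatches` helper is hypothetical, and it assumes the packages register under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata
from typing import Callable, Dict, List

# Pins copied from the conda command above. The distribution names are an
# assumption (the conda package "pytorch" is assumed to register as "torch").
EXPECTED = {"torch": "1.7.0", "torchvision": "0.8.0", "torchaudio": "0.7.0"}


def find_mismatches(expected: Dict[str, str],
                    get_version: Callable[[str], str] = metadata.version) -> List[str]:
    """Return human-readable problems; an empty list means every pin matched."""
    problems = []
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        # Builds may carry a local suffix such as "1.7.0+cu110"; compare the public part.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    issues = find_mismatches(EXPECTED)
    print("\n".join(issues) if issues else "All pinned packages match.")
```

Run it inside the activated `vt3` environment; any line it prints names a package whose installed version differs from the pin.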

Dataset Preparation

Download the dataset here. Move the downloaded content to the ./data_wikilandmark directory and unzip the files.

# go to directory
cd data_wikilandmark
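
The unzip step can also be scripted. This is a minimal sketch using only the Python standard library; the `unzip_all` helper is hypothetical, and it assumes the downloaded archives are `.zip` files placed directly inside `./data_wikilandmark`.

```python
import zipfile
from pathlib import Path
from typing import List


def unzip_all(data_dir: str = "./data_wikilandmark") -> List[str]:
    """Extract every .zip directly inside data_dir, in place; return the archive names."""
    root = Path(data_dir)
    extracted = []
    for archive in sorted(root.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Contents land next to the archive, keeping any internal folders.
            zf.extractall(root)
        extracted.append(archive.name)
    return extracted
```

Calling `unzip_all()` from the repository root extracts each archive next to itself and returns the names of the archives it processed.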

Image Feature Extraction

python extract_swin_features.py --image_dir ./images/ --output_dir ./image_features/
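
For orientation, the sketch below shows one way Swin features could be extracted with the Hugging Face `transformers` library. It is not the contents of `extract_swin_features.py`. The checkpoint name, the `.pt` output naming, and the choice to save `last_hidden_state` are all assumptions, and it needs a `transformers` release that provides `AutoImageProcessor`, which may not match the pinned environment.

```python
from pathlib import Path

# Assumed checkpoint; the repository's script may use a different Swin variant.
CHECKPOINT = "microsoft/swin-base-patch4-window7-224"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def feature_path(image_path: Path, output_dir: Path) -> Path:
    """Map images/foo.jpg to image_features/foo.pt (hypothetical naming scheme)."""
    return output_dir / (image_path.stem + ".pt")


def extract_all(image_dir: str, output_dir: str) -> int:
    """Encode every image in image_dir with Swin; return how many were saved."""
    # Heavy imports stay local so feature_path works without them installed.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, SwinModel

    processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
    model = SwinModel.from_pretrained(CHECKPOINT).eval()
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    count = 0
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # last_hidden_state holds one embedding per final-stage patch.
        features = outputs.last_hidden_state.squeeze(0)
        torch.save(features, str(feature_path(path, out)))
        count += 1
    return count
```

With these assumptions, `extract_all("./images/", "./image_features/")` mirrors the two directory arguments of the command above and writes one tensor file per image.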

Prepare dataset


Training & Evaluation

The scripts for training and evaluation are in the ./VT3 directory.

Download the trained checkpoint for VT3: Google Drive Link


# pretrain VT3 model and save checkpoint

# finetune VT3 model on wikilandmarks dataset using saved checkpoint


# perform inference on test data


If you find this work useful for your research, please consider citing.

  author    = "Gatti, Prajwal and
              Mishra, Anand and
              Gupta, Manish and
              Das Gupta, Mithun",
  title     = "VisToT: Vision-Augmented Table-to-Text Generation",
  booktitle = "EMNLP",
  year      = "2022",


This implementation is based on the code provided by https://github.com/yxuansu/PlanGen.
Code provided by https://github.com/j-min/VL-T5/blob/main/VL-T5/src/modeling_bart.py helped in implementing the VT3 transformer.
The Swin Transformer used for feature extraction was provided by https://huggingface.co/docs/transformers/model_doc/swin.


This code is released under the MIT license.