
VisToT

Code for the paper VisToT: Vision-Augmented Table-to-Text Generation (EMNLP 2022).

Project Page | Paper

Requirements

Set up the environment

# create a new conda environment
conda create -n vt3 python=3.8.13

# activate environment
conda activate vt3

# install pytorch
conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0 -c pytorch

# install other dependencies
pip install -r requirements.txt
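
After installation, an optional Python check (not part of the repository) can confirm that the expected versions resolved and that CUDA is visible:

# optional sanity check for the environment created above
import torch
import torchvision

print(torch.__version__)          # expected: 1.7.0
print(torchvision.__version__)    # expected: 0.8.0
print(torch.cuda.is_available())  # True if the CUDA 11.0 build is usable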

Dataset Preparation

Download the dataset here. Move the downloaded content to the ./data_wikilandmark directory and unzip the files.

# go to directory
cd data_wikilandmark
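
If helpful, the small Python sketch below (not part of the repository) extracts the downloaded archives once they have been moved into data_wikilandmark; the archive names are assumed to end in .zip.

# extract every .zip archive placed in this directory (hypothetical file names)
import glob
import zipfile

for archive in glob.glob("*.zip"):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(".")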

Image Feature Extraction

python extract_swin_features.py --image_dir ./images/ --output_dir ./image_features/
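
For reference, the sketch below shows roughly what this step computes, using the HuggingFace Swin model mentioned in the Acknowledgements. The checkpoint name and the saved-feature format are assumptions; extract_swin_features.py remains the authoritative implementation.

# minimal sketch of Swin feature extraction (assumed checkpoint and output format)
import os

import torch
from PIL import Image
from transformers import AutoFeatureExtractor, SwinModel

checkpoint = "microsoft/swin-base-patch4-window7-224"  # assumed checkpoint
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = SwinModel.from_pretrained(checkpoint).eval()

image_dir, output_dir = "./images/", "./image_features/"
os.makedirs(output_dir, exist_ok=True)

for fname in sorted(os.listdir(image_dir)):
    image = Image.open(os.path.join(image_dir, fname)).convert("RGB")
    inputs = extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.last_hidden_state: (1, num_patches, hidden_size) patch-level features
    torch.save(out.last_hidden_state.squeeze(0),
               os.path.join(output_dir, os.path.splitext(fname)[0] + ".pt"))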

Prepare dataset

./generate_dataset_files.sh

Training & Evaluation

The scripts for training and evaluation are in the ./VT3 directory.

Download the trained checkpoint for VT3: Google Drive Link

Training

# pretrain VT3 model and save checkpoint
./pretrain.sh

# finetune VT3 model on wikilandmarks dataset using saved checkpoint
./train.sh

Evaluation

# perform inference on test data
./perform_inference.sh

Cite

If you find this work useful for your research, please consider citing:

@inproceedings{vistot2022emnlp,
  author    = "Gatti, Prajwal and
               Mishra, Anand and
               Gupta, Manish and
               Das Gupta, Mithun",
  title     = "VisToT: Vision-Augmented Table-to-Text Generation",
  booktitle = "EMNLP",
  year      = "2022",
}

Acknowledgements

This implementation is based on the code provided by https://github.com/yxuansu/PlanGen.
Code provided by https://github.com/j-min/VL-T5/blob/main/VL-T5/src/modeling_bart.py helped in implementing the VT3 transformer.
The Swin Transformer used for feature extraction is from https://huggingface.co/docs/transformers/model_doc/swin.

License

This code is released under the MIT license.