Automatic Generation on Food Image Aesthetic Captioning

PyTorch implementation of the paper "To be an Artist: Automatic Generation on Food Image Aesthetic Captioning" (ICTAI 2020).

Code is provided as-is, no updates expected.

Requirements

Make sure the following are installed in your environment:

Python 3.5+
Java 1.8.0 (for computing METEOR and SPICE)

Then install requirements:

pip install -r requirements.txt
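The two prerequisites above can be checked up front, before a long test run fails at the metrics stage. The following is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only checks that a java executable is on PATH, not that its version is 1.8.0.

```python
import shutil
import sys


def check_environment(min_python=(3, 5)):
    """Return a list of human-readable problems; an empty list means the basics look fine.

    Hypothetical helper for illustration; it is not shipped with this repository.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # METEOR and SPICE are Java programs, so a java executable must be on PATH.
    # Assumption: only presence is checked here, not the exact Java version.
    if shutil.which("java") is None:
        problems.append("java not found on PATH (needed for METEOR and SPICE)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING: " + problem)
```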

Usage

Configuration

Hyperparameters and options are configured in config.py; see that file for details.

Preprocess

Preprocess the images along with their captions and store them locally:

python preprocess.py

Single-Aspect Captioning Module

The Single-Aspect Captioning Module generates a caption for each aesthetic attribute and learns that attribute's feature representation.

To train:

python single_train.py

To test and compute metrics, set beam_size in single_test.py, then run:

python single_test.py

To run inference, set image_path and beam_size in single_infer.py, then run:

python single_infer.py
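For readers unfamiliar with what a beam-size setting controls, the sketch below shows generic beam search on a toy next-token table. It is an illustration of the standard technique only, not the decoder used in single_test.py or single_infer.py; the names beam_search, step_fn, TOY, and toy_step are all hypothetical.

```python
import math


def beam_search(step_fn, start, end, beam_size, max_len=10):
    """Generic beam search (illustrative; not this repository's decoder).

    step_fn(sequence) returns a dict {token: probability} for the next token.
    Probabilities must be greater than zero.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the beam_size best partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished sequences if max_len is hit
    return max(finished, key=lambda c: c[1])[0]


# Toy model: the locally best first token ("a") leads to a worse full sequence.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def toy_step(seq):
    return TOY[seq[-1]]


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
wider = beam_search(toy_step, "<s>", "</s>", beam_size=2)
```

In this toy example a beam of 1 commits to "a" and ends with total probability 0.3, while a beam of 2 keeps "b" alive and finds the sequence with probability 0.36. Larger beams explore more candidates at a higher computational cost.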

Multi-Aspect Captioning Module

The Multi-Aspect Captioning Module learns the associations among all feature representations and automatically aggregates the captions of all aesthetic attributes into a final sentence.

To train:

python multi_train.py

To test and compute metrics, set model_path and multi_beam_k in multi_test.py, then run:

python multi_test.py

To run inference, set image_path and multi_beam_k in multi_infer.py, then run:

python multi_infer.py

Dataset

A dataset for food image aesthetic captioning was constructed to evaluate the proposed method; see here for details.

Notes

Following the experimental settings of a previous work, we first pre-trained the single-aspect captioning module on the MSCOCO image captioning dataset and then fine-tuned it on our dataset.
The load_embeddings method (in src/utils/embedding.py) tries to cache the loaded embeddings under the folder dataset_output_path, which makes subsequent loads much faster.
You will first need to download the Stanford CoreNLP 3.6.0 code and models for use by SPICE. To do this, run: cd src/metrics && bash get_stanford_models.sh.
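The embedding cache mentioned in the notes above follows a common load-or-build pattern. The sketch below illustrates that pattern only; it is not the code in src/utils/embedding.py. The names load_with_cache and parse_text_embeddings, the use of pickle, and the cache file naming are all assumptions made for the example.

```python
import os
import pickle


def load_with_cache(emb_file, cache_dir, parse_fn):
    """Load embeddings via parse_fn(emb_file), reusing a pickle cache when present.

    Hypothetical helper; the repository's load_embeddings may work differently.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(emb_file) + ".cache.pkl")

    if os.path.isfile(cache_path):
        # Fast path: skip parsing the (typically large) text file entirely.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Slow path: parse once, then store the result for the next run.
    embeddings = parse_fn(emb_file)
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


def parse_text_embeddings(path):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) > 1:
                vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

With a cache like this, a stale file is never refreshed automatically, so after switching to a different embedding file with the same name the cached copy would need to be deleted by hand.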

Acknowledgements

The implementation of the single-aspect captioning module is based on sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning.
The implementation of the multi-aspect captioning module is based on zphang/usc_dae.
The implementation of the evaluation metrics is adopted from ruotianluo/coco-caption.

Renovamen/Food-IAC