/control-generate-augment

Repository for Giuseppe Russo's master thesis code.

Primary language: Python

Control, Generate and Augment: A Scalable Framework for Multi-Attributes Controlled Text Generation

Introduction

PyTorch code for the Findings of EMNLP 2020 paper "Control, Generate and Augment: A Scalable Framework for Multi-Attributes Controlled Text Generation". The camera-ready version of the paper is accessible here.

Data Download

Please download the YELP restaurant review data from here and the IMDB 50K movie review data from here. To preprocess the data, follow the procedure described in the supplementary materials of the paper "Control, Generate and Augment: A Scalable Framework for Multi-Attributes Controlled Text Generation".

Data Preprocessing

To obtain the multi-attribute dataset, first run

python TenseLabeling.py

and then

python PronounLabeling.py
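As a rough illustration of what the second labeling step could look like, the sketch below assigns a pronoun label to a sentence by counting singular and plural pronouns. This is not the code in PronounLabeling.py: the pronoun lists, the label names (singular, plural, neutral) and the function name label_pronoun are all assumptions made for this example.

```python
import re

# Hypothetical pronoun lists; the lists used by PronounLabeling.py may differ.
SINGULAR = {"i", "me", "my", "mine", "myself", "he", "him", "his",
            "she", "her", "hers", "it", "its"}
PLURAL = {"we", "us", "our", "ours", "ourselves",
          "they", "them", "their", "theirs"}


def label_pronoun(sentence: str) -> str:
    """Return an assumed pronoun label: 'singular', 'plural' or 'neutral'."""
    # Lowercase and keep alphabetic tokens only, so "I'm" yields "i" and "m".
    tokens = re.findall(r"[a-z]+", sentence.lower())
    n_singular = sum(token in SINGULAR for token in tokens)
    n_plural = sum(token in PLURAL for token in tokens)
    if n_singular > n_plural:
        return "singular"
    if n_plural > n_singular:
        return "plural"
    # No pronouns at all, or a tie between the two groups.
    return "neutral"
```

A word-list heuristic like this is cheap to run over a full review corpus, but it cannot resolve ambiguous cases, which is why ties fall back to the neutral label here.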

Model

Training

To train the model, run

python Analysis.py

All the parameters needed to obtain the results reported in the paper are set as default values. The trained model is saved in the bin folder. Its file name is the date and time at which the experiment was started.
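The naming scheme for saved models can be sketched as follows. The exact timestamp format and the .pt extension are assumptions for illustration, as is the helper name checkpoint_path; check the bin folder after a run to see the format the training script actually produces.

```python
from datetime import datetime
from pathlib import Path


def checkpoint_path(start_time: datetime, folder: str = "bin") -> Path:
    """Build a checkpoint path named after the experiment start time."""
    # Assumed format: year-month-day_hour-minute-second.
    stamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    # The ".pt" extension is also an assumption.
    return Path(folder) / f"{stamp}.pt"


# Example: a run started now would be stored under bin/.
path = checkpoint_path(datetime.now())
```

Naming checkpoints by start time keeps repeated runs from overwriting each other, at the cost of having to remember which run used which settings.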

Generation

To generate new sentences simply run

python generation.py

With the default parameters, this script generates sentences for all possible combinations of attributes. To generate sentences for specific attributes only, specify the desired attributes.
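To make "all possible combinations of attributes" concrete, the sketch below enumerates them with itertools.product. The attribute names and values in ATTRIBUTES are hypothetical placeholders, and all_attribute_combinations is not a function from generation.py.

```python
from itertools import product

# Hypothetical attributes and values, chosen only for illustration.
ATTRIBUTES = {
    "sentiment": ["negative", "positive"],
    "tense": ["past", "present"],
    "pronoun": ["singular", "plural", "neutral"],
}


def all_attribute_combinations(attributes):
    """Return one dict per combination of attribute values."""
    names = list(attributes)
    value_lists = [attributes[name] for name in names]
    # product() yields the Cartesian product of the value lists.
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = all_attribute_combinations(ATTRIBUTES)
```

With these placeholder values there are 2 x 2 x 3 = 12 combinations. Restricting generation to specific attributes amounts to filtering this list.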

Evaluation

All evaluation scripts are in the Evalution folder.

Data Augmentation

For the data augmentation evaluation, run

python AugmentData.py

to generate all combinations of augmented data for each of the starting training set sizes used in the paper. Afterwards, run

python GPU_DAE.py

to obtain the validation and test results for the data augmentation experiment.

Attribute Matching

Run the script

python AttrMatch.py

to obtain the attribute matching accuracies for the generated sentences.
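As a minimal sketch of the metric, attribute matching accuracy can be read as the fraction of generated sentences whose predicted attribute equals the attribute requested at generation time. The function below assumes the predictions have already been produced by some attribute classifier. It is not the code in AttrMatch.py, and the function name is hypothetical.

```python
def attribute_matching_accuracy(requested, predicted):
    """Fraction of sentences whose predicted label equals the requested one."""
    if len(requested) != len(predicted):
        raise ValueError("requested and predicted must have the same length")
    if not requested:
        raise ValueError("at least one sentence is required")
    matches = sum(r == p for r, p in zip(requested, predicted))
    return matches / len(requested)


# Example: three of four generated sentences carry the requested tense.
accuracy = attribute_matching_accuracy(
    ["past", "present", "past", "past"],
    ["past", "present", "present", "past"],
)
```

For multi-attribute control, the same function can be applied once per attribute to get one accuracy per attribute.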

Sentence Embedding Similarity

python UniversalSentenceEvaluator.py
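The script name suggests that sentences are compared through their sentence embeddings. Assuming the embeddings are already available as plain vectors, a common way to compare two of them is cosine similarity, sketched below. Which encoder the script uses, and whether it relies on cosine similarity at all, is an assumption here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real sentence embeddings are much larger.
similarity = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
```

Values close to 1 indicate embeddings pointing in nearly the same direction, and values close to 0 indicate unrelated directions.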

Model Checkpoints

The Generated folder contains examples of our generated sentences, produced with both single-attribute and multi-attribute control. The model checkpoints for each of these experiments are also provided, together with the parameters used.

Reproducibility Information

Description of computing infrastructure used:

All models presented in this work were implemented in PyTorch, and trained and tested on single Titan XP GPUs with 12GB memory.

Average runtime for each approach:

The average runtime was 07:26:14 for the model trained with YELP. The average runtime was 04:09:54 for the model trained with IMDB.

Number of parameters for each model:

Dataset | S-VAE (Generator) | Discriminator
YELP    | 3,417,176         | 4,452
IMDB    | 4,433,176         | 4,470