This repo contains the code for the paper Hierarchical Sketch Induction for Paraphrase Generation, by Tom Hosking, Hao Tang & Mirella Lapata (ACL 2022).
We propose a generative model of paraphrase generation that encourages syntactic diversity by conditioning on an explicit syntactic sketch. We introduce Hierarchical Refinement Quantized Variational Autoencoders (HRQ-VAE), a method for learning decompositions of dense encodings as a sequence of discrete latent variables that make iterative refinements of increasing granularity. This hierarchy of codes is learned through end-to-end training, and represents fine-to-coarse grained information about the input. We use HRQ-VAE to encode the syntactic form of an input sentence as a path through the hierarchy, allowing us to more easily predict syntactic sketches at test time. Extensive experiments, including a human evaluation, confirm that HRQ-VAE learns a hierarchical representation of the input space, and generates paraphrases of higher quality than previous systems.
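For intuition, here is a minimal PyTorch sketch of the hierarchical residual quantization idea: each level quantizes the residual left over by the previous levels, so earlier codes are coarser and later codes refine them. The class name and hyperparameters are illustrative only, and this uses hard nearest-neighbour assignment rather than the training procedure from the paper; the real implementation is in `./src/hrq_vae.py`.

```python
import torch
import torch.nn as nn

class HRQSketch(nn.Module):
    """Toy hierarchical residual quantizer (illustrative, not the paper's code)."""

    def __init__(self, depth=3, codebook_size=16, dim=64):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(depth)
        )

    def forward(self, z):
        # z: (batch, dim) dense encoding to decompose into discrete codes
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)                      # discrete code at this level
            q = codebook(idx)
            quantized = quantized + q                       # refine the approximation
            residual = residual - q                         # pass the remainder down
            codes.append(idx)
        # The sum over levels approximates z; `codes` is the path through the hierarchy
        return quantized, codes

# Decompose two random encodings into 3-level paths of discrete codes
quantized, codes = HRQSketch()(torch.randn(2, 64))
```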
First, create a fresh virtualenv and install TorchSeq and other dependencies:
```bash
pip install -r requirements.txt
```
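For example, a fresh environment could be set up like this (the environment name is arbitrary):

```bash
python3 -m venv hrqvae-env        # any virtualenv tool works
source hrqvae-env/bin/activate
pip install -r requirements.txt
```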
Then download (or create) the datasets/checkpoints you want to work with:
- Download a pretrained checkpoint for Paralex
- Download a pretrained checkpoint for QQP
- Download a pretrained checkpoint for MSCOCO
Checkpoint zip files should be unzipped into `./models`, e.g. `./models/hrqvae_qqp`. Data zip files should be unzipped into `./data/`.
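For example (the archive names here are hypothetical; use whichever checkpoint/data archives you downloaded):

```bash
unzip hrqvae_qqp.zip -d ./models/   # hypothetical checkpoint archive
unzip qqp_data.zip -d ./data/       # hypothetical data archive
```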
Note: Paralex was originally scraped from WikiAnswers, so many of the Paralex models and datasets are labelled as 'wa' or WikiAnswers.
To replicate our results (e.g. for QQP), have a look at the example in `./examples/Replication-QQP.ipynb`.
You can also run the model on your own data:
```python
import json

import torch

from torchseq.agents.para_agent import ParaphraseAgent
from torchseq.datasets.json_loader import JsonDataLoader
from torchseq.utils.config import Config

# Which checkpoint should we load?
path_to_model = './models/hrqvae_paralex/'
path_to_data = './data/'

# Define the data
examples = [
    {'input': 'What is the income for a soccer player?'},
    {'input': 'What do soccer players earn?'},
]

# Change the config to use the custom dataset
with open(path_to_model + "/config.json") as f:
    cfg_dict = json.load(f)
cfg_dict["dataset"] = "json"
cfg_dict["json_dataset"] = {
    "path": None,
    "field_map": [
        {"type": "copy", "from": "input", "to": "s2"},
        {"type": "copy", "from": "input", "to": "s1"},
    ],
}

# Enable the code predictor, so that sketch codes are inferred at test time
cfg_dict["bottleneck"]["code_predictor"]["infer_codes"] = True

# Create the dataset and model
config = Config(cfg_dict)
data_loader = JsonDataLoader(config, test_samples=examples, data_path=path_to_data)
checkpoint_path = path_to_model + "/model/checkpoint.pt"
instance = ParaphraseAgent(
    config=config,
    run_id=None,
    output_path=None,
    data_path=path_to_data,
    silent=True,
    verbose=False,
    training_mode=False,
)

# Load the checkpoint
instance.load_checkpoint(checkpoint_path)
instance.model.eval()

# Finally, run inference
_, _, (pred_output, _, _), _ = instance.inference(data_loader.test_loader)

print(pred_output)
# ['what is the salary for a soccer player?', 'what do soccer players earn?']
```
If you want to generate multiple diverse paraphrases for each input (aka 'top-k' inference), have a look at `./examples/topk.ipynb`.
Train a fresh checkpoint using:

```bash
torchseq --train --config ./configs/hrqvae_paralex.json
```
To use a different dataset, you will need to generate a total of 4 datasets. These should be folders in `./data`, containing `{train,dev,test}.jsonl` files. An example of this process is given in `./scripts/MSCOCO.ipynb`.

First, you need a dataset in 'cluster' format, where each line groups together all the paraphrases in one cluster:
{"qs": ["What are some good science documentaries?", "What is a good documentary on science?", "What is the best science documentary you have ever watched?", "Can you recommend some good documentaries in science?", "What the best science documentaries?"]}
{"qs": ["What do we use water for?", "Why do we, as human beings, use water for?"]}
...
You also need a 'flattened' version of each split, with one sentence per line. The sentences must be in the same order as in the cluster dataset!
{"q": "Can you recommend some good documentaries in science?"}
{"q": "What the best science documentaries?"}
{"q": "What do we use water for?"}
...
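The flattened file can be derived mechanically from the cluster file; a minimal sketch (with illustrative paths) might look like this:

```python
import json

# Illustrative paths; point these at your own splits
with open("./data/mydataset-clusters/train.jsonl") as f_in, \
        open("./data/mydataset-flat/train.jsonl", "w") as f_out:
    for line in f_in:
        cluster = json.loads(line)
        # One sentence per line, preserving the cluster order
        for sentence in cluster["qs"]:
            f_out.write(json.dumps({"q": sentence}) + "\n")
```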
Next, generate the training triples, using the following command for question datasets:

```bash
python3 ./scripts/generate_training_triples.py --use_diff_templ_for_sem --rate 1.0 --sample_size 26 --extended_stopwords --real_exemplars --exhaustive --template_dropout 0.3 --dataset qqp-clusters --min_samples 0
```

or this command for other datasets:

```bash
python3 ./scripts/generate_training_triples.py --use_diff_templ_for_sem --rate 1.0 --sample_size 26 --pos_templates --extended_stopwords --no_stopwords --real_exemplars --exhaustive --template_dropout 0.3 --dataset mscoco-clusters --min_samples 0
```
Replace `qqp-clusters` with the path to your dataset in 'cluster' format.
For each cluster, select a single sentence to use as the input (assigned to `sem_input`) and add all the other references to `paras`. `tgt` and `syn_input` should be set to one of the references. A simplified sketch of this transformation is shown after the example below.
{"tgt": "What are some good science documentaries?", "syn_input": "What are some good science documentaries?", "sem_input": "Can you recommend some good documentaries in science?", "paras": ["What are some good science documentaries?", "What the best science documentaries?", "What is the best science documentary you have ever watched?", "What is a good documentary on science?"]}
{"tgt": "What do we use water for?", "syn_input": "What do we use water for?", "sem_input": "Why do we, as human beings, use water for?", "paras": ["What do we use water for?"]}
...
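If the provided script doesn't fit your data, the core of the transformation can be approximated like this. This is a simplified sketch with illustrative paths and random selection; the real `./scripts/generate_training_triples.py` also handles exemplar selection, template dropout and sampling.

```python
import json
import random

def clusters_to_triples(cluster_path, triples_path):
    """Build simple training triples from a cluster-format file (sketch only)."""
    with open(cluster_path) as f_in, open(triples_path, "w") as f_out:
        for line in f_in:
            sentences = json.loads(line)["qs"]
            if len(sentences) < 2:
                continue  # need at least one input and one reference
            sem_input = random.choice(sentences)            # the input sentence
            paras = [s for s in sentences if s != sem_input]  # all other references
            tgt = random.choice(paras)                      # one of the references
            f_out.write(json.dumps({
                "tgt": tgt,
                "syn_input": tgt,   # set to one of the references
                "sem_input": sem_input,
                "paras": paras,
            }) + "\n")

clusters_to_triples("./data/mydataset-clusters/train.jsonl",
                    "./data/mydataset-triples/train.jsonl")
```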
Have a look at the config files, e.g. `configs/hrqvae_qqp.json`, and update all the references to the different datasets, then run:

```bash
torchseq --train --config ./configs/hrqvae_mydataset.json
```
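One low-tech way to produce such a config is to copy an existing one and substitute the dataset names. This assumes your dataset folders follow the same naming pattern as the originals; the `'qqp' -> 'mydataset'` substitution is illustrative.

```python
# Start from an existing config and swap every dataset reference
with open("./configs/hrqvae_qqp.json") as f:
    cfg = f.read()

with open("./configs/hrqvae_mydataset.json", "w") as f:
    f.write(cfg.replace("qqp", "mydataset"))
```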
Have a look at `./src/hrq_vae.py` for our implementation.
If you use this code or find our work useful, please cite:

```
@misc{hosking2022hierarchical,
    title={Hierarchical Sketch Induction for Paraphrase Generation},
    author={Tom Hosking and Hao Tang and Mirella Lapata},
    year={2022},
    eprint={2203.03463},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```