Pseudo-references Generation

This project is an implementation of the pseudo-reference generation algorithm proposed in the EMNLP 2018 paper: Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation.

Dependencies

  • python: 2.7 (when using the example bi-directional language model)
  • python: 3.6 (when using the pre-trained ELMo bi-directional language model)
  • allennlp: 0.7.0 (when using the pre-trained ELMo bi-directional language model)
  • pytorch: 0.3.1
  • torchtext: 0.2.1
  • networkx: 2.0
  • numpy: 1.13.3
  • sklearn: 0.19.1
  • matplotlib: 2.1.0
  • scipy: 0.19.1
  • nltk: 3.2.4

Lattice Generation

This project includes both hard and soft word alignment algorithms for generating lattices. You can use the coco-caption dataset or the small dataset extracted from coco that we provide at 'data/dataset_small.json'. You can generate lattices with the hard or soft word alignment algorithm via the following example commands under Python 2.7.

python lattice.py -order_method hard -align_method hard -dataset data/dataset_small.json -minus 0.5
python lattice.py -order_method soft -align_method soft -dataset data/dataset_small.json -minus 0.6 -lm_dictionary data/LM_coco.dict -lm_model data/LM_coco.pth
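
For intuition, hard word alignment only merges identical words across references. The sketch below is an illustration of this idea (not the repository's exact code; it is restricted to two references and aligns them by longest common subsequence):

# Illustrative sketch of hard word alignment between two references:
# align tokens via longest common subsequence, so only identical words
# are merged into shared lattice nodes.
def hard_align(ref_a, ref_b):
    n, m = len(ref_a), len(ref_b)
    # lcs[i][j] = length of the LCS of ref_a[:i] and ref_b[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if ref_a[i] == ref_b[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    # Backtrack to recover the matched (i, j) word pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if ref_a[i - 1] == ref_b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif lcs[i - 1][j] >= lcs[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

print(hard_align("a man riding a horse".split(),
                 "a man on a brown horse".split()))
# -> [(0, 0), (1, 1), (3, 3), (4, 5)]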

If you want to use the ELMo language model (Deep contextualized word representations), which is pre-trained on a larger corpus, you can use the following example command with Python 3.6 (ELMo from allennlp only supports Python 3.6). You must download the ELMo weight files before using it.

cd ./data
wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json
wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
cd ..
python lattice.py -order_method soft -align_method soft -dataset data/dataset_small.json -minus 0.6 -use_elmo

Options:

  • align_method [soft|hard]
    • Soft or hard alignment described in the paper
  • order_method [soft|hard|random]
    • How the original sentences are sorted before merging.
  • minus
    • Global penalty $p$ described in the paper, used to avoid merging unrelated words.
  • gpuid
    • Set GPU for soft word alignment
  • save_graph
    • If True, the lattice graph will be saved instead of pseudo-references
  • multi_process
    • Enable multi-processing to speed up generation
  • n_cpu
    • Number of threads to use when multi-processing is enabled
  • dataset
    • We provide a small dataset 'data/dataset_small.json' extracted from coco-caption. It contains only three examples, including the first example from the dev set, which will be used by the pseudo-reference generation algorithm.
  • lm_model
    • The language model used in soft word alignment algorithm
  • lm_dictionary
    • The dictionary file of the language model used in soft word alignment algorithm
  • use_elmo
    • Use the pre-trained ELMo language model from AllenNLP

The output will be a json file saved as 'data/dataset_(ORDER_METHOD)_(ALIGNMENT_METHOD)_(MINUS).json'.

Bidirectional Language Model

A bidirectional language model is used in the soft word alignment algorithm. Our implementation is included in the folder 'language_model'. We provide a model trained on MSCOCO at 'data/LM_coco.pth', along with its corresponding dictionary at 'data/LM_coco.dict'. (Please note that this language model is slightly different from the one used in the paper, so the output lattices may differ.)
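
For reference, here is a minimal sketch of what such a bidirectional language model can look like. The module name, layer sizes, and layout are illustrative assumptions, not the exact architecture stored in 'data/LM_coco.pth', and the code is written against a recent PyTorch for clarity (the repository itself pins pytorch 0.3.1):

# Illustrative sketch of a bidirectional language model: a forward LSTM is
# trained to predict the next word and a backward LSTM the previous word;
# the concatenated hidden states give contextual word vectors that can be
# compared during soft alignment. Sizes and names are assumptions.
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super(BiLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # shared output layer

    def forward(self, tokens):  # tokens: (batch, seq_len) word ids
        emb = self.embed(tokens)
        fwd, _ = self.fwd_lstm(emb)                   # left-to-right states
        bwd, _ = self.bwd_lstm(torch.flip(emb, [1]))  # right-to-left states
        bwd = torch.flip(bwd, [1])
        # Contextual representation of each word: both directions concatenated.
        context = torch.cat([fwd, bwd], dim=-1)       # (batch, seq, 2*hid)
        # Logits for next-word / previous-word prediction (targets are
        # shifted accordingly during training).
        return self.out(fwd), self.out(bwd), context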

We also provide ELMo as an alternative bidirectional language model, so the soft word alignment algorithm can be used for more general purposes. You can find pre-trained ELMo models for other languages via this repository.
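
As a sketch of how contextual embeddings feed the soft alignment, the snippet below embeds two references with allennlp 0.7's ElmoEmbedder and scores word pairs by cosine similarity minus the global penalty p. This is a simplified illustration under our own reading of the paper, not the repository's code; the example sentences, the layer averaging, and the merge rule are assumptions:

# Simplified sketch of soft word alignment with ELMo (Python 3.6):
# consider merging a word pair when cosine similarity minus p is positive.
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder(
    'data/elmo_2x4096_512_2048cnn_2xhighway_options.json',
    'data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5')

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

ref_a = 'a man riding a horse'.split()
ref_b = 'a person rides a horse'.split()
# embed_sentence returns (3 layers, num_tokens, 1024); averaging the layers
# to get one contextual vector per word is an illustrative choice.
vecs_a = elmo.embed_sentence(ref_a).mean(axis=0)
vecs_b = elmo.embed_sentence(ref_b).mean(axis=0)

p = 0.6  # global penalty, cf. the -minus option
for i, va in enumerate(vecs_a):
    for j, vb in enumerate(vecs_b):
        if cosine(va, vb) - p > 0:
            print('merge candidate: %s <-> %s' % (ref_a[i], ref_b[j]))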

Lattice Visualization

We provide code to visualize a generated lattice by converting it into LaTeX. For example, you can use the following command to print the lattice at index 1 (indices start from 0) in the json file data/dataset_soft_soft_0.60.json, which is generated from data/dataset_small.json.

python lattice2latex.py -original_dataset data/dataset_small.json -lattice_dataset data/dataset_soft_soft_0.60.json -lattice_index 1

Generation Speed

If the input sentences are quite long, the depth-first search can take a long time. Here are several suggestions for that case:

  1. Start from a high threshold for 'minus', or use the dynamic threshold described in the paper.
  2. When doing the DFS, memoize the prefix from the root to the current node. This speeds up the algorithm but uses much more memory.
  3. Once the lattice graph is generated, you can first calculate the number of paths traversing each edge, then treat the graph as a probabilistic graph and sample N (e.g. 100) trajectories; see the sketch below. This runs in linear time.
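
Here is a minimal sketch of suggestion 3, assuming the lattice is a networkx DiGraph that is a DAG with a single source and a single sink (the function names and the use of Python 3's random.choices are our assumptions, not the repository's code):

# Sketch of suggestion 3: instead of enumerating all paths by DFS, count
# paths to the sink per node, then sample trajectories edge by edge with
# probability proportional to how many paths continue through each edge.
# (The number of paths through edge (u, v) equals paths(source->u) times
# paths(v->sink); for sampling, the node-to-sink counts are sufficient.)
import random
import networkx as nx

def count_paths_to_sink(graph, sink):
    # Process nodes in reverse topological order, so every successor's
    # count is known before its predecessors are visited.
    counts = {sink: 1}
    for node in reversed(list(nx.topological_sort(graph))):
        if node != sink:
            counts[node] = sum(counts[s] for s in graph.successors(node))
    return counts

def sample_paths(graph, source, sink, n=100, seed=0):
    rng = random.Random(seed)
    counts = count_paths_to_sink(graph, sink)
    paths = []
    for _ in range(n):
        node, path = source, [source]
        while node != sink:
            succs = list(graph.successors(node))
            # Choosing successors proportionally to their path counts
            # samples complete source-to-sink paths uniformly.
            node = rng.choices(succs, weights=[counts[s] for s in succs])[0]
            path.append(node)
        paths.append(path)
    return paths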