This project is the implementation of the pseudo-reference generation algorithm proposed in the EMNLP 2018 paper: Multi-Reference Training with Pseudo-References for Neural Translation and Text Generation.
- python: 2.7
- pytorch: 0.3.1
- torchtext: 0.2.1
- networkx: 2.0
- numpy: 1.13.3
- sklearn: 0.19.1
- matplotlib: 2.1.0
- scipy: 0.19.1
- nltk: 3.2.4
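For reference, the Python dependencies can typically be installed with pip (a sketch only; availability of these exact legacy versions may depend on your platform):

pip install torch==0.3.1 torchtext==0.2.1 networkx==2.0 numpy==1.13.3 scikit-learn==0.19.1 matplotlib==2.1.0 scipy==0.19.1 nltk==3.2.4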
This project includes both hard and soft word alignment algorithms for generating lattices. You can use the coco-caption dataset or the small dataset extracted from coco that we provide at 'data/dataset_small.json'. You can generate lattices with the hard or soft word alignment algorithm via the following example commands.
python lattice.py -order_method hard -align_method hard -dataset data/dataset_small.json -minus 0.5
python lattice.py -order_method soft -align_method soft -dataset data/dataset_small.json -minus 0.6 -lm_dictionary data/LM_coco.dict -lm_model data/LM_coco.pth
- align_method [soft|hard]
- The soft or hard word alignment algorithm described in the paper
- order_method [soft|hard|random]
- How to sort the original sentences before merging
- minus
- The global penalty $p$ described in the paper, used to avoid merging unrelated words
- gpuid
- The GPU to use for soft word alignment
- save_graph
- If True, the lattice graph will be saved instead of pseudo-references
- multi_process
- Enable multi-processing to speed up generation
- n_cpu
- The number of threads to use in multi-processing
- dataset
- We provide a small dataset, 'data/dataset_small.json', extracted from coco-caption. It contains only three examples; the first one, drawn from the dev set, is the one processed by the pseudo-reference generation algorithm (see the inspection sketch after this list).
- lm_model
- The language model used in the soft word alignment algorithm
- lm_dictionary
- The dictionary file of the language model used in the soft word alignment algorithm
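If you want to peek at the small dataset before running the pipeline, a minimal Python sketch is shown below. It assumes only that the file is valid JSON; the exact fields follow the coco-caption format used by this repo.

```python
import json

# Inspect the small dataset before running lattice.py.
with open('data/dataset_small.json') as f:
    dataset = json.load(f)

if isinstance(dataset, dict):
    print(dataset.keys())   # top-level keys of the annotation file
else:
    print(dataset[0])       # first of the three provided examples
```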
The output will be a JSON file saved as 'data/dataset_(ORDER_METHOD)_(ALIGN_METHOD)_(MINUS).json'.
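For example, under this scheme the second command above writes its output to 'data/dataset_soft_soft_0.60.json'. A hypothetical one-liner reproducing the pattern (not copied from lattice.py):

```python
# Hypothetical reconstruction of the output naming scheme.
order_method, align_method, minus = 'soft', 'soft', 0.6
print('data/dataset_%s_%s_%.2f.json' % (order_method, align_method, minus))
# -> data/dataset_soft_soft_0.60.json
```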
A bidirectional language model is used in the soft word alignment algorithm. Our implementation is included in the folder 'language_model'. We provide a model trained on MSCOCO at 'data/LM_coco.pth' and its corresponding dictionary at 'data/LM_coco.dict'. (Please note that this language model is slightly different from the one used in the paper, so the output lattices may differ.)
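If you want to inspect the provided model outside of lattice.py, a minimal sketch follows. It assumes the .pth file was saved with torch.save, the usual PyTorch convention; the .dict format is internal to this repo, and lattice.py handles loading both files itself.

```python
import torch

# Load the pretrained bidirectional LM checkpoint on CPU.
# Whether this yields a full model object or a state_dict depends on
# how the checkpoint was saved.
lm = torch.load('data/LM_coco.pth', map_location=lambda storage, loc: storage)
print(type(lm))
```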
We provide code to visualize a generated lattice by converting it into LaTeX. For example, you can use the following command to print the lattice at index 1 (indices start from 0) in the JSON file data/dataset_soft_soft_0.60.json, which was generated from data/dataset_small.json.
python lattice2latex.py -original_dataset data/dataset_small.json -lattice_dataset data/dataset_soft_soft_0.60.json -lattice_index 1
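Assuming the script writes the LaTeX source to standard output, you can redirect it to a file and compile it with your LaTeX toolchain (if the output is a fragment rather than a full document, wrap it in a document preamble first):

python lattice2latex.py -original_dataset data/dataset_small.json -lattice_dataset data/dataset_soft_soft_0.60.json -lattice_index 1 > lattice_1.tex
pdflatex lattice_1.tex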