Pun Generation with Surprise

This repo contains code and data for the paper Pun Generation with Surprise.

Requirements

  • Python 3.6
  • PyTorch 0.4
    conda install pytorch=0.4.0 torchvision -c pytorch
  • Fairseq(-py)
    git clone -b pungen https://github.com/hhexiy/fairseq.git
    cd fairseq
    pip install -r requirements.txt
    python setup.py build develop
  • Pretrained WikiText-103 language model from Fairseq
    curl --create-dirs --output models/wikitext/model https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2
    tar xjf models/wikitext/model -C models/wikitext
    rm models/wikitext/model

Training

Word relatedness model

We approximate relatedness between a pair of words with a long-distance skip-gram model trained on BookCorpus sentences. The raw BookCorpus data is parsed by scripts/preprocess_raw_text.py; see sample_data/bookcorpus/raw/train.txt for a sample of the expected format.

Preprocess the BookCorpus data:

python -m pungen.wordvec.preprocess --data-dir data/bookcorpus/skipgram \
	--corpus data/bookcorpus/raw/train.txt \
	--min-dist 5 --max-dist 10 --threshold 80 \
	--vocab data/bookcorpus/skipgram/dict.txt
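
To make the "long-distance" pairing concrete, here is a minimal sketch of how such training pairs can be extracted, mirroring the --min-dist/--max-dist flags above. It is an illustration only, not the actual logic of pungen.wordvec.preprocess.

# Illustrative extraction of long-distance skip-gram pairs: a target word is
# paired with context words that are between min_dist and max_dist positions away.
def long_distance_pairs(tokens, min_dist=5, max_dist=10):
    for i, target in enumerate(tokens):
        for j, context in enumerate(tokens):
            if min_dist <= abs(i - j) <= max_dist:
                yield target, context

sentence = "the old man went fishing by the quiet river before dawn".split()
print(list(long_distance_pairs(sentence))[:5])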

Train skip-gram model:

python -m pungen.wordvec.train --weights --cuda --data data/bookcorpus/skipgram/train.bin \
    --save_dir models/bookcorpus/skipgram \
    --mb 3500 --epoch 15 \
    --vocab data/bookcorpus/skipgram/dict.txt

Edit model

The edit model takes a word and a template (a masked sentence) and combines the two into a coherent sentence.

Preprocess data:

for split in train valid; do \
	PYTHONPATH=. python scripts/make_src_tgt_files.py -i data/bookcorpus/raw/$split.txt \
        -o data/bookcorpus/edit/$split --delete-frac 0.5 --window-size 2 --random-window-size; \
done
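
To make the src/tgt format concrete, here is a rough sketch of the idea: mask out a word (and a window of neighbors) to form a template, and train the model to reconstruct the original sentence from the deleted word plus the template. The placeholder and separator tokens are illustrative assumptions; the exact behavior of scripts/make_src_tgt_files.py (e.g., what --delete-frac and --random-window-size control) may differ.

# Illustration only: build one (src, tgt) training pair for the edit model.
import random

def make_example(tokens, window_size=2):
    i = random.randrange(len(tokens))                    # word to delete
    lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
    template = tokens[:lo] + ["<placeholder>"] + tokens[hi:]
    deleted = tokens[i]
    src = [deleted, "<sep>"] + template                  # deleted word + masked template
    tgt = tokens                                         # original sentence
    return " ".join(src), " ".join(tgt)

print(make_example("he went to the bank to deposit his paycheck".split()))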

python -m pungen.preprocess --source-lang src --target-lang tgt \
	--destdir data/bookcorpus/edit/bin/data --thresholdtgt 80 --thresholdsrc 80 \
	--validpref data/bookcorpus/edit/valid \
	--trainpref data/bookcorpus/edit/train \
	--workers 8

Training:

python -m pungen.train data/bookcorpus/edit/bin/data -a lstm \
    --source-lang src --target-lang tgt \
    --task edit --insert deleted --combine token \
    --criterion cross_entropy \
    --encoder lstm --decoder-attention True \
    --optimizer adagrad --lr 0.01 --lr-scheduler reduce_lr_on_plateau --lr-shrink 0.5 \
    --clip-norm 5 --max-epoch 50 --max-tokens 7000 --no-epoch-checkpoints \
    --save-dir models/bookcorpus/edit/deleted --no-progress-bar --log-interval 5000

Retriever

Build a sentence retriever based on BookCorpus. The input file should contain one tokenized sentence per line.

python -m pungen.retriever --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --path models/bookcorpus/retriever.pkl --overwrite
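
For generation, the retriever only needs to answer queries of the form "find sentences containing word w". A minimal inverted-index sketch of that interface is shown below; it illustrates the idea, not the implementation stored in retriever.pkl.

# Toy keyword retriever over a file with one tokenized sentence per line.
from collections import defaultdict

def build_index(path):
    sentences, index = [], defaultdict(set)
    with open(path) as f:
        for sid, line in enumerate(f):
            tokens = line.split()
            sentences.append(tokens)
            for tok in tokens:
                index[tok].add(sid)
    return sentences, index

def retrieve(word, sentences, index, k=5):
    return [" ".join(sentences[sid]) for sid in sorted(index.get(word, []))[:k]]

# Example:
# sentences, index = build_index("data/bookcorpus/raw/sent.tokenized.txt")
# print(retrieve("butter", sentences, index))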

Analyze what makes a pun funny

Compute the correlation between local-global surprise scores and human funniness ratings. We provide our annotated dataset in data/funniness_annotation:

  • analysis_pun_scores.txt: sentences annotated with funniness scores from 1 to 5.
  • analysis_zscored_pun_scores.txt: the same data where scores are standardized for each annotator.

python eval_scoring_func.py --human-eval data/funniness_annotation/analysis_zscored_pun_scores.txt \
    --lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
    --skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/sgns-e15.pt \
    --outdir results/pun-analysis/analysis_zscored \
    --features grammar ratio --analysis --ignore-cache
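
At its core, the evaluation compares model scores against the human ratings with standard correlation measures. A minimal sketch is shown below; it assumes two aligned lists of floats, whereas eval_scoring_func.py handles its own input parsing, feature ablations, and caching.

# Sketch: correlate surprise scores with (z-scored) human funniness ratings.
from scipy.stats import pearsonr, spearmanr

def score_correlation(model_scores, human_ratings):
    r, _ = pearsonr(model_scores, human_ratings)
    rho, _ = spearmanr(model_scores, human_ratings)
    return r, rho

print(score_correlation([0.2, 1.5, 0.9, 2.3], [1.0, 3.5, 2.0, 4.5]))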

Generate puns

We generate puns given a pair of a pun word and its alternative word. The following generation methods are supported, specified by the --system argument:

  • rule: the SURGEN method described in the paper
  • rule+neural: in the last step of SURGEN, use a neural combiner to edit the topic words
  • retrieve: retrieve a sentence containing the pun word
  • retrieve+swap: retrieve a sentence containing the alternative word and replace it with the pun word.

For arguments controlling the neural generator (e.g., --beam, --nbest), see fairseq.options. All results and logs are saved in --outdir.

python generate_pun.py data/bookcorpus/edit/bin/data \
    --path models/bookcorpus/edit/deleted/checkpoint_best.pt \
    --beam 20 --nbest 1 --unkpen 100 \
    --system rule --task edit \
    --retriever-model models/bookcorpus/retriever.pkl --doc-file data/bookcorpus/raw/sent.tokenized.txt \
    --lm-path models/wikitext/wiki103.pt --word-counts-path models/wikitext/dict.txt \
    --skipgram-model data/bookcorpus/skipgram/dict.txt models/bookcorpus/skipgram/sgns-e15.pt \
    --num-candidates 500 --num-templates 100 \
    --num-topic-word 100 --type-consistency-threshold 0.3 \
    --pun-words data/semeval/hetero/dev.json \
    --outdir results/semeval/hetero/dev/rule \
    --scorer random \
    --max-num-examples 100
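
Among these, retrieve+swap is simple enough to sketch end-to-end using the toy retriever from the Retriever section above. This is only an illustration of the baseline's idea; generate_pun.py additionally filters and ranks candidates (e.g., via --scorer) and, for rule+neural, runs the neural combiner.

# Sketch of the retrieve+swap baseline: retrieve sentences containing the
# alternative word and substitute the pun word in its place.
def retrieve_swap(pun_word, alter_word, sentences, index, k=5):
    candidates = []
    for sid in sorted(index.get(alter_word, []))[:k]:
        tokens = [pun_word if tok == alter_word else tok for tok in sentences[sid]]
        candidates.append(" ".join(tokens))
    return candidates

# Example with the heterographic pun pair (butter, better):
# sentences, index = build_index("data/bookcorpus/raw/sent.tokenized.txt")
# print(retrieve_swap("butter", "better", sentences, index))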

Reference

If you use the annotated SemEval pun dataset, please cite our paper:

@inproceedings{he2019pun,
    title={Pun Generation with Surprise},
    author={He He and Nanyun Peng and Percy Liang},
    booktitle={North American Chapter of the Association for Computational Linguistics (NAACL)},
    year={2019}
}