This repo contains the code and data of the following paper:
Towards Content Transfer through Grounded Text Generation. Shrimai Prabhumoye, Chris Quirk, Michel Galley. NAACL 2019. arXiv
- Python 3.6
- Pytorch 0.3
- sentencepiece
- NLTK
- Run nltk.download("stopwords") in your Python interpreter.
Download the train, dev, and test data for all the experiments from the following link:
http://tts.speech.cs.cmu.edu/content_transfer/train_data.zip
unzip train_data.zip
The *.src files contain the news articles, the *.cxt files the Wikipedia context, the *.tgt files the target sentences, and the *.srcxt files the news articles concatenated with the Wikipedia context (used in the CAG models).
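If the files are line-aligned, as is typical for such parallel data, a minimal loader can simply zip them together. This is an illustrative sketch, not part of the repo; the file prefix passed in is an assumption:

```python
def load_split(prefix):
    """Read prefix.src / prefix.cxt / prefix.tgt and pair them line by line."""
    with open(prefix + ".src", encoding="utf-8") as f:
        articles = [line.rstrip("\n") for line in f]
    with open(prefix + ".cxt", encoding="utf-8") as f:
        contexts = [line.rstrip("\n") for line in f]
    with open(prefix + ".tgt", encoding="utf-8") as f:
        targets = [line.rstrip("\n") for line in f]
    # Parallel files must have the same number of lines.
    assert len(articles) == len(contexts) == len(targets)
    return list(zip(articles, contexts, targets))
```

Each returned tuple is one training example: (news article, Wikipedia context, target sentence).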
Download the raw data for the train, dev, and test splits from the following link:
http://tts.speech.cs.cmu.edu/content_transfer/raw_data.zip
unzip raw_data.zip
The raw data gives the following information:
- wikiID: Wikipedia page ID
- wikiTitle: Wikipedia page Title
- wikiContext: Context of the Wikipedia article as is. This is a list of lists of sentences.
- Target: Target sentence from the Wikipedia article as is.
- clean_wikiContext: The cleaned version of the Wikipedia context. This is a list of sentences.
- clean_Target: The cleaned version of the target sentence.
- domain: The domain of the news article.
- URL: The URL of the news article.
- curlCommand: The curl command to download the news article from common crawl.
- HTML_Text: HTML of the news article converted to plain text.
- clean_HTML_Text: Clean version of the plain text.
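As an illustration of working with these fields, the sketch below reads rows of the raw-data CSV with Python's csv module and pulls out a few columns. The column names follow the field list above, but the exact on-disk layout (and any serialization of the list-valued fields such as wikiContext) should be checked against the actual files:

```python
import csv
import io

def read_raw_rows(csv_text):
    """Yield one dict per news-article/Wikipedia pair from the raw-data CSV."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row

def summarize(csv_text):
    """Collect (wikiTitle, domain, clean_Target) triples for a quick look."""
    return [(r["wikiTitle"], r["domain"], r["clean_Target"])
            for r in read_raw_rows(csv_text)]
```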
The domains.txt file contains the list of domains used to collect the dataset.
Download the trained sentencepiece model used in all experiments:
http://tts.speech.cs.cmu.edu/content_transfer/sentencepieceModel.zip
unzip sentencepieceModel.zip
To train a sentencepiece model on your data:
python sentence_piece.py -mode train -input sentencepieceModel/train.data -model_prefix testModel -model_type bpe -vocab_size 32000
To encode data using the trained sentencepiece model:
python sentence_piece.py -mode encode -input inputFilename.txt -model sentencepieceModel/bpeM.model -output outputFilename.txt
To decode the generated data using the trained sentencepiece model:
python sentence_piece.py -mode decode -input inputFilename.txt -output outputFilename.txt -model sentencepieceModel/bpeM.model
To run the SumBasic baseline:
python sumbasicUpdate.py -input raw_data/filename.csv -output filename.txt
Use the -context_update flag for the Context Informed SumBasic (CISB) variant.
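For background, the core SumBasic loop greedily picks the sentence whose words have the highest average probability, then squares the probability of every word it has used so the next pick covers new content. The sketch below is an illustrative reimplementation of that idea, not the code in sumbasicUpdate.py:

```python
from collections import Counter

def sumbasic(sentences, n_select=1):
    """Illustrative SumBasic: greedy selection by average word probability,
    down-weighting already-covered words to discourage repetition."""
    tokenized = [s.lower().split() for s in sentences]
    words = [w for toks in tokenized for w in toks]
    total = len(words)
    prob = {w: c / total for w, c in Counter(words).items()}
    chosen = []
    available = list(range(len(sentences)))
    for _ in range(min(n_select, len(sentences))):
        # Score each remaining sentence by the mean probability of its words.
        best = max(available,
                   key=lambda i: sum(prob[w] for w in tokenized[i])
                                 / max(len(tokenized[i]), 1))
        chosen.append(sentences[best])
        available.remove(best)
        # Squaring shrinks the probability of words already covered.
        for w in tokenized[best]:
            prob[w] = prob[w] ** 2
    return chosen
```

The CISB variant additionally updates these word probabilities using the Wikipedia context, which is what the -context_update flag controls.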
Please use the code base in the following git repo for the Seq2Seq and Context Agnostic Generative (CAG) models: https://github.com/shrimai/Style-Transfer-Through-Back-Translation. Refer to the example.sh file in that repo to see the commands.
Download the trained CAG model:
http://tts.speech.cs.cmu.edu/content_transfer/cag_model.zip
unzip cag_model.zip
Download the trained CIG model:
http://tts.speech.cs.cmu.edu/content_transfer/cig_model.zip
unzip cig_model.zip
Follow the example.sh file in the context_receptive_generative/ directory.
Download the trained CRG model:
http://tts.speech.cs.cmu.edu/content_transfer/crg_model.zip
unzip crg_model.zip
If you are using this data or code, please cite the following paper:
@inproceedings{content_transfer_naacl19,
title={Towards Content Transfer through Grounded Text Generation},
author={Prabhumoye, Shrimai and Quirk, Chris and Galley, Michel},
year={2019},
booktitle={Proc. NAACL}
}