This repo contains the code and data of the following paper:
Towards Content Transfer through Grounded Text Generation. Shrimai Prabhumoye, Chris Quirk, Michel Galley. NAACL 2019. arXiv
- Python 3.6
- Pytorch 0.3
- sentencepiece
- NLTK
- Run nltk.download("stopwords") in your Python interpreter.
Download the train, dev, and test data for all the experiments from the following link:
http://tts.speech.cs.cmu.edu/content_transfer/train_data.zip
unzip train_data.zip
The *.src files contain the news articles, the *.cxt files the Wikipedia context, the *.tgt files the target sentences, and the *.srcxt files the news articles concatenated with the Wikipedia context (used in the CAG models).
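If the files are line-aligned, as is typical for such parallel data, a minimal loader can simply zip them together. This is an illustrative sketch, not part of the repo; the file prefix passed in is an assumption:

```python
def load_split(prefix):
    """Read prefix.src / prefix.cxt / prefix.tgt and pair them line by line."""
    with open(prefix + ".src", encoding="utf-8") as f:
        articles = [line.rstrip("\n") for line in f]
    with open(prefix + ".cxt", encoding="utf-8") as f:
        contexts = [line.rstrip("\n") for line in f]
    with open(prefix + ".tgt", encoding="utf-8") as f:
        targets = [line.rstrip("\n") for line in f]
    # Parallel files must have the same number of lines.
    assert len(articles) == len(contexts) == len(targets)
    return list(zip(articles, contexts, targets))
```

Each returned tuple is one training example: (news article, Wikipedia context, target sentence).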
Download the raw data for the train, dev, and test splits from the following link:
http://tts.speech.cs.cmu.edu/content_transfer/raw_data.zip
unzip raw_data.zip
The raw data gives the following information:
- wikiID: Wikipedia page ID
- wikiTitle: Wikipedia page Title
- wikiContext: Context of the Wikipedia article as is. This is a list of lists of sentences.
- Target: Target sentence from the Wikipedia article as is.
- clean_wikiContext: The cleaned version of the Wikipedia context. This is a list of sentences.
- clean_Target: The cleaned version of the target sentence.
- domain: The domain of the news article.
- URL: The URL of the news article.
- curlCommand: The curl command to download the news article from common crawl.
- HTML_Text: HTML of the news article converted to plain text.
- clean_HTML_Text: Clean version of the plain text.
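As an illustration of working with these fields, the sketch below reads rows of the raw-data CSV with Python's csv module and pulls out a few columns. The column names follow the field list above, but the exact on-disk layout (and any serialization of the list-valued fields such as wikiContext) should be checked against the actual files:

```python
import csv
import io

def read_raw_rows(csv_text):
    """Yield one dict per news-article/Wikipedia pair from the raw-data CSV."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield row

def summarize(csv_text):
    """Collect (wikiTitle, domain, clean_Target) triples for a quick look."""
    return [(r["wikiTitle"], r["domain"], r["clean_Target"])
            for r in read_raw_rows(csv_text)]
```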
The domains.txt file contains the list of domains used to collect the dataset.
Download the trained sentencepiece model used in all experiments:
http://tts.speech.cs.cmu.edu/content_transfer/sentencepieceModel.zip
unzip sentencepieceModel.zip
To train a sentencepiece model on your data:
python sentence_piece.py -mode train -input sentencepieceModel/train.data -model_prefix testModel -model_type bpe -vocab_size 32000
To encode data using the trained sentencepiece model:
python sentence_piece.py -mode encode -input inputFilename.txt -model sentencepieceModel/bpeM.model -output outputFilename.txt
To decode the generated data using the trained sentencepiece model:
python sentence_piece.py -mode decode -input inputFilename.txt -output outputFilename.txt -model sentencepieceModel/bpeM.model
To run the SumBasic baseline:
python sumbasicUpdate.py -input raw_data/filename.csv -output filename.txt
Use the -context_update flag for the Context Informed SumBasic (CISB) variant.
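For background, the core SumBasic loop greedily picks the sentence whose words have the highest average probability, then squares the probability of every word it has used so the next pick covers new content. The sketch below is an illustrative reimplementation of that idea, not the code in sumbasicUpdate.py:

```python
from collections import Counter

def sumbasic(sentences, n_select=1):
    """Illustrative SumBasic: greedy selection by average word probability,
    down-weighting already-covered words to discourage repetition."""
    tokenized = [s.lower().split() for s in sentences]
    words = [w for toks in tokenized for w in toks]
    total = len(words)
    prob = {w: c / total for w, c in Counter(words).items()}
    chosen = []
    available = list(range(len(sentences)))
    for _ in range(min(n_select, len(sentences))):
        # Score each remaining sentence by the mean probability of its words.
        best = max(available,
                   key=lambda i: sum(prob[w] for w in tokenized[i])
                                 / max(len(tokenized[i]), 1))
        chosen.append(sentences[best])
        available.remove(best)
        # Squaring shrinks the probability of words already covered.
        for w in tokenized[best]:
            prob[w] = prob[w] ** 2
    return chosen
```

The CISB variant additionally updates these word probabilities using the Wikipedia context, which is what the -context_update flag controls.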
Please use the code base in the following git repo for the Seq2Seq and Context Agnostic Generative (CAG) models: https://github.com/shrimai/Style-Transfer-Through-Back-Translation. Refer to the example.sh file in that repo to see the commands.
Download the trained CAG model:
http://tts.speech.cs.cmu.edu/content_transfer/cag_model.zip
unzip cag_model.zip
Download the trained CIG model:
http://tts.speech.cs.cmu.edu/content_transfer/cig_model.zip
unzip cig_model.zip
Follow the example.sh file in the context_receptive_generative/ directory.
Download the trained CRG model:
http://tts.speech.cs.cmu.edu/content_transfer/crg_model.zip
unzip crg_model.zip
If you are using this data or code, please cite the following paper:
@inproceedings{content_transfer_naacl19,
title={Towards Content Transfer through Grounded Text Generation},
author={Prabhumoye, Shrimai and Quirk, Chris and Galley, Michel},
year={2019},
booktitle={Proc. NAACL}
}