/make_datafiles_for_pgn

The original code is from https://github.com/becxer/cnn-dailymail/

Instructions

It processes your test data into the binary format expected by the code for the TensorFlow model used in the ACL 2017 paper Get To The Point: Summarization with Pointer-Generator Networks.

Environment

Item              Detail
OS                Windows 10 64 bit
Python            Python 3.5
TensorFlow        TensorFlow 1.2.1
CUDA              CUDA® Toolkit 8.0
cuDNN             cuDNN v5.1
stanford-corenlp  stanford-corenlp-3.9.1

How to use?

1. Download Stanford CoreNLP

We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it.
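Then add the CoreNLP jar to your CLASSPATH environment variable. The original instructions name stanford-corenlp-3.7.0.jar; for the version used here the file is stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1.jar.
In my case, I added the following entry: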

D:\data\tensorflow\pgn\stanford-corenlp-full-2018-02-27\stanford-corenlp-3.9.1.jar

You can check if it's working by running

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see something like:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
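
If the java command above fails with a class-not-found error, the jar is probably missing from CLASSPATH. The following is a minimal sketch of a check you could run from Python before processing. The helper name find_corenlp_jars is hypothetical and not part of this repository; it only assumes that the jar file name starts with "stanford-corenlp", as in the path shown above.

```python
import os


def find_corenlp_jars(classpath, sep=os.pathsep):
    """Return the CLASSPATH entries that look like a Stanford CoreNLP jar.

    `sep` is the entry separator: ";" on Windows, ":" on Linux and macOS.
    """
    entries = [e.strip() for e in classpath.split(sep) if e.strip()]
    jars = []
    for entry in entries:
        # Normalise backslashes so Windows-style paths are handled on any OS.
        file_name = entry.replace("\\", "/").rsplit("/", 1)[-1]
        if file_name.startswith("stanford-corenlp") and file_name.lower().endswith(".jar"):
            jars.append(entry)
    return jars


if __name__ == "__main__":
    found = find_corenlp_jars(os.environ.get("CLASSPATH", ""))
    if found:
        print("CoreNLP jar(s) on CLASSPATH:", found)
    else:
        print("No stanford-corenlp jar found on CLASSPATH")
```

An empty result means Java will not be able to locate edu.stanford.nlp.process.PTBTokenizer, so the tokenization step will fail.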

2. Process test data into .bin

Usage: python make_datafiles.py <stories_dir> <out_dir>

d:
cd make_datafiles_dondon
python make_datafiles.py  ./stories  ./output
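
To give a sense of what the .bin output looks like, here is a rough sketch of length-prefixed record framing. It rests on an assumption taken from the upstream cnn-dailymail script rather than from this README: each record is an 8-byte length followed by that many bytes of a serialized tf.Example. The payloads below are placeholder byte strings, not real tf.Example protos, and the function names are illustrative only.

```python
import io
import struct


def write_record(stream, payload):
    """Write one record: an 8-byte length header, then the payload bytes."""
    stream.write(struct.pack("q", len(payload)))
    stream.write(struct.pack("%ds" % len(payload), payload))


def read_records(stream):
    """Yield each payload from a stream of length-prefixed records."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("q", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


# Round trip through an in-memory buffer instead of a real .bin file.
buf = io.BytesIO()
for text in (b"first example", b"second example"):
    write_record(buf, text)
buf.seek(0)
records = list(read_records(buf))
```

A reader like this can help confirm that a generated file is not empty or truncated before you pass it to the model.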

3. Download Pointer-generator Networks

4. Download pretrained model

5. Download the processed data

User @JafferWilson has provided the processed data, which you can download here.

6. Summarize your test data

python run_summarization.py --mode=decode --data_path=C:\\tmp\\data\\finished_files\\chunked\\test_*  --vocab_path=D:\\data\\tensorflow\\pgn\\CNN_Daily_Mail\\finished_files\\vocab --log_root=D:\\data\\tensorflow\\pgn --exp_name=pretrained_model --max_enc_steps=500 --max_dec_steps=40 --coverage=1 --single_pass=1
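
Because the decode command is long and the Windows paths are easy to mistype, one option is to assemble it in Python. This is only a sketch: build_decode_command is a hypothetical helper, the flags are copied from the command above, and the paths are the example paths from this README, which you would replace with your own.

```python
def build_decode_command(data_path, vocab_path, log_root, exp_name,
                         max_enc_steps=500, max_dec_steps=40):
    """Return the decode invocation as an argument list for subprocess.run."""
    return [
        "python", "run_summarization.py",
        "--mode=decode",
        "--data_path=" + data_path,
        "--vocab_path=" + vocab_path,
        "--log_root=" + log_root,
        "--exp_name=" + exp_name,
        "--max_enc_steps=%d" % max_enc_steps,
        "--max_dec_steps=%d" % max_dec_steps,
        "--coverage=1",
        "--single_pass=1",
    ]


# Raw strings keep the backslashes in Windows paths literal.
cmd = build_decode_command(
    data_path=r"C:\tmp\data\finished_files\chunked\test_*",
    vocab_path=r"D:\data\tensorflow\pgn\CNN_Daily_Mail\finished_files\vocab",
    log_root=r"D:\data\tensorflow\pgn",
    exp_name="pretrained_model",
)

# To launch it from the directory containing run_summarization.py:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, since each path reaches the script exactly as written.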