/non-anonymized-CNN-DailyMail

A script to process the non-anonymized CNN and Daily Mail datasets for summarization.

Primary language: Python. License: MIT.

A script to produce the non-anonymized CNN and Daily Mail datasets for summarization. Reference: abisee/cnn-dailymail

Environment

  • Python 3.6

Features

  • tokenized by CoreNLP
  • non-anonymized
  • lowercase
  • article information removed
  • multiprocessing
  • JSON output (more readable)

How to use it?

1. Download data

Download the stories directories from here for both CNN and Daily Mail.
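
Before moving on, it can help to confirm that the download unpacked where the later commands expect it. The sketch below counts the files in a stories directory. It assumes the files carry a `.story` extension, as in the reference abisee/cnn-dailymail pipeline, and that the directories are named `cnn/stories` and `dailymail/stories` as in step 3. `count_story_files` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def count_story_files(stories_dir):
    """Return the number of *.story files directly inside stories_dir."""
    path = Path(stories_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"stories directory not found: {path}")
    return sum(1 for _ in path.glob("*.story"))


if __name__ == "__main__":
    # Directory names follow the commands in step 3 (an assumption about your layout).
    for name in ("cnn", "dailymail"):
        try:
            print(name, count_story_files(Path(name) / "stories"))
        except FileNotFoundError as err:
            print(err)
```

A count of zero usually means the archive was extracted one level too deep or too shallow.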

2. Download CoreNLP

Download and unzip CoreNLP from here. Add the following line to your bash_profile:

export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-full-2018-02-27/stanford-corenlp-3.9.1.jar
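
After reloading your shell, a quick check that the jar is actually visible can save a failed tokenization run. The following is a minimal sketch, not part of this repository: `corenlp_jars_on_classpath` is a hypothetical helper, and it assumes the jar file name starts with `stanford-corenlp`, as in the export line above.

```python
import os


def corenlp_jars_on_classpath(classpath=None):
    """Return CLASSPATH entries whose file name looks like a CoreNLP jar."""
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")
    # Split on the platform separator (":" on Linux/macOS) and drop empty entries.
    entries = [e for e in classpath.split(os.pathsep) if e]
    return [
        e for e in entries
        if os.path.basename(e).startswith("stanford-corenlp") and e.endswith(".jar")
    ]


if __name__ == "__main__":
    jars = corenlp_jars_on_classpath()
    if not jars:
        print("No CoreNLP jar found on CLASSPATH")
    for jar in jars:
        status = "ok" if os.path.isfile(jar) else "missing on disk"
        print(jar, status)
```

If the jar is reported as missing on disk, the path in the export line most likely does not match the directory CoreNLP was unzipped into.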

3. Make dataset

# Daily Mail first, then the analogous command for CNN
# If your machine has multiple CPU cores, you can speed this up by setting -worker_num

python make_dataset.py -stories_dir dailymail/stories -tokenized_stories_dir dailymail/tokenized_stories -train_urls url_lists/dailymail_wayback_training_urls.txt -test_urls url_lists/dailymail_wayback_test_urls.txt -val_urls url_lists/dailymail_wayback_validation_urls.txt -output_dir dailymail 
python make_dataset.py -stories_dir cnn/stories -tokenized_stories_dir cnn/tokenized_stories -train_urls url_lists/cnn_wayback_training_urls.txt -test_urls url_lists/cnn_wayback_test_urls.txt -val_urls url_lists/cnn_wayback_validation_urls.txt -output_dir cnn
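
Since the two commands differ only in the dataset name, they can be generated rather than typed by hand. The sketch below rebuilds both argument lists from the flags shown above. `build_command` is a hypothetical helper, and passing an integer after `-worker_num` is an assumption about how that flag is parsed.

```python
import shlex
import sys


def build_command(dataset, worker_num=None):
    """Build the make_dataset.py argument list for 'cnn' or 'dailymail'."""
    cmd = [
        sys.executable, "make_dataset.py",
        "-stories_dir", f"{dataset}/stories",
        "-tokenized_stories_dir", f"{dataset}/tokenized_stories",
        "-train_urls", f"url_lists/{dataset}_wayback_training_urls.txt",
        "-test_urls", f"url_lists/{dataset}_wayback_test_urls.txt",
        "-val_urls", f"url_lists/{dataset}_wayback_validation_urls.txt",
        "-output_dir", dataset,
    ]
    if worker_num is not None:
        # Assumption: -worker_num is followed by an integer worker count.
        cmd += ["-worker_num", str(worker_num)]
    return cmd


if __name__ == "__main__":
    for dataset in ("dailymail", "cnn"):
        cmd = build_command(dataset)
        # Printed for inspection; to execute, pass cmd to subprocess.run(cmd, check=True).
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing first makes it easy to compare the generated commands against the ones above before running anything.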