Summarization: A Python repository from Rushab1

###############################################################
Summarizers
###############################################################
- Create a directory named Summarizers and clone the OpenNMT-py and Chen & Bansal summarizer repositories into it.
- Create a directory named modelfiles with 2 sub-directories: ./modelfiles/OpenNMT-py/ and ./modelfiles/chen_and_bansal/
- Download the Transformer summarizer sum_transformer_model_acc_57.25_ppl_9.22_e16.pt into ./modelfiles/OpenNMT-py/ and extract the pretrained chen_and_bansal models into ./modelfiles/chen_and_bansal/ (the download link is on the chen_and_bansal repo).
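
The directory layout described above can be created by hand. As a convenience, here is a minimal sketch that does the same with pathlib. The helper name create_layout is hypothetical and not part of this repository. It only creates the directories; cloning the repositories and downloading the model files still has to be done separately.

```python
from pathlib import Path

# Directory names taken from the setup steps above.
LAYOUT = [
    "Summarizers",
    "modelfiles/OpenNMT-py",
    "modelfiles/chen_and_bansal",
]


def create_layout(root="."):
    """Create the expected directories under root and return their paths.

    Hypothetical helper, not part of the repository.
    """
    created = []
    for rel in LAYOUT:
        path = Path(root) / rel
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_layout() from the repository root should leave existing directories untouched, so it can be re-run safely.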

###############################################################
Packages
###############################################################
Required Python packages: tensorflow, tensorflow_hub, tensorflow-gpu, pytorch

Create a folder called packages and install pyrouge and Stanford CoreNLP there.

Installing pyrouge:
- Follow the directions here: https://poojithansl7.wordpress.com/2018/08/04/setting-up-rouge/
- To run the pyrouge tests: cd pyrouge/pyrouge/ && python test.py
- Avoid running with sudo; use a virtual env instead.
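
Before running anything, it may help to check that the packages from this section are visible inside the virtual env. The sketch below is one possible way to do that. The helper name missing_modules is hypothetical, and the import names are assumptions: pytorch is normally imported as torch, and tensorflow-gpu shares the tensorflow import name.

```python
import importlib.util

# Assumed import names for the packages listed in this section.
REQUIRED_MODULES = ["tensorflow", "tensorflow_hub", "torch", "pyrouge"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment.

    Hypothetical helper, not part of the repository. It only looks the
    modules up; it does not import them.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from missing_modules() suggests the Python packages are installed. It says nothing about Stanford CoreNLP or the ROUGE Perl setup, which still need the steps above.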


###############################################################
Preprocessing
###############################################################
- The dataset must be stored in the directory Data/Datasets/<Dataset_name>
- Preferably it has 4 subdirectories: Business, Science, Sports, USIntlRelations (like the NYT corpus). If not, just set the DOMAINS variable to ["All"]; you might have to make a few more changes.
- Each of the 4 domains must contain json files (one file per article) holding pairs of sentences from the article and the summary (m article sentences x n summary sentences).
- I'll write up a Preprocessing module later.
- Look at the file dataset_example_file.json for the expected format.
- The fields source (summary) and target (article) are required. The naming is the opposite of what it should be, but somebody else made this.
- Other fields are irrelevant for this task; you may include more or discard them.
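
Until the Preprocessing module exists, a dataset can be sanity-checked against the layout above with a short script. This is a rough sketch, not the repository's loader. The function name iter_records is hypothetical, and it assumes each json file holds either a single object or a list of objects with source and target keys; dataset_example_file.json remains the authoritative reference for the format. The DOMAINS = ["All"] case is not covered.

```python
import json
from pathlib import Path

# Domain subdirectories named in the list above.
DOMAINS = ["Business", "Science", "Sports", "USIntlRelations"]


def iter_records(dataset_dir, domains=DOMAINS):
    """Yield (domain, file name, record) for every record in the dataset.

    Hypothetical helper. Raises KeyError if a record lacks 'source'
    (the summary) or 'target' (the article).
    """
    for domain in domains:
        for path in sorted((Path(dataset_dir) / domain).glob("*.json")):
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            # Assumption: a file is one object or a list of objects.
            records = data if isinstance(data, list) else [data]
            for rec in records:
                if "source" not in rec or "target" not in rec:
                    raise KeyError(f"{path}: missing 'source' or 'target'")
                yield domain, path.name, rec
```

For example, sum(1 for _ in iter_records("Data/Datasets/<Dataset_name>")) would count the records once the placeholder is replaced with a real dataset name.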