/pointer_generator_model

The pointer-generator network copies words from the source text more reliably than the baseline sequence-to-sequence model with attention. It can also copy out-of-vocabulary words, which lets the algorithm handle unseen words even when the corpus has a small vocabulary.

Primary language: Python. License: MIT.

Pointer_Generator updated for TensorFlow 1.11+

One more thing: I've removed the ROUGE function for model evaluation, because the pyrouge library seems to have been deprecated. Please raise an issue if you need to know how to still use that evaluation metric, and I'll guide you through it.

Environment

• TensorFlow 1.11+

• NLTK

• Windows 10 64 bit

• Python 3.6

• CUDA Toolkit 9.0

• cuDNN v7.2

Approach to Problem:

1. Dataset:

It consists of articles and their summaries. Some of the articles have multi-line summaries.

2. NLTK Tokenizer:

Summarize by identifying “top” sentences based on word frequency.

• Tokenize words

Preprocess the text by removing numbers, white space, stopwords, and punctuation, then tokenize all the words in the document.

• Word Frequency

Calculate the frequency of every token in the document.

• Sentence Selection

Score each sentence by summing the frequencies of its words, then select the top ‘n’ sentences by score.
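The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository: to keep it self-contained it uses a regular-expression tokenizer, a naive sentence splitter, and a tiny hand-written stopword set, all of which are assumptions standing in for NLTK's tokenizers and stopword corpus. The function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); NLTK's stopword corpus is much larger.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are",
             "it", "on", "for", "with", "was", "that", "this"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (stand-in for an NLTK sentence tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, n=2):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # word frequency over the whole document
    # Sentence score = sum of the document-level frequencies of its words.
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


if __name__ == "__main__":
    doc = "Cats chase mice. Dogs chase cats and mice. Birds sing."
    print(summarize(doc, n=1))
```

Note that summing raw frequencies favours long sentences; whether and how the repository normalizes for length is not stated here.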

3. Models:

• Baseline sequence-to-sequence model with attention

The model may attend to relevant words in the source text to generate novel words.

• Pointer-generator model

The pointer-generator network copies words from the source text more reliably than the baseline. It can also copy out-of-vocabulary words, which lets the algorithm handle unseen words even when the corpus has a small vocabulary.
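As a rough sketch of how copying works in a pointer-generator network, the decoder mixes two distributions at each step: with probability p_gen it generates a word from the fixed vocabulary, and with probability 1 - p_gen it copies a source word according to the attention weights. Out-of-vocabulary source words receive temporary ids beyond the fixed vocabulary, which is how they can appear in the output. The function and argument names below are hypothetical, and plain Python lists are used instead of TensorFlow tensors to keep the example self-contained.

```python
def final_distribution(p_gen, vocab_dist, attention, source_ids, extended_size):
    """Mix generation and copy probabilities over an extended vocabulary.

    p_gen: scalar in [0, 1], probability of generating from the fixed vocabulary.
    vocab_dist: probabilities over the fixed vocabulary (sums to 1).
    attention: attention weights over source positions (sums to 1).
    source_ids: extended-vocabulary id of each source token; out-of-vocabulary
        source words use ids >= len(vocab_dist).
    extended_size: fixed vocabulary size plus the number of source OOV words.
    """
    final = [0.0] * extended_size
    # Generation part: scale the fixed-vocabulary distribution by p_gen.
    for i, p in enumerate(vocab_dist):
        final[i] = p_gen * p
    # Copy part: add (1 - p_gen) * attention to the id of each source token.
    for a, idx in zip(attention, source_ids):
        final[idx] += (1.0 - p_gen) * a
    return final


if __name__ == "__main__":
    # Fixed vocabulary of 3 words; the second source token is OOV and gets id 3.
    dist = final_distribution(0.8, [0.5, 0.3, 0.2], [0.6, 0.4], [1, 3], 4)
    print(dist)
```

In this toy call, id 3 has zero probability under the fixed vocabulary but still receives mass from the copy part, which is the behaviour the paragraph above describes.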

4. Summarization:

To carry out summarization, the pre-trained weights generated by training the model are used.

Major Changes:

This work builds on the TextSum module from Google's TensorFlow research models. The code was converted to run on TensorFlow 1.11+, and the hyper-parameters were changed for better accuracy:

• Batch size: 10

• LSTM hidden units: 256

• Vocabulary size: 5000

• Encoding layers: 4

• Max time steps for encoder: 400

• Max time steps for decoder: 100
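One way to bake these values in as defaults, so that nothing has to be typed on the command line, is sketched below with Python's standard argparse module. The flag names are assumptions chosen for illustration; the actual script in this repository may name them differently or define them through TensorFlow's own flags mechanism.

```python
import argparse


def build_parser():
    """Build a parser whose defaults match the hyper-parameters listed above."""
    parser = argparse.ArgumentParser(
        description="Pointer-generator hyper-parameters (illustrative names)")
    parser.add_argument("--batch_size", type=int, default=10)
    parser.add_argument("--hidden_dim", type=int, default=256)    # LSTM hidden units
    parser.add_argument("--vocab_size", type=int, default=5000)
    parser.add_argument("--enc_layers", type=int, default=4)      # encoding layers
    parser.add_argument("--max_enc_steps", type=int, default=400)  # encoder time steps
    parser.add_argument("--max_dec_steps", type=int, default=100)  # decoder time steps
    return parser


if __name__ == "__main__":
    # With no arguments every value falls back to its default.
    hps = build_parser().parse_args([])
    print(vars(hps))
```

Any single value can still be overridden when needed, for example by passing `--batch_size 16`.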

How to run:

  1. The documentation of the original pointer-generator project covers most of this. For this model, train without coverage for the first 600k iterations, then train with coverage for the next 25k iterations; that should get you the result.

  2. There were a lot of parameters you had to pass in the terminal when running, so I took the liberty of giving all of them defaults. If you need to change a value, you can do so in the code.

  3. IMPORTANT FOR DECODING: you have to uncomment the decode lines in the last section of run_summarization.py. Raise an issue if you can't figure it out, and I'll help you solve it.

  4. As mentioned previously, you can make your own dataset out of a text file by using bin_vocab_creation.py.

  5. This project is a descendant of the TextSum module from Google's TensorFlow research models.
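For readers unfamiliar with the coverage phase mentioned in step 1: in the pointer-generator paper, the coverage vector is the running sum of the attention distributions from earlier decoder steps, and the coverage loss at each step sums, over source positions, the minimum of the current attention and the coverage so far, which penalizes attending to the same words repeatedly. The sketch below illustrates that idea with plain Python lists. The function names and the `use_coverage` helper are hypothetical, and the default switch point simply mirrors the 600k figure given above; it is not taken from the repository's code.

```python
def coverage_loss(attention_steps):
    """Average per-step coverage loss for one decoded sequence.

    attention_steps: list of attention distributions, one per decoder step,
    each a list of weights over the source positions.
    """
    if not attention_steps:
        return 0.0
    coverage = [0.0] * len(attention_steps[0])
    total = 0.0
    for attn in attention_steps:
        # Penalize attention that lands on positions already covered.
        total += sum(min(a, c) for a, c in zip(attn, coverage))
        # Coverage accumulates the attention distributions seen so far.
        coverage = [c + a for c, a in zip(coverage, attn)]
    return total / len(attention_steps)


def use_coverage(step, switch_step=600000):
    """Two-phase schedule: coverage is off before switch_step and on afterwards."""
    return step >= switch_step


if __name__ == "__main__":
    repeated = [[1.0, 0.0], [1.0, 0.0]]  # attends to the same word twice
    spread = [[1.0, 0.0], [0.0, 1.0]]    # attends to a different word each step
    print(coverage_loss(repeated), coverage_loss(spread))
```

Attending to the same position twice yields a positive loss, while spreading attention over different positions yields zero.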