Doc_Summarization

Figure 1.0 shows the steps of the proposed methodology.

Figure 1.0 - The workflow of the proposed methodology.

Working

The input data used for this project was sourced from various wiki platforms.

Libraries/Dependencies used: the code relies on the following libraries.

  • NLTK - used for stopword removal, tokenization, stemming, and NER tagging, among other tasks
  • NumPy - used for array creation and manipulation
  • sys - used to print the size of the data structures used in the program
  • Matplotlib - used to visualize the data by plotting the input matrices
  • StanfordNERTagger - Stanford NER is a Java implementation of a Named Entity Recognizer. NLTK provides a wrapper for the Stanford tagger, so it can be run from Python.
  • networkx - used for working with graphs
  • sklearn - used to convert a collection of text documents to a matrix of token counts and to transform that count matrix to a normalized tf or tf-idf representation

Results & Analysis

Two summarization approaches were implemented:

  • Extractive summarization using the TextRank algorithm;
  • Pronoun-resolution summarization using Named Entity Recognition.

  • An automatic summarization module named sumy was used as a benchmark to analyze the results of the two algorithms.

    The TextRank algorithm’s summarized output is short and precise, showing the most important sentences first, ranked by scores computed from the sentence similarity matrix. Its main drawback is that the sentences appear in rank order rather than in a meaningful order, which makes the summary harder to follow. It also does not account for pronoun use in the summary: sentences are taken verbatim from the document, with no regard for the order a meaningful summary should follow.
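    The ranking step described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project's code: it assumes plain word-count vectors with cosine similarity, a regex sentence splitter, and a hand-written power iteration as a stand-in for the sklearn, NLTK, and networkx calls the project uses. The function names and the `keep_order` flag are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive regex sentence splitter (a stand-in for NLTK tokenization)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration over the similarity matrix."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Similarity matrix: edge weight between every pair of distinct sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / row_sums[j] * scores[j]
                            for j in range(n) if row_sums[j])
            for i in range(n)
        ]
    return scores


def summarize(text, k=2, keep_order=False):
    """Return the k highest-ranked sentences, optionally in document order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    if keep_order:
        top.sort()  # restore the original sentence order
    return [sentences[i] for i in top]


if __name__ == "__main__":
    sample = ("The cat sat on the mat. The cat chased the mouse. "
              "The mouse ran from the cat. Stocks fell sharply today.")
    print(summarize(sample, k=2))
```

    With `keep_order=False` the sentences come back in rank order, which is the behaviour criticized above; sorting the selected indices before output is one simple way to keep the summary in document order.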

    The Named Entity Recognition algorithm’s summarized output behaves as its name suggests: it recognizes the named entities (proper nouns, here the names of people) and replaces the pronouns in the article with the corresponding name. However, this comes at the expense of the actual goal of producing meaningful summaries, and the output does not retain the grammatical correctness of the original article.
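    The pronoun replacement described above can be illustrated with a small sketch. It assumes the input is a list of (token, tag) pairs in the shape that NLTK's StanfordNERTagger `tag()` method returns, and it uses a naive "most recent PERSON" heuristic. The tags below are written by hand for illustration, and `resolve_pronouns` and `PRONOUNS` are hypothetical names, not the project's code.

```python
# Pronouns to replace; a deliberately small, hypothetical set.
PRONOUNS = {"he", "she", "him", "her", "his"}


def resolve_pronouns(tagged_tokens):
    """Replace each pronoun with the most recently seen PERSON entity.

    tagged_tokens: list of (token, tag) pairs, where consecutive tokens
    tagged "PERSON" are treated as one name.
    """
    last_person = None  # full name of the latest completed PERSON span
    current = []        # tokens of the PERSON span being read
    out = []
    for token, tag in tagged_tokens:
        if tag == "PERSON":
            current.append(token)
            out.append(token)
            continue
        if current:
            # A non-PERSON token closes the current name span.
            last_person = " ".join(current)
            current = []
        if token.lower() in PRONOUNS and last_person:
            out.append(last_person)
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    # Hand-tagged example input (assumed tagger output, not real results).
    tagged = [
        ("Marie", "PERSON"), ("Curie", "PERSON"), ("won", "O"), ("two", "O"),
        ("Nobel", "O"), ("Prizes", "O"), (".", "O"),
        ("She", "O"), ("shared", "O"), ("her", "O"), ("first", "O"),
        ("prize", "O"), (".", "O"),
    ]
    print(" ".join(resolve_pronouns(tagged)))
    # prints: Marie Curie won two Nobel Prizes . Marie Curie shared Marie Curie first prize .
```

    The second sentence of the output shows the grammatical problem noted above: the possessive "her" is replaced by the bare name, so the result is no longer well-formed English.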