
Python implementation of TextRank for phrase extraction and summarization of text documents

Primary language: Jupyter Notebook. License: MIT.

PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to:

  • extract the top-ranked phrases from text documents
  • run low-cost extractive summarization of text documents
  • help infer links from unstructured text into structured data

Background

One of the goals for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.

The introduction of graph algorithms -- notably, eigenvector centrality -- provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed. The entity linking aspects here are still a work-in-progress scheduled for a later release.

Internally, PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.

For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns, since the lemma kitten is a hyponym of the lemma cat, so an inferred link can be added between them.

This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.
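
The cat/kitten case above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dictionary-of-sets graph, the function names, and the hand-written hyponym table are all hypothetical stand-ins, not PyTextRank's internal data structures or a real WordNet lookup.

```python
from collections import defaultdict


def build_lemma_graph(sentences):
    """Link lemmas that occur in the same sentence (undirected graph)."""
    graph = defaultdict(set)
    for lemmas in sentences:
        for a in lemmas:
            for b in lemmas:
                if a != b:
                    graph[a].add(b)
    return graph


def add_inferred_links(graph, hyponyms):
    """Add an edge for each (hyponym, hypernym) pair present in the graph."""
    added = []
    for hypo, hyper in hyponyms.items():
        if hypo in graph and hyper in graph and hyper not in graph[hypo]:
            graph[hypo].add(hyper)
            graph[hyper].add(hypo)
            added.append((hypo, hyper))
    return added


# hypothetical, already-lemmatized sentences
sentences = [["cat", "sleep", "sofa"], ["kitten", "chase", "string"]]
# hypothetical hyponym table standing in for a resource such as WordNet
hyponyms = {"kitten": "cat"}

graph = build_lemma_graph(sentences)
added = add_inferred_links(graph, hyponyms)
print(added)
```

Before enrichment the two sentences form disconnected components; the inferred edge joins them, so rank can flow between "cat" and "kitten" during phrase ranking.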

The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)

Several modifications in PyTextRank improve on the algorithm originally described in the paper:

  • fixed a bug: see Java impl, 2008
  • use lemmatization in place of stemming
  • include verbs in the graph (but not in the resulting phrases)
  • leverage preprocessing via noun chunking and named entity recognition
  • provide extractive summarization based on ranked phrases
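
To make the ranking step concrete, here is a minimal sketch of the core idea from the paper: build a co-occurrence graph over lemmas, then score the nodes with a PageRank-style power iteration. It is not PyTextRank's code. The function names, the window size, the damping factor, and the iteration count are assumptions chosen for illustration, and the sketch leaves out part-of-speech filtering, noun chunking, and phrase assembly.

```python
from collections import defaultdict


def cooccurrence_graph(lemmas, window=2):
    """Link each lemma to the lemmas that follow it within `window` positions."""
    graph = defaultdict(set)
    for i, a in enumerate(lemmas):
        for b in lemmas[i + 1 : i + 1 + window]:
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Score nodes by PageRank-style power iteration on an undirected graph."""
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        new_ranks = {}
        for node in graph:
            # each neighbor shares its rank equally among its own neighbors
            incoming = sum(ranks[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_ranks[node] = (1.0 - damping) / len(graph) + damping * incoming
        ranks = new_ranks
    return ranks


# hypothetical lemma sequence, loosely based on the example text in Usage
lemmas = ["system", "linear", "constraint", "natural", "number",
          "system", "linear", "equation"]
ranks = rank_nodes(cooccurrence_graph(lemmas))
for lemma, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print("{:.4f}  {}".format(score, lemma))
```

Lemmas that co-occur with many other lemmas accumulate the most rank, which is the eigenvector-centrality intuition described in the Background section.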

This implementation was inspired by the Williams 2016 talk on text summarization. Note that while much better approaches exist for summarizing text, questions linger about some of the top contenders -- see: 1, 2. Arguably, having alternatives such as this allows for cost trade-offs.

Installation

Prerequisites:

To install from PyPI:

pip install pytextrank
python -m spacy download en_core_web_sm

If you install directly from this Git repo, be sure to install the dependencies as well:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
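
Ranked phrases such as the ones printed above are also the basis for extractive summarization. The following is a rough sketch of that idea in plain Python, not the library's own summarization method: the `summarize` function, its scoring rule (sum of the ranks of the phrases a sentence contains), and the rank values are all hypothetical.

```python
def summarize(sentences, phrase_ranks, limit_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    scored = []
    for index, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # a sentence scores the sum of the ranks of the phrases it contains
        score = sum(rank for phrase, rank in phrase_ranks.items()
                    if phrase in lowered)
        scored.append((score, index))
    # highest score first, ties broken by earlier position
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:limit_sentences]
    return [sentences[index] for _, index in sorted(top, key=lambda pair: pair[1])]


sentences = [
    "Compatibility of systems of linear constraints over the set of natural numbers.",
    "Criteria of compatibility of a system of linear Diophantine equations are considered.",
    "Upper bounds for components of a minimal set of solutions are given.",
]
# hypothetical ranks, not actual PyTextRank output
phrase_ranks = {
    "minimal set": 0.12,
    "linear constraints": 0.10,
    "linear diophantine equations": 0.09,
    "natural numbers": 0.08,
}

summary = summarize(sentences, phrase_ranks)
for sentence in summary:
    print(sentence)
```

Because no model inference is needed beyond the phrase ranking, this style of summarization stays cheap, which is the cost trade-off mentioned in the Background section.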

For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems:

For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".

Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.

Links

Testing

To run the unit tests:

coverage run -m unittest discover

To generate a coverage report and upload it to the codecov.io reporting site:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/DerwenAI/pytextrank

Attribution

PyTextRank has an MIT license, which is succinct and simplifies use in commercial applications.

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software. Citations are helpful for the continued development and maintenance of the library.

@Misc{PyTextRank,
    author = {Nathan, Paco},
    title = {PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents},
    howpublished = {\url{https://github.com/DerwenAI/pytextrank/}},
    year = {2016}
}

TODOs

  • build a conda package
  • show examples of spacy-wordnet to enrich the lemma graph
  • leverage neuralcoref to enrich the lemma graph
  • generate a phrase graph, with entity linking into Wikidata, etc.
  • include more unit tests
  • fix Sphinx errors, generate docs

Kudos

Many thanks to our contributors: @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @Ankush-Chander, @shyamcody, @chikubee, encouragement from the wonderful folks at spaCy, plus general support from Derwen, Inc.
