
basically

basically is a Go implementation of the TextRank and Biased TextRank algorithms, built on prose. It provides fully unsupervised methods for keyword extraction and focused text summarization, along with some quality-of-life improvements over the original implementations.

Methods

First, the document is parsed into its constituent sentences and words using a sentence segmenter and tokenizer. Sentiment values are assigned to individual sentences, and tokens are annotated with part-of-speech tags.
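
As a rough illustration of this stage, here is what sentence segmentation and part-of-speech tagging look like using prose directly (a standalone sketch, not basically's internal parser; sentiment scoring is omitted, and the v2 import path is an assumption):

package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	text := "basically summarizes text. It can also extract keywords."

	// Segment the text into sentences and annotate tokens with POS tags.
	doc, err := prose.NewDocument(text)
	if err != nil {
		log.Fatal(err)
	}

	for _, sent := range doc.Sentences() {
		fmt.Println("Sentence:", sent.Text)
	}
	for _, tok := range doc.Tokens() {
		fmt.Printf("%s/%s ", tok.Text, tok.Tag)
	}
	fmt.Println()
}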

For keyword extraction, all words that pass the syntactic filter are added to an undirected, weighted graph, and an edge is added between words that co-occur within a window of N words. The edge weight is set to be inversely proportional to the distance between the words. Each vertex is assigned an initial score of 1, and the following ranking algorithm is run on the graph:
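
The ranking step is the weighted PageRank-style update from the TextRank paper: each vertex's score is repeatedly set to (1 - d) plus d times the weight-normalized sum of its neighbours' scores, where d is a damping factor (typically 0.85). Below is a minimal sketch of that iteration on a toy graph; the rank function, its signature, and the fixed iteration count are illustrative rather than part of basically's API:

package main

import "fmt"

// rank is an illustrative implementation of the weighted TextRank update:
// every vertex starts at a score of 1, and each iteration sets its score to
// (1 - d) plus d times the weight-normalized sum of its neighbours' scores.
func rank(weights [][]float64, d float64, iters int) []float64 {
	n := len(weights)
	scores := make([]float64, n)
	outSum := make([]float64, n)
	for i := 0; i < n; i++ {
		scores[i] = 1.0
		for j := 0; j < n; j++ {
			outSum[i] += weights[i][j] // total edge weight incident to vertex i
		}
	}

	for it := 0; it < iters; it++ {
		next := make([]float64, n)
		for i := 0; i < n; i++ {
			sum := 0.0
			for j := 0; j < n; j++ {
				if weights[j][i] > 0 && outSum[j] > 0 {
					sum += weights[j][i] / outSum[j] * scores[j]
				}
			}
			next[i] = (1 - d) + d*sum
		}
		scores = next
	}
	return scores
}

func main() {
	// Toy graph of three co-occurring words; weights fall off with distance.
	w := [][]float64{
		{0, 1, 0.5},
		{1, 0, 1},
		{0.5, 1, 0},
	}
	fmt.Println(rank(w, 0.85, 30))
}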

During post-processing, adjacent keywords are collapsed into a multi-word keyword, and the top keywords are then extracted.

For sentence extraction, every sentence is added to an undirected, weighted graph, with an edge between sentences that share common content. The edge weight is simply the number of tokens shared between the lexical representations of the two sentences. Each vertex is also assigned an initial score of 1 and a bias score based on the focus text, before the following ranking algorithm is run on the graph:
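
The biased ranking follows the Biased TextRank formulation: the update is the same as for keywords, except the (1 - d) random-jump term is scaled by each sentence's bias score, i.e. its similarity to the focus text. Continuing the sketch above, a single biased update step might look like this (biasedStep and its parameters are illustrative only):

// biasedStep performs one Biased TextRank update on the sentence graph.
// weights and outSum are as in the keyword ranking sketch above; bias[i] is
// the (normalized) similarity of sentence i to the focus text.
func biasedStep(weights [][]float64, outSum, scores, bias []float64, d float64) []float64 {
	next := make([]float64, len(scores))
	for i := range next {
		sum := 0.0
		for j := range scores {
			if weights[j][i] > 0 && outSum[j] > 0 {
				sum += weights[j][i] / outSum[j] * scores[j]
			}
		}
		// The only change from the keyword ranking: the random-jump term
		// is weighted by the sentence's similarity to the focus text.
		next[i] = (1-d)*bias[i] + d*sum
	}
	return next
}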

The top weighted sentences are then selected and sorted in chronological order to form a summary.

Further information on the two algorithms can be found in the TextRank (Mihalcea & Tarau, 2004) and Biased TextRank (Kazemi et al., 2020) papers.

Installation

go get github.com/algao1/basically

Usage

Initialization:

// Instantiate the summarizer, highlighter, and parser.
s := &btrank.BiasedTextRank{}
h := &trank.KWTextRank{}
p := parser.Create()

// Instantiate a document for every given text.
doc, err := document.Create(text, s, h, p)
if err != nil {
	log.Fatal(err)
}

Text Summarization:

// Summarize the document into 7 sentences, with no threshold value, and with respect to a focus sentence.
sents, err := doc.Summarize(7, 0, focus)
if err != nil {
	log.Fatal(err)
}

for _, sent := range sents {
	fmt.Printf("[%.2f, %.2f] %s\n", sent.Score, sent.Sentiment, sent.Raw)
}

Keyword Extraction:

// Highlight the top 7 keywords in the document, with multi-word keywords enabled.
words, err := doc.Highlight(7, true)
if err != nil {
	log.Fatal(err)
}

for _, word := range words {
	fmt.Println(word.Weight, word.Word)
}

Optionally, we can specify configurations, such as retaining conjunctions at the beginning of sentences, for our summary:

doc, err := document.Create(text, s, h, p, document.WithConjunctions())

Benchmarks

Text Summarization & Keyword Extraction

Below is a rudimentary comparison of basically's performance against other implementations using news articles from The Guardian:

Library    Language  Avg Speed
summa      Python    1.67s
basically  Go        0.48s

Things I Learned

This project was started to better familiarize myself with Go and some of its best practices.

Next Steps

Currently, the project is more or less complete, with no major updates foreseen. However, I'll be periodically updating the library as things come to mind.