`basically` is a Go implementation of the TextRank and Biased TextRank algorithms, built on [prose](https://github.com/jdkato/prose). It provides fully unsupervised methods for keyword extraction and focused text summarization, along with some additional quality-of-life features over the original implementations.
First, the document is parsed into its constituent sentences and words using a sentence segmenter and tokenizer. Sentiment values are assigned to individual sentences, and tokens are annotated with part-of-speech tags.
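As a rough illustration of this step, here is a minimal sketch using the prose library directly (`basically` wraps this behind its parser, so the exact pipeline may differ):

```go
package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Parse a document: prose segments sentences, tokenizes,
	// and annotates each token with a part-of-speech tag.
	doc, err := prose.NewDocument("Go is expressive and efficient. It compiles quickly.")
	if err != nil {
		log.Fatal(err)
	}
	for _, sent := range doc.Sentences() {
		fmt.Println("Sentence:", sent.Text)
	}
	for _, tok := range doc.Tokens() {
		fmt.Printf("%s/%s ", tok.Text, tok.Tag) // e.g. Go/NNP is/VBZ
	}
}
```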
For keyword extraction, all words that pass the syntactic filter are added to an undirected, weighted graph, and an edge is added between words that co-occur within a window of N words. The edge weight is set to be inversely proportional to the distance between the words. Each vertex is assigned an initial score of 1, and the following ranking algorithm is run on the graph:
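$$
WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
$$

This is the weighted PageRank update from the original TextRank paper: $d$ is a damping factor (typically 0.85), and $w_{ji}$ is the weight of the edge between vertices $V_j$ and $V_i$.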
During post-processing, adjacent keywords are collapsed into multi-word keywords, and the top keywords are then extracted.
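A minimal sketch of the collapsing step (illustrative only, with a hypothetical `collapse` helper, not `basically`'s actual internals): scan the original token sequence and merge consecutive tokens that are both top-ranked keywords.

```go
package keywords

import "strings"

// collapse merges runs of adjacent keywords in the original token
// sequence into multi-word keywords. Illustrative, not the library's API.
func collapse(tokens []string, isKeyword map[string]bool) []string {
	var out, run []string
	for _, tok := range tokens {
		if isKeyword[tok] {
			run = append(run, tok) // extend the current keyword run
			continue
		}
		if len(run) > 0 {
			out = append(out, strings.Join(run, " "))
			run = nil
		}
	}
	if len(run) > 0 {
		out = append(out, strings.Join(run, " "))
	}
	return out
}
```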
For sentence extraction, every sentence is added to an undirected, weighted graph, with an edge between sentences that share common content. The edge weight is set simply as the number of common tokens between the lexical representations of the two sentences. Each vertex is also assigned an initial score of 1, and a bias score based on the focus text, before the following ranking algorithm is run on the graph:
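$$
WS(V_i) = (1 - d) \times Bias(V_i) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)
$$

Biased TextRank replaces the uniform $(1 - d)$ term of TextRank with a per-vertex bias, so sentences similar to the focus text accumulate proportionally more weight.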
The top-weighted sentences are then selected and sorted in chronological order to form a summary.
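A sketch of this selection step, assuming each sentence records its score and original position (hypothetical types, not the library's):

```go
package summary

import "sort"

// Sentence is a hypothetical type: rank score plus original position.
type Sentence struct {
	Index int     // position in the source document
	Score float64 // converged rank score
	Raw   string
}

// topK picks the k highest-scoring sentences, then restores
// their original (chronological) order.
func topK(sents []Sentence, k int) []Sentence {
	sort.Slice(sents, func(i, j int) bool { return sents[i].Score > sents[j].Score })
	if k > len(sents) {
		k = len(sents)
	}
	top := append([]Sentence(nil), sents[:k]...)
	sort.Slice(top, func(i, j int) bool { return top[i].Index < top[j].Index })
	return top
}
```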
Further information on the two algorithms can be found in the original TextRank paper (Mihalcea & Tarau, 2004) and the Biased TextRank paper (Kazemi et al., 2020).
Installation:

```
go get github.com/algao1/basically
```
Initialization:
```go
// Instantiate the summarizer, highlighter, and parser.
s := &btrank.BiasedTextRank{}
h := &trank.KWTextRank{}
p := parser.Create()

// Instantiate a document for every given text.
doc, err := document.Create(text, s, h, p)
if err != nil {
	log.Fatal(err)
}
```
Text Summarization:
```go
// Summarize the document into 7 sentences, with no threshold value,
// and with respect to a focus sentence.
sents, err := doc.Summarize(7, 0, focus)
if err != nil {
	log.Fatal(err)
}
for _, sent := range sents {
	fmt.Printf("[%.2f, %.2f] %s\n", sent.Score, sent.Sentiment, sent.Raw)
}
```
Keyword Extraction:
```go
// Highlight the top 7 keywords in the document, with multi-word keywords enabled.
words, err := doc.Highlight(7, true)
if err != nil {
	log.Fatal(err)
}
for _, word := range words {
	fmt.Println(word.Weight, word.Word)
}
```
Optionally, we can specify configurations, such as retaining conjunctions at the beginning of sentences in our summary:

```go
doc, err := document.Create(text, s, h, p, document.WithConjunctions())
```
Below is a rudimentary comparison of `basically`'s performance against other implementations, using news articles from The Guardian:

| Library | Language | Avg. Speed |
| --- | --- | --- |
| summa | Python | 1.67s |
| basically | Go | 0.48s |
This project was started to better familiarize myself with Go and some of its best practices:
- How to idiomatically structure applications
- How to idiomatically handle errors
- How to style and format Go code
- How to test and benchmark
Currently, the project is more or less complete, with no major updates foreseeable. However, I'll be periodically updating the library as things come to mind.