Implement TextRank using a dfm
It would be great to be able to take a list of documents and quickly summarize the relevant ones. Sometimes functions like topfeatures() don't give enough context.
Does "relevant" mean similarity? If so, it is coming soon.
Yeah, I guess something that looks at the top features and then pulls up the documents containing the most of them would count as most "relevant".
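To make that notion of "relevant" concrete, here is a minimal, library-free sketch in Python. It is not quanteda code: `rank_by_top_features` and its `n_top` parameter are hypothetical names, tokenization is a plain whitespace split, and no stopwords are removed, so on real text the "top features" would be dominated by function words unless you filter them first.

```python
from collections import Counter


def rank_by_top_features(docs, n_top=5):
    """Rank documents by how many corpus-wide top features they contain.

    Hypothetical helper for illustration only; returns (doc_index, score)
    pairs sorted from highest to lowest score.
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokens = [doc.lower().split() for doc in docs]

    # Corpus-wide term frequencies, then the n_top most frequent terms.
    counts = Counter(tok for doc in tokens for tok in doc)
    top = {term for term, _ in counts.most_common(n_top)}

    # A document's score is the number of its tokens that are top features.
    scores = [sum(tok in top for tok in doc) for doc in tokens]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


if __name__ == "__main__":
    example = ["a a b", "c", "a b b"]
    print(rank_by_top_features(example, n_top=2))
```

This only counts matches, so long documents are favoured; dividing by document length would be one way to correct for that.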
TextRank applies PageRank to sentences: each sentence gets a score, and the sentences with the highest scores can be considered good summaries of a document or a collection of documents. The methodology is described here:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf.
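As a rough illustration of the idea, the sketch below scores sentences with a weighted PageRank over a word-overlap graph, following the similarity and update formulas in the paper. It is a minimal Python sketch, not the code of any R package: the function names, the whitespace tokenization, the fixed iteration count, and the default damping factor of 0.85 are all assumptions made for the example.

```python
import math


def overlap_similarity(a, b):
    """Shared-word count normalised by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:  # both sentences have a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank_scores(sentences, similarity=overlap_similarity,
                    damping=0.85, iterations=50):
    """Return one score per sentence; higher means more central."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)

    # Weighted, undirected sentence graph (no self-loops).
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in w]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion
        # to the share of the neighbour's total edge weight it receives.
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    sents = [
        "the cat sat on the mat",
        "the cat lay on the mat",
        "stock markets fell sharply today",
    ]
    for score, sent in sorted(zip(textrank_scores(sents), sents), reverse=True):
        print(round(score, 3), sent)
```

A sentence that shares no words with any other keeps only the baseline score of `1 - damping`, which is why isolated sentences end up at the bottom of the ranking.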
@jwijffels has written an R implementation. Some approaches score sentences based on GloVe (embedding) scores, but that package uses standard distance metrics computed from frequencies. It would be interesting to write a quanteda function that wraps this package.
@brousseauj The R implementation I've written, https://cran.r-project.org/web/packages/textrank/index.html, looks for word overlap. Nothing stops you from calculating another similarity metric (e.g. tf-idf, embedding similarity, or whichever other metric you prefer) and feeding that similarity function into textrank_sentences.
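To show what "another similarity metric" could look like, here is a small Python sketch that builds the sentence-by-sentence weight matrix from any similarity callable, using tf-idf cosine as the example. It does not call the textrank package: `tfidf_vectors`, `cosine`, and `similarity_matrix` are hypothetical helpers, and the tf-idf weighting (raw term frequency times `log(N / df)`) is just one common variant chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Sparse tf-idf vectors (dicts), one per tokenized sentence."""
    n = len(token_lists)
    # Document frequency: number of sentences each term appears in.
    df = Counter(term for toks in token_lists for term in set(toks))
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(toks).items()}
        for toks in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(items, similarity):
    """Pairwise weights for a sentence graph, using any similarity callable."""
    n = len(items)
    return [[similarity(items[i], items[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    sentences = ["cat sat mat", "cat lay mat", "stocks fell"]
    vectors = tfidf_vectors([s.split() for s in sentences])
    for row in similarity_matrix(vectors, cosine):
        print([round(x, 3) for x in row])
```

Swapping `cosine` for a function over sentence embeddings would leave `similarity_matrix` unchanged, which is the point of keeping the metric pluggable.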
In fact, that's what I do in quite a few projects. I now tend to use the package ruimtehol, https://cran.r-project.org/web/packages/ruimtehol/index.html, to calculate sentence embeddings and feed them to textrank to summarise sentences.
That's really interesting. I'll have to check out that embedding package! How does it compare to other embedding packages like doc2vec or word2vec? My knowledge of embeddings is quite limited, so I apologize for the basic questions!
Ruimtehol is an R wrapper around StarSpace, which allows you to embed all the things: articles, sentences, words, bigrams, labels, tags, persons, websites, entities and entity relations. Anything, really.
FYI: you can also use the R package doc2vec, https://github.com/bnosac/doc2vec, or the R package word2vec, https://github.com/bnosac/word2vec, and use their embeddings to measure text similarities that you then feed into textrank. I think I'll push doc2vec to CRAN in the coming weeks (feel free to test and provide feedback); word2vec has already been on CRAN for a long time.
Thanks @jwijffels, good suggestions. We're modularising quanteda to make it easier to maintain, easier to extend, and easier to integrate. I look forward to putting in some serious work on textmodels soon.