Implement TextRank using a dfm

Question

Opened this issue 5 years ago · 8 comments

It would be great to be able to take a list of documents and quickly summarize the relevant ones. Sometimes functions like topfeatures() don't give enough context.

Answer 1 · 2019-06-28T04:48:12.000Z

Does "relevant" mean similarity? If so, it is coming soon.

Answer 2 · 2019-06-28T12:47:02.000Z

Yeah, I guess so. Something that looks at the top features and then pulls up the documents containing the most of them would count as the most "relevant".

Answer 3 · 2019-06-28T14:52:03.000Z

TextRank applies PageRank to score sentences, so that the sentences with the highest scores can be considered good summaries of a document or a collection of documents. The methodology is described here:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf.
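
For readers who want to see the idea in miniature, here is a rough, language-neutral sketch in plain Python. It is not the code of any package mentioned in this thread. It assumes whitespace tokenisation, the log-normalised word overlap similarity from the linked paper, and a fixed number of iterations instead of a convergence check; the names `overlap_similarity` and `textrank` are made up for illustration.

```python
import math
from itertools import combinations


def overlap_similarity(a, b):
    """Shared words divided by the sum of the log sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences have exactly one token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank on an undirected graph."""
    # Assumption: naive lowercase + whitespace tokenisation.
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)

    # Symmetric edge weights between every pair of sentences.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = overlap_similarity(tokens[i], tokens[j])

    # Total edge weight leaving each sentence, used to normalise votes.
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the fish sat on the mat",
    "quantum physics is hard",
]
scores = textrank(docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best], scores)
```

A sentence that shares no words with the others receives only the baseline score of `1 - damping`, so it ranks last.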

@jwijffels has written an R implementation. Some approaches score sentences based on GloVe (embedding) scores, but that package uses standard distance metrics from frequencies. It would be interesting to write a quanteda function that wraps this package.

Answer 4 · 2019-06-28T14:58:20.000Z

@brousseauj The R implementation I've written in https://cran.r-project.org/web/packages/textrank/index.html looks for word overlap. Nothing stops you from calculating another similarity metric (e.g. tfidf / embedding similarity / whichever other similarity metric) and feeding that similarity function into textrank_sentences.
In fact, that's what I do in quite a few projects. I now tend to use the package ruimtehol https://cran.r-project.org/web/packages/ruimtehol/index.html to calculate sentence embeddings and feed them to textrank to summarise sentences.
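
To illustrate what swapping the similarity metric amounts to, here is a small sketch in plain Python, not the textrank package itself. The ranking step takes any symmetric similarity function as an argument, and cosine similarity over made-up toy vectors stands in for real sentence embeddings. `cosine` and `rank_items` are hypothetical names, and clipping negative similarities to zero is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_items(items, similarity, damping=0.85, iterations=50):
    """Weighted PageRank over items, using a caller-supplied similarity."""
    n = len(items)
    # Assumption: similarity is symmetric; negative values are clipped to 0.
    weights = [
        [max(0.0, similarity(items[i], items[j])) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Toy two-dimensional "embeddings"; real ones would come from a trained model.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
print(rank_items(vectors, cosine))
```

Because the ranking step only sees the similarity function, tf-idf or any other metric could be passed in the same way.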

Answer 5 · 2019-07-09T12:05:25.000Z

That's really interesting. I'll have to check out that embedding package! How does it compare to other embedding packages like doc2vec or word2vec? My knowledge of embeddings is quite limited, so I apologize for the basic questions!

Answer 6 · 2019-07-09T13:46:54.000Z

Ruimtehol is an R wrapper around StarSpace, which lets you embed all the things: articles, sentences, words, bigrams, labels, tags, persons, websites, entities and entity relations. Anything, really.

Answer 7 · 2020-11-29T06:43:21.000Z

FYI: you can also use the R package doc2vec https://github.com/bnosac/doc2vec or the R package word2vec https://github.com/bnosac/word2vec and use their embeddings to measure text similarities that you then feed into textrank. I think I'll push doc2vec to CRAN in the coming weeks (feel free to test & provide feedback); word2vec has already been on CRAN for a long time.

Answer 8 · 2020-11-29T18:45:25.000Z

Thanks @jwijffels, good suggestions. We're modularising quanteda to make it easier to maintain, easier to extend, and easier to integrate. I look forward to putting in some serious work on textmodels soon.