Understanding the discounted normalized rank
debraj135 opened this issue · 2 comments
I wanted to understand if there is any documentation explaining the logic between here and here.
It is not clear to me why the phrase rank is the square root of the sum rank / (span length + # non-lemmas)
. More specifically, why add span length and non lemmas, aren't you over counting the non lemmas here. Also, why the square root?
Along similar lines, the motivation behind the formula for the non lemma discount, as well as its purpose in line 553 is unclear to me. It would be great if you could help me build some intuition around this.
Why is it not sufficient to simply use sum rank / span length
?
Thank you for any response, and kudos for the great work!
Hi @debraj135 , great questions!
I've needed to revisit that part, and describe it better – based on our upcoming work.
FWIW, the _calc_discounted_normalised_rank()
method will be replaced during the kglab
integration, although I can describe the motivation for it and what purpose it served. This is closely related to one of your questions:
why add span length and non lemmas, aren't you over counting the non lemmas here
The original paper [mihalcea04textrank] describes how to construct a graph and calculate the TextRank algorithm for extracting ranked "keywords", as well as how to run extractive summarization using the graph results. However, the results in the paper (e.g., Figure 2) show multi-word phrases being extracted and ranked, based on this approach:
"During post-processing, all lexical units selected as potential keywords by the TextRank algorithm are marked in the text, and sequences of adjacent keywords are collapsed into a multi-word keyword.
When I first worked on an implementation of this algorithm in 2008, I identified what appeared to be a bug in the paper's pseudocode. A co-worker similarly tried to implement the pseudocode and identified the same problem. So my first Java open source implementation on GitHub for TextRank had that bug fixed, which appeared to lead to slightly improved results.
Meanwhile, empirically the single-word keyphrases seemed to have good rankings, although the multi-word results appeared less good. At that time, I moved to a company (ShareThis) where use used the Java impl in production for a large recommender system (based on parsing 8 million documents) and it became a priority to improve multi-word keyphrase extraction.
It's possible to leverage NER results, noun chunk results, etc., that result from popular NLP packages – and along the way we've used OpenNLP, NLTK, TextBlob, and now spaCy. Again, empirically the results from these packages can be poor at times, although they've improved dramatically since 2015 with spaCy and now again with deep learning models used within spaCy pipeline components. Even so, the packages haven't been all that good at extracting keyphrases -- not without lots of prior training.
If one considers the general case of noun phrases to be the space of potential multi-word keyphrases to extract, that leads to a problem where a "greedy" algorithm will tend to produce overly long phrases as the top-ranked results. So we needed an approach that was relatively mathematically stable and robust to discount the overly long phrases while rewarding the "longer than a single word" cases.
To answer about "over counting non lemmas" specifically, in the code section you've identified, that's a point estimate calculation. In other words, we're using notions of shrinkage and interval theory to condition the ratio of:
[ sum of ranks of lemma tokens ] / [ number of non-lemma tokens within the span ]
Mihalcea had originally only considered multi-word phrases where all of the words within a span are adjacent lemmas in the TextRank graph. This approach leverages the "skip gram" aspects of TextRank to hop over non-lemma words, which appears to construct phrases that are closer to what human annotators would do. The point estimates help make this "discounted" rank more robust in a wide range of applications (what we've found so far in practice, at least).
Again, now as we're bringing in kglab
to provide a semantic layer as an "outer shell" constructed beyond the lemma graph of PyTextRank, then we gain several advantages:
-
can leverage the hypernyms and hyponyms within the taxonomy defined by an input semantic graph (KG) to enrich the lemma graph, prior to running the centrality graph algorithm to obtain the ranks – experimentally, that's been demonstrated to improve results overall quite a lot.
-
can leverage inference within the KG graph (multiple forms: semantic closures, probabilistic graphs reasoning, embedding with deep learning, etc.) to help determine which are the more likely phrases to extract.
-
can use the KG to guide the multi-word keyphrase extraction (priors)
So the KG as input can help in multiple ways to extract multi-word keyphrases while not relying solely on an overly greedy statistical parser approaches. A side-effect is that after TextRank with KG calculates, then entity linking of phrases back into the KG is a natural outcome.
BTW, this comment in the code is incorrect:
use a root mean square (RMS) to normalize the contributions of all the tokens
There was another part of an earlier implementation that used RMS for approximating a multivariate ranking, similar to the usage of this https://github.com/DerwenAI/kglab/blob/main/kglab/util.py#L91 although the point estimate above simplified that, and I didn't remove the comment.
Does this explanation help provide some insights to what's intended?
Also, do you have other approaches to suggest? We're all ears :)
If you're interested in working on PyTextRank, we have a Slack for the developers where we can discuss more. I could send you a link if you'd like?
Thank you for the detailed explanation. This definitely provides a lot of clarity into the rank computation.
Due to other commitments, I will not be able to spend time for working on PyTextRank at the moment. However, I am an admirer of all the work this group has been up to with PyTextRank and kglab and will look forward to the integration.