DiGraph instead of Graph
Chtchou opened this issue · 2 comments
Is there a specific reason why pytextrank uses a Directed Graph instead of an undirected Graph?
On your explanatory notebook you use an undirected graph (like the original implementation) but in the source code it has been changed to directed graph instead?
This can creates totally different results
Thanks @Chtchou for the heads up.
I did some experiments using DiGraph and Graph(undirected) and founds results to be different.
input text:
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.
Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given.
These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.
DiGraph with edges from token to neighbour (current setup):
self.ranks: [('system', 0.127), ('type', 0.108), ('consider', 0.066), ('set', 0.063), ('inequation', 0.063), ('solution', 0.046), ('mixed', 0.044), ('give', 0.039), ('minimal', 0.038), ('number', 0.037), ('nonstrict', 0.035), ('linear', 0.029), ('equation', 0.026), ('strict', 0.025), ('algorithm', 0.023), ('solve', 0.023), ('natural', 0.022), ('constraint', 0.021), ('Diophantine', 0.021), ('use', 0.021), ('construction', 0.02), ('generating', 0.02), ('support', 0.018), ('construct', 0.014), ('component', 0.012), ('bound', 0.009), ('compatibility', 0.008), ('corresponding', 0.008), ('criterion', 0.007), ('upper', 0.007)]
Using DiGraph with direction of edges reversed:
self.ranks: [('criterion', 0.104), ('system', 0.068), ('set', 0.067), ('minimal', 0.061), ('solution', 0.056), ('upper', 0.052), ('algorithm', 0.05), ('compatibility', 0.046), ('linear', 0.043), ('bound', 0.031), ('corresponding', 0.031), ('type', 0.03), ('construct', 0.027), ('construction', 0.026), ('Diophantine', 0.025), ('component', 0.025), ('support', 0.025), ('equation', 0.023), ('inequation', 0.023), ('consider', 0.022), ('generating', 0.022), ('strict', 0.021), ('use', 0.021), ('constraint', 0.02), ('solve', 0.02), ('nonstrict', 0.016), ('natural', 0.012), ('mixed', 0.012), ('number', 0.009), ('give', 0.009)]
Using Graph
self.ranks: [('set', 0.087), ('system', 0.079), ('solution', 0.062), ('minimal', 0.057), ('linear', 0.048), ('type', 0.043), ('algorithm', 0.04), ('consider', 0.039), ('inequation', 0.037), ('compatibility', 0.032), ('constraint', 0.029), ('equation', 0.029), ('strict', 0.029), ('criterion', 0.028), ('Diophantine', 0.028), ('construct', 0.028), ('generating', 0.027), ('support', 0.027), ('use', 0.027), ('solve', 0.027), ('component', 0.025), ('construction', 0.023), ('natural', 0.021), ('nonstrict', 0.021), ('bound', 0.021), ('corresponding', 0.02), ('upper', 0.018), ('number', 0.017), ('give', 0.016), ('mixed', 0.016)]
Observations:
- current implementation tends to favour the words at the end of the sentences. Notice how
consider
gets higher rank by virtue of being at the later half of sentences andcriterion
gets punished for being at the beginning. These rankings get somewhat reversed when direction of edges change. - As per networkx pagerank docs
The PageRank algorithm was designed for directed graphs but this algorithm does not check if the input graph is directed and will execute on undirected graphs by converting each oriented edge in the directed graph to two edges.
Since co-occurence is symetric and we want influence to flow equally in both the directions, I intend to change it to Graph along with test cases changes if required.
Looks great, many thanks @Ankush-Chander and @Chtchou