Input text from RAKE paper shows different output
sleepycat opened this issue · 2 comments
Hey @waseem18!
In one of the tests the text includes "'committal, theory" and I noticed that node-rake
listed one of the keywords as "'committal theory". This makes me think there is something not quite right in the algorithm. Maybe a regex needs some adjustment.
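For context, RAKE builds its candidate keywords by cutting the text at phrase delimiters (punctuation) and at stop words, so a comma should always end a phrase. Here is a minimal sketch of that splitting step. The `candidatePhrases` helper and the tiny stop list are made up for illustration; this is not node-rake's actual `generatePhrases`, and the punctuation set is an assumption.

```javascript
// Hypothetical helper for illustration only, not node-rake's code.
// A tiny made-up stop list; a real one is much longer.
const STOP_WORDS = new Set(['a', 'an', 'and', 'is', 'of', 'the']);

function candidatePhrases(text) {
  const phrases = [];
  // 1. Split on punctuation first, so a comma always ends a phrase.
  for (const fragment of text.split(/[.,;:!?()\[\]"]+/)) {
    let current = [];
    // 2. Inside a fragment, stop words act as phrase delimiters.
    for (const word of fragment.trim().split(/\s+/)) {
      const lowered = word.toLowerCase();
      if (lowered === '' || STOP_WORDS.has(lowered)) {
        if (current.length > 0) phrases.push(current.join(' '));
        current = [];
      } else {
        current.push(lowered);
      }
    }
    // Flush whatever is left at the end of the fragment.
    if (current.length > 0) phrases.push(current.join(' '));
  }
  return phrases;
}

console.log(candidatePhrases("'committal, theory"));
```

With punctuation handled before the stop words, "'committal" and "theory" come out as two separate candidates instead of being joined across the comma.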
To dig a little deeper I looked at the paper (I haven't read the whole thing yet). It includes some example text and lists the expected output on pages 161-162. I thought that would make a good test:
it('produces the output from the paper', () => {
  const textFromThePaper = `
Compatibility of systems of linear constraints over the set of natural numbers
Criteria of compatibility of a system of linear Diophantine equations, strict inequations,
and nonstrict inequations are considered. Upper bounds for components of a minimal set
of solutions and algorithms of construction of minimal generating sets of solutions for all
types of systems are given. These criteria and the corresponding algorithms for
constructing a minimal supporting set of solutions can be used in solving all the
considered types of systems and systems of mixed types.
`
  const results = rake.generate(textFromThePaper)
  expect(results).toEqual([
    "minimal generating sets",
    "linear diophantine equations",
    "minimal set",
    "minimal supporting set",
    "linear constraints",
    "natural numbers",
    "strict inequations",
    "nonstrict inequations",
    "upper bound",
    "corresponding algorithms",
    "considered types",
    "mixed types"
  ])
})
This test is currently failing with the following output:
Array [
"minimal generating sets",
- "linear diophantine equations",
- "minimal set",
"minimal supporting set",
+ "mixed types",
+ "nonstrict inequations",
+ "Upper bounds",
"linear constraints",
- "natural numbers",
- "strict inequations",
- "nonstrict inequations",
- "upper bound",
- "corresponding algorithms",
- "considered types",
- "mixed types",
+ "set",
+ "compatibility",
+ "system",
+ "solutions",
+ "algorithms",
+ "construction",
+ "systems",
+ "criteria",
+ "considered",
+ "solving",
+ "components",
]
Any thoughts on what could be causing such a difference?
A little further research: there is a Python implementation that appears to get this right. Here is its output, with a little formatting for clarity:
mike@bullseye:~/projects/cloned/RAKE$ python rake.py
[
('minimal generating sets', 8.666666666666666),
('linear diophantine equations', 8.5),
('minimal supporting set', 7.666666666666666),
('minimal set', 4.666666666666666),
('linear constraints', 4.5),
('upper bounds', 4.0),
('natural numbers', 4.0),
('nonstrict inequations', 4.0)
]
[
('minimal generating sets', 8.666666666666666),
('linear diophantine equations', 8.5),
('minimal supporting set', 7.666666666666666),
('minimal set', 4.666666666666666),
('linear constraints', 4.5),
('upper bounds', 4.0),
('natural numbers', 4.0),
('nonstrict inequations', 4.0),
('strict inequations', 4.0),
('mixed types', 3.666666666666667),
('considered types', 3.166666666666667),
('set', 2.0),
('types', 1.6666666666666667),
('considered', 1.5),
('constructing', 1.0),
('solutions', 1.0),
('solving', 1.0),
('system', 1.0),
('compatibility', 1.0),
('systems', 1.0),
('criteria', 1.0),
('construction', 1.0),
('algorithms', 1.0),
('components', 1.0)
]
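The scores in that output follow the metric from the RAKE paper: each word scores deg(word) / freq(word), and a phrase scores the sum of its words. A rough sketch of that step, assuming the candidate phrases have already been extracted (the `scorePhrases` name is hypothetical and not part of node-rake or the Python script):

```javascript
// Hypothetical sketch of RAKE-style scoring, for illustration only.
function scorePhrases(phrases) {
  const freq = new Map();   // how many times each word occurs
  const degree = new Map(); // co-occurrence count, including the word itself
  for (const phrase of phrases) {
    const words = phrase.split(' ');
    for (const word of words) {
      freq.set(word, (freq.get(word) || 0) + 1);
      // Each occurrence adds the length of the phrase it sits in.
      degree.set(word, (degree.get(word) || 0) + words.length);
    }
  }
  const scores = new Map();
  for (const phrase of new Set(phrases)) {
    // Phrase score = sum of deg(word) / freq(word) over its words.
    const total = phrase
      .split(' ')
      .reduce((sum, word) => sum + degree.get(word) / freq.get(word), 0);
    scores.set(phrase, total);
  }
  // Highest-scoring phrases first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

const ranked = scorePhrases([
  'minimal set',
  'minimal generating sets',
  'minimal supporting set',
  'set',
]);
console.log(ranked);
```

For these four phrases, "minimal" has frequency 3 and degree 8, so "minimal generating sets" comes to 8/3 + 3 + 3, roughly 8.67, which lines up with the Python output above. That suggests the scoring is not where node-rake diverges; the candidate phrases are.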
The whole problem is the regex used in generatePhrases. I wrote it quickly and released the library. I was about to look into that regex anyway, and it tops my priority list now!
By the way, thanks for the research! 👍 :)