waseem18/node-rake

Input text from RAKE paper shows different output

sleepycat opened this issue · 2 comments

Hey @waseem18!
In one of the tests the text includes "'committal, theory" and I noticed that node-rake listed one of the keywords as "'committal theory". Since the comma should end the phrase, this makes me think there is something not quite right in the algorithm. Maybe a regex needs some adjustment.

To dig a little deeper I looked at the paper (I haven't read the whole thing yet), and it actually includes some example text and lists the expected output on pages 161-162. I thought that would make a good test:

    it('produces the output from the paper', () => {
      let textFromThePaper = `
      Compatibility of systems of linear constraints over the set of natural numbers

      Criteria of compatibility of a system of linear Diophantine equations, strict inequations,
      and nonstrict inequations are considered. Upper bounds for components of a minimal set
      of solutions and algorithms of construction of minimal generating sets of solutions for all
      types of systems are given. These criteria and the corresponding algorithms for
      constructing a minimal supporting set of solutions can be used in solving all the
      considered types of systems and systems of mixed types.
      `
      let results = rake.generate(textFromThePaper)
      expect(results).toEqual([
        "minimal generating sets",
        "linear diophantine equations",
        "minimal set",
        "minimal supporting set",
        "linear constraints",
        "natural numbers",
        "strict inequations",
        "nonstrict inequations",
        "upper bound",
        "corresponding algorithms",
        "considered types",
        "mixed types"
      ])
    })

This test is currently failing with the following output:

     Array [
       "minimal generating sets",
    -  "linear diophantine equations",
    -  "minimal set",
       "minimal supporting set",
    +  "mixed types",
    +  "nonstrict inequations",
    +  "Upper bounds",
       "linear constraints",
    -  "natural numbers",
    -  "strict inequations",
    -  "nonstrict inequations",
    -  "upper bound",
    -  "corresponding algorithms",
    -  "considered types",
    -  "mixed types",
    +  "set",
    +  "compatibility",
    +  "system",
    +  "solutions",
    +  "algorithms",
    +  "construction",
    +  "systems",
    +  "criteria",
    +  "considered",
    +  "solving",
    +  "components",
     ]

Any thoughts on what could be causing such a difference?

After a little further research: there is a Python implementation that appears to get it right. Its output, with a little formatting for clarity:

    mike@bullseye:~/projects/cloned/RAKE$ python rake.py
    [
      ('minimal generating sets', 8.666666666666666),
      ('linear diophantine equations', 8.5),
      ('minimal supporting set', 7.666666666666666),
      ('minimal set', 4.666666666666666),
      ('linear constraints', 4.5),
      ('upper bounds', 4.0),
      ('natural numbers', 4.0),
      ('nonstrict inequations', 4.0)
    ]
    [
      ('minimal generating sets', 8.666666666666666),
      ('linear diophantine equations', 8.5),
      ('minimal supporting set', 7.666666666666666),
      ('minimal set', 4.666666666666666),
      ('linear constraints', 4.5),
      ('upper bounds', 4.0),
      ('natural numbers', 4.0),
      ('nonstrict inequations', 4.0),
      ('strict inequations', 4.0),
      ('mixed types', 3.666666666666667),
      ('considered types', 3.166666666666667),
      ('set', 2.0),
      ('types', 1.6666666666666667),
      ('considered', 1.5),
      ('constructing', 1.0),
      ('solutions', 1.0),
      ('solving', 1.0),
      ('system', 1.0),
      ('compatibility', 1.0),
      ('systems', 1.0),
      ('criteria', 1.0),
      ('construction', 1.0),
      ('algorithms', 1.0),
      ('components', 1.0)
    ]
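
For anyone wondering where those numbers come from, here is a minimal sketch of the scoring step, assuming the deg(w) / freq(w) word score described in the RAKE paper, with a phrase scored as the sum of its word scores. The candidate list is hand-picked from the abstract for illustration (it is not produced by node-rake), and `scoreWords` / `scorePhrases` are hypothetical names, not part of either library.

```javascript
// Hand-picked candidate phrases from the abstract (an assumption for
// illustration; node-rake did not produce this list).
const candidatePhrases = [
  ['linear', 'constraints'],
  ['set'],
  ['natural', 'numbers'],
  ['linear', 'diophantine', 'equations'],
  ['minimal', 'set'],
  ['minimal', 'generating', 'sets'],
  ['minimal', 'supporting', 'set'],
]

// Hypothetical helper: score each word as deg(w) / freq(w).
// freq(w) counts the phrases a word appears in; deg(w) adds up the
// lengths of those phrases (so a word co-occurs with itself too).
function scoreWords(phrases) {
  const freq = new Map()
  const deg = new Map()
  for (const phrase of phrases) {
    for (const word of phrase) {
      freq.set(word, (freq.get(word) || 0) + 1)
      deg.set(word, (deg.get(word) || 0) + phrase.length)
    }
  }
  const scores = new Map()
  for (const [word, f] of freq) {
    scores.set(word, deg.get(word) / f)
  }
  return scores
}

// Hypothetical helper: a phrase's score is the sum of its word scores.
// Returns [phrase, score] pairs sorted from highest to lowest score.
function scorePhrases(phrases) {
  const wordScores = scoreWords(phrases)
  const totals = new Map()
  for (const phrase of phrases) {
    const total = phrase.reduce((sum, word) => sum + wordScores.get(word), 0)
    totals.set(phrase.join(' '), total)
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1])
}

const ranked = scorePhrases(candidatePhrases)
console.log(ranked)
```

With this list, "minimal" appears in three phrases of lengths 2, 3 and 3, so its score is 8 / 3, and "minimal generating sets" comes out at roughly 8.67, in line with the Python output. The point is that the scores only match when the candidate phrases are split correctly in the first place.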

The problem lies entirely in the regex used in generatePhrases. I wrote it quickly and released the library, and I was about to look into that regex. It tops my priority list now!

By the way, thanks for the research! 👍 :)
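
A rough sketch of the kind of splitting discussed above, in case it helps: punctuation is treated as a hard boundary first, and stop words then break each fragment into candidate phrases. This is only an illustration under stated assumptions. `splitIntoPhrases` is a hypothetical stand-in and not node-rake's actual generatePhrases, and the stop list is a tiny made-up subset rather than a real one.

```javascript
// Tiny stop list for illustration only (an assumption; a real RAKE
// implementation would load a much larger list).
const stopWords = new Set([
  'of', 'and', 'are', 'the', 'for', 'a', 'in', 'can', 'be', 'all', 'these', 'over',
])

// Hypothetical stand-in for generatePhrases, not node-rake's actual code.
function splitIntoPhrases(text, stops) {
  const phrases = []
  // Punctuation is a hard boundary, so split on it before looking at words.
  for (const fragment of text.toLowerCase().split(/[.,;:!?()"\n]+/)) {
    let current = []
    for (const word of fragment.split(/\s+/).filter(Boolean)) {
      if (stops.has(word)) {
        // A stop word ends the phrase collected so far.
        if (current.length > 0) phrases.push(current.join(' '))
        current = []
      } else {
        current.push(word)
      }
    }
    // Flush whatever is left at the end of the fragment.
    if (current.length > 0) phrases.push(current.join(' '))
  }
  return phrases
}

const phrases = splitIntoPhrases(
  'Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.',
  stopWords
)
console.log(phrases)
```

Because the comma is consumed as a boundary before any words are joined, "equations" and "strict" can never end up in the same phrase, and the same goes for "'committal, theory".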