waseem18/node-rake

Non-ASCII letters not recognised

premasagar opened this issue · 2 comments

For this simple query in Spanish, with empty stopwords (or with stopwords; it doesn't matter):

rake.generate("Cuantos años tienes?", {stopwords: []})

I get the error:

TypeError: Cannot read property 'forEach' of null
    at phraseList.forEach
    at Array.forEach
    at Rake.calculatePhraseScores

If I omit the stopwords, then there is no error, but the word "años" is incorrectly split up:

rake.generate("Cuantos años tienes?")

=> [ 'ños tienes', 'Cuantos' ]

I think the code is treating the ñ as a word-break character. In the second example this splits the word "años". In the first example it leaves the single character ñ as a whole phrase in the function calculatePhraseScores, which leads to the error. The wordList regex seems to accept only 0-9a-z (and the apostrophe) as word characters, so it misses accented and other non-ASCII letters.
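The behaviour described above can be reproduced outside the library. This is a minimal sketch that assumes the word-matching pattern quoted later in this thread is the one in use; the variable names and the fallback to an empty array are illustrative assumptions, not the library's code.

```javascript
// Word-matching pattern as quoted in this thread: ASCII letters, digits and apostrophes only.
const asciiWordPattern = /[,.!?;:/‘’“”]|\b[0-9a-z']+\b/gi;

// Normalise to NFC so that "ñ" is a single code point regardless of how this file is encoded.
const question = "Cuantos años tienes?".normalize("NFC");
const lonePhrase = "ñ".normalize("NFC");

// "años" is split around the ñ, which the pattern does not treat as a word character.
const splitWords = question.match(asciiWordPattern);
console.log(splitWords); // [ 'Cuantos', 'a', 'os', 'tienes', '?' ]

// A phrase consisting only of "ñ" matches nothing, so match() returns null.
const noWords = lonePhrase.match(asciiWordPattern);
console.log(noWords); // null

// Calling forEach on that null would throw the TypeError from the report.
// Hypothetical guard (an assumption, not the library's code): fall back to an empty array.
const safeWords = lonePhrase.match(asciiWordPattern) || [];
safeWords.forEach((word) => console.log(word)); // prints nothing, no error
```

The guard only avoids the crash; the word is still lost, so the pattern itself also needs to accept non-ASCII letters.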

The same thing happens when the text contains any accented vowels: áéíóú

fmalk commented

Late to the party, but on my own fork I'm testing a change to the line indicated by @premasagar:

const wordList = phrase.match(/[,.!?;:/‘’“”]|\b[0-9a-z']+\b/gi);

which becomes this:

phrase.match(/[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu);

This supports any Unicode letter and some Unicode marks, basically making this code work with any language. See Regex Unicode

(Edit: forgot the backslash in \p)
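One caveat with the proposed pattern may be worth checking. In JavaScript, \b is defined in terms of \w, which stays ASCII-based even with the u flag, so a word that starts or ends with a non-ASCII letter can still be truncated. The proposed character class also drops the digits that the original 0-9a-z class accepted. The sketch below compares the proposed pattern with a hypothetical variant that emulates \b using lookarounds over Unicode letters, marks and digits. The names `wordChar`, `boundary` and `unicodeWordPattern` are assumptions for illustration, not part of the library or the fork.

```javascript
// Pattern proposed in the comment above.
const proposedPattern = /[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu;

// Hypothetical variant: a Unicode-aware stand-in for \w and \b.
// wordChar plays the role of \w, widened to Unicode letters, marks and numbers.
const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
// boundary matches where exactly one side is a wordChar, like \b does for \w.
const boundary = `(?:(?<=${wordChar})(?!${wordChar})|(?<!${wordChar})(?=${wordChar}))`;
const unicodeWordPattern = new RegExp(
  `[,.!?;:/‘’“”]|${boundary}[\\p{L}\\p{M}\\p{N}']+${boundary}`,
  "gu"
);

// Normalise to NFC so the accented letters are single code points.
const sample = "café olé".normalize("NFC");

// With \b, the trailing "é" is cut off because it is not an ASCII word character.
console.log(sample.match(proposedPattern)); // [ 'caf', 'ol' ]

// With the lookaround-based boundary, the full words are kept.
console.log(sample.match(unicodeWordPattern)); // [ 'café', 'olé' ]

// The example from the issue is tokenised as whole words plus the punctuation mark.
console.log("Cuantos años tienes?".normalize("NFC").match(unicodeWordPattern));
// [ 'Cuantos', 'años', 'tienes', '?' ]
```

Lookbehind assertions and \p{…} property escapes need a reasonably recent engine (Node.js 10 or later), so this variant would not suit older runtimes.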