Non-ASCII letters not recognised
premasagar opened this issue · 2 comments
For this simple query in Spanish, with empty stopwords (or with stopwords; it doesn't matter):
rake.generate("Cuantos años tienes?", {stopwords: []})
I get the error:
TypeError: Cannot read property 'forEach' of null
at phraseList.forEach
at Array.forEach
at Rake.calculatePhraseScores
If I omit the stopwords, then there is no error, but the word "años" is incorrectly split up:
rake.generate("Cuantos años tienes?")
=> [ 'ños tienes', 'Cuantos' ]
I think the code is treating the ñ as a word-break character, leading to the word being split in the second example, and leading to the single character ñ being used as a whole phrase in the function calculatePhraseScores, which leads to the error in the first example. The wordList regex seems to be looking only for 0-9a-z as acceptable word characters, which will be incomplete.
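A minimal sketch of the suspected failure mode follows. The asciiWord pattern is a hypothetical stand-in for an ASCII-only word regex, not the library's actual code, and the `?? []` guard is only one possible way to avoid the crash:

```javascript
// Hypothetical stand-in for an ASCII-only word pattern; NOT the library's actual regex.
const asciiWord = /[0-9a-z]+/gi;

// NFC keeps "ñ" as the single code point U+00F1.
const text = "Cuantos años tienes?".normalize("NFC");

// "ñ" is outside 0-9a-z, so "años" is cut into two tokens.
const tokens = text.match(asciiWord);
console.log(tokens); // [ 'Cuantos', 'a', 'os', 'tienes' ]

// A phrase made only of "ñ" has no match at all, and match() with the g flag
// returns null (not an empty array) in that case.
const loneLetter = "ñ".normalize("NFC").match(asciiWord);
console.log(loneLetter); // null

// Calling forEach directly on that null throws a TypeError; "?? []" is one possible guard.
const safeTokens = loneLetter ?? [];
safeTokens.forEach((word) => console.log(word)); // no iterations, no error
```

The guard only hides the crash. The split of "años" still needs a word pattern that accepts non-ASCII letters.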
The same thing happens when the text contains accented vowels, i.e. áéíóú.
Late to the party, but on my own fork I'm testing changing this line (line 43 in 123894e) to:
phrase.match(/[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu);
This supports any Unicode letter (\p{L}) and combining mark (\p{M}), basically making this code work with any language. See Regex Unicode
(Edit: forgot the backslash in \p)
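For anyone trying this out, here is a small sketch that runs the pattern above on two Spanish phrases. The tokenize helper and the withoutBoundaries variant are illustrative assumptions, not part of the library or the fork. One thing that may be worth checking: in JavaScript, \b is still defined in terms of ASCII word characters even with the u flag, so a word that starts or ends with an accented letter can be truncated.

```javascript
// Pattern from the comment above, copied as posted.
const proposed = /[,.!?;:/‘’“”]|\b[\p{L}\p{M}']+\b/giu;

// Hypothetical variant without \b. The greedy + already consumes a whole run of
// letters, but unlike the original it will also match letters adjacent to digits.
const withoutBoundaries = /[,.!?;:/‘’“”]|[\p{L}\p{M}']+/gu;

// Illustrative helper: normalise to NFC so accented letters are single code points,
// and fall back to an empty array when nothing matches.
const tokenize = (text, pattern) => text.normalize("NFC").match(pattern) ?? [];

console.log(tokenize("Cuantos años tienes?", proposed));
// [ 'Cuantos', 'años', 'tienes', '?' ]

// "está" ends in a non-ASCII letter, so the trailing \b cannot match after "á".
console.log(tokenize("Dónde está?", proposed));
// [ 'Dónde', 'est', '?' ]

console.log(tokenize("Dónde está?", withoutBoundaries));
// [ 'Dónde', 'está', '?' ]
```

Whether dropping \b is acceptable depends on how the rest of the code treats digits and underscores next to letters, so treat the variant as a starting point rather than a drop-in fix.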