fabianvf/python-rake

Unexpected results for german text with umlauts

Closed this issue · 4 comments

The following small sample program demonstrates the problem:

import RAKE
# 'da' is the only stop word; u'\xdf' is the German eszett (ß)
Rake = RAKE.Rake(['da'])
print(Rake.run(u'und da\xdfselbe nochmal'))

as it returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)], i.e. u'da\xdfselbe' is split at the \xdf (ß).

Tested with Python 2.7.6.

The issue seems to be that the regexes for word splitting and stop word removal are not Unicode-aware.
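
Here is a minimal illustration of the diagnosis, independent of RAKE (a sketch, assuming Python 2.7, where \w is ASCII-only unless re.UNICODE is set):

import re

text = u'und da\xdfselbe nochmal'

# Without re.UNICODE, \W in Python 2 treats everything outside
# [a-zA-Z0-9_] as a separator, so the \xdf (ß) splits the word.
print(re.split(r'\W+', text))
# [u'und', u'da', u'selbe', u'nochmal']

# With re.UNICODE (what the inline (?u) flag turns on), \w covers
# Unicode letters and u'da\xdfselbe' stays intact.
print(re.split(r'\W+', text, flags=re.UNICODE))
# [u'und', u'da\xdfselbe', u'nochmal']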

The following changes solved the issue for me:

diff --git a/RAKE/RAKE.py b/RAKE/RAKE.py
index a147f04..e8263ae 100644
--- a/RAKE/RAKE.py
+++ b/RAKE/RAKE.py
@@ -64,7 +64,7 @@ def separate_words(text):
     @param text The text that must be split in to words.
     @param min_word_return_size The minimum no of characters a word must have to be included.
     """
-    splitter = re.compile('\W+')
+    splitter = re.compile('(?u)\W+')
     words = []
     for single_word in splitter.split(text):
         current_word = single_word.strip().lower()
@@ -89,7 +89,7 @@ def build_stop_word_regex(stop_word_list):
     for word in stop_word_list:
         word_regex = r'\b' + word + r'(?![\w-])'
         stop_word_regex_list.append(word_regex)
-    return re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
+    return re.compile('(?u)'+'|'.join(stop_word_regex_list), re.IGNORECASE)
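
For what it's worth, an equivalent way to write the same fix is to pass the flag argument instead of prepending (?u). This is a sketch only, with the stop word 'da' from the example above hard-coded:

import re

# re.UNICODE is exactly what the inline (?u) flag enables; passing it
# as a compile flag leaves the pattern strings untouched.
splitter = re.compile(r'\W+', re.UNICODE)

# The same flag on the stop word pattern, shown here for 'da' only:
stop_word_regex_list = [r'\bda(?![\w-])']
stop_word_pattern = re.compile('|'.join(stop_word_regex_list),
                               re.IGNORECASE | re.UNICODE)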

Thanks for the quick reply!

I uninstalled/reinstalled. Problem persists.

I noticed the recent issue. That fix addressed the sentence_delimiters regex; my change fixes the word_splitter and the stop_word_regex, which, to me, have the same problem with Unicode.

Awesome, you're right. Can you test your fix in Python 3.x and create a PR I can look at?
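
For reference, (?u) is the default for str patterns on Python 3, so the patch should behave identically there. A quick sanity check along these lines (a sketch, assuming Python 3):

import re

# On Python 3, \w and \W are Unicode-aware by default for str
# patterns, so the '(?u)' inline flag is redundant but harmless.
words = re.split(r'\W+', 'und daßselbe nochmal')
assert words == ['und', 'daßselbe', 'nochmal']
assert re.split(r'(?u)\W+', 'und daßselbe nochmal') == words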