Unexpected results for German text with umlauts

Question

Closed this issue 7 years ago · 4 comments

jbernau commented 7 years ago

The following small sample program demonstrates the problem:

import RAKE
Rake = RAKE.Rake(['da']);
print(Rake.run(u'und da\xdfselbe nochmal'))

as it returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)]

Tested with Python 2.7.6.

The issue seems to be that the regexes for word splitting and stop word removal are not Unicode-aware.
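
A minimal sketch of the word-splitting half of the problem, not taken from the RAKE source. It is written for Python 3, where `str` patterns are Unicode-aware by default, so it passes `re.ASCII` as a stand-in for what Python 2.7 does when `re.UNICODE` is missing. The variable names are illustrative only.

```python
import re

text = u'und da\xdfselbe nochmal'

# Assumption: re.ASCII on Python 3 stands in for Python 2's default
# (non-Unicode) interpretation of \W.
ascii_splitter = re.compile(r'\W+', re.ASCII)
unicode_splitter = re.compile(r'(?u)\W+')

ascii_words = ascii_splitter.split(text)
unicode_words = unicode_splitter.split(text)

print(ascii_words)    # 'ß' is treated as a separator, so "daßselbe" is cut in two
print(unicode_words)  # 'ß' counts as a word character, so "daßselbe" stays whole
```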

Answer 1 · 2017-11-29T10:34:52.000Z

The following changes solved the issue for me

diff --git a/RAKE/RAKE.py b/RAKE/RAKE.py
index a147f04..e8263ae 100644
--- a/RAKE/RAKE.py
+++ b/RAKE/RAKE.py
@@ -64,7 +64,7 @@ def separate_words(text):
     @param text The text that must be split in to words.
     @param min_word_return_size The minimum no of characters a word must have to be included.
     """
-    splitter = re.compile('\W+')
+    splitter = re.compile('(?u)\W+')
     words = []
     for single_word in splitter.split(text):
         current_word = single_word.strip().lower()
@@ -89,7 +89,7 @@ def build_stop_word_regex(stop_word_list):
     for word in stop_word_list:
         word_regex = r'\b' + word + r'(?![\w-])'
         stop_word_regex_list.append(word_regex)
-    return re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
+    return re.compile('(?u)'+'|'.join(stop_word_regex_list), re.IGNORECASE)
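
The effect of the second hunk can be sketched in isolation. This is a hedged illustration, not the library's code: `sketch_stop_word_regex` is a hypothetical helper that copies the pattern shape from the diff and takes an extra `flags` argument so both behaviours can be compared on Python 3, with `re.ASCII` again standing in for the Python 2 default.

```python
import re

def sketch_stop_word_regex(stop_word_list, flags):
    # Hypothetical helper: same pattern shape as in the diff above,
    # plus a `flags` parameter that exists only for this comparison.
    parts = [r'\b' + word + r'(?![\w-])' for word in stop_word_list]
    return re.compile('|'.join(parts), re.IGNORECASE | flags)

text = u'und da\xdfselbe nochmal'

ascii_regex = sketch_stop_word_regex(['da'], re.ASCII)
unicode_regex = sketch_stop_word_regex(['da'], re.UNICODE)

# Without Unicode matching, 'ß' is not a word character, so the lookahead
# succeeds and the stop word "da" is cut out of "daßselbe".
ascii_result = ascii_regex.sub('|', text)

# With Unicode matching, 'ß' is a word character, the lookahead fails,
# and the text is left untouched.
unicode_result = unicode_regex.sub('|', text)

print(ascii_result)
print(unicode_result)
```

The leftover `ßselbe nochmal` fragment in the first result matches the phrase reported in the original comment.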

Answer 2 · 2017-11-29T17:46:11.000Z

Just to humor me before I start looking into this, could you pip uninstall and pip install it again and confirm the issue persists? A week or so ago we pushed a fix that should've addressed that exact issue, and we've historically had problems with pip doing proper upgrades.

-- Thank you for your time, Justin Terry

Answer 3 · 2017-11-30T13:17:48.000Z

Thanks for the quick reply!

I uninstalled/reinstalled. Problem persists.

I noticed the recent issue. The fix addressed the sentence_delimiters regex. My change fixes the word_splitter and the stop_word_regex, which, as far as I can tell, have a similar problem with Unicode.

Answer 4 · 2017-11-30T20:15:51.000Z

Awesome, you're right. Can you test your fix in Python 3.x and create a PR I can look at?
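
One way such a Python 3 check might look, as a sketch only: on Python 3 `str` patterns already match in Unicode mode, so the `(?u)` prefix is expected to be redundant but harmless there. The snippet compares the patched and unpatched patterns directly instead of calling RAKE, and all variable names are illustrative.

```python
import re

text = 'und da\xdfselbe nochmal'

# Splitter from the first hunk, with and without the (?u) prefix.
default_splitter = re.compile(r'\W+')
patched_splitter = re.compile(r'(?u)\W+')
same_split = default_splitter.split(text) == patched_splitter.split(text)

# Stop word pattern from the second hunk for the single stop word 'da'.
word_regex = r'\b' + 'da' + r'(?![\w-])'
default_stop = re.compile(word_regex, re.IGNORECASE)
patched_stop = re.compile('(?u)' + word_regex, re.IGNORECASE)
stop_hits = (default_stop.findall(text), patched_stop.findall(text))

print(same_split)
print(stop_hits)
```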