Unexpected results for German text with umlauts
Closed this issue · 4 comments
jbernau commented
The following small sample program demonstrates the problem:
import RAKE
Rake = RAKE.Rake(['da']);
print(Rake.run(u'und da\xdfselbe nochmal'))
It returns [(u'\xdfselbe nochmal', 4.0), (u'und', 1.0)], i.e. the stop word 'da' is matched inside the word 'da\xdfselbe'.
Tested with Python 2.7.6.
The issue seems to be that the regexes for word splitting and stop word removal are not Unicode-aware.
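For anyone trying to reproduce the splitting behaviour without Python 2 at hand, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default, and uses `re.ASCII` to emulate the Python 2 default; the variable names are only illustrative and are not taken from RAKE.

```python
import re

text = 'und daßselbe nochmal'

# re.ASCII restricts \w and \W to ASCII, which mimics what Python 2
# does when no re.UNICODE flag is given (assumption: run on Python 3).
ascii_splitter = re.compile(r'\W+', re.ASCII)

# On Python 3, a str pattern without flags is already Unicode-aware.
unicode_splitter = re.compile(r'\W+')

ascii_words = ascii_splitter.split(text)
unicode_words = unicode_splitter.split(text)

print(ascii_words)    # the ß is treated as a separator
print(unicode_words)  # the ß stays inside the word
```

In the ASCII case the ß counts as a non-word character, so 'daßselbe' falls apart into 'da' and 'selbe'.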
jbernau commented
The following changes solved the issue for me:
diff --git a/RAKE/RAKE.py b/RAKE/RAKE.py
index a147f04..e8263ae 100644
--- a/RAKE/RAKE.py
+++ b/RAKE/RAKE.py
@@ -64,7 +64,7 @@ def separate_words(text):
@param text The text that must be split in to words.
@param min_word_return_size The minimum no of characters a word must have to be included.
"""
- splitter = re.compile('\W+')
+ splitter = re.compile('(?u)\W+')
words = []
for single_word in splitter.split(text):
current_word = single_word.strip().lower()
@@ -89,7 +89,7 @@ def build_stop_word_regex(stop_word_list):
for word in stop_word_list:
word_regex = r'\b' + word + r'(?![\w-])'
stop_word_regex_list.append(word_regex)
- return re.compile('|'.join(stop_word_regex_list), re.IGNORECASE)
+ return re.compile('(?u)'+'|'.join(stop_word_regex_list), re.IGNORECASE)
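To show what the stop word part of the change does, here is a rough standalone sketch of the same pattern shape. It is not the RAKE code itself: the `extra_flags` parameter and the `re.escape` call are my own additions for illustration, and `re.ASCII` is again used on Python 3 to stand in for the Python 2 behaviour without `(?u)`.

```python
import re

def build_stop_word_regex(stop_words, extra_flags=0):
    # Same pattern shape as in the diff: word boundary, the stop word,
    # then a lookahead rejecting a following word character or hyphen.
    # re.escape and extra_flags are hypothetical additions for this sketch.
    patterns = [r'\b' + re.escape(word) + r'(?![\w-])' for word in stop_words]
    return re.compile('|'.join(patterns), re.IGNORECASE | extra_flags)

text = 'und daßselbe nochmal'

# ASCII-only matching: ß is not a word character, so the lookahead
# passes and 'da' is matched inside 'daßselbe'.
ascii_regex = build_stop_word_regex(['da'], re.ASCII)

# Unicode-aware matching: ß is a word character, so 'da' is not matched.
unicode_regex = build_stop_word_regex(['da'])

ascii_phrases = [p.strip() for p in ascii_regex.split(text)]
unicode_phrases = [p.strip() for p in unicode_regex.split(text)]

print(ascii_phrases)
print(unicode_phrases)
```

Splitting on the ASCII-only regex gives the same two fragments as the result reported above, which is why I think the lookahead `(?![\w-])` is where the stop word removal goes wrong.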
jkterry1 commented
Just to humor me before I start looking into this, could you pip uninstall
and pip install it again and confirm the issue persists? A week or so ago
we pushed a fix that should've addressed that exact issue, and we've
historically had problems with pip doing proper upgrades.
On Wed, Nov 29, 2017, 5:34 AM Jürgen Bernau wrote:
The following changes solved the issue for me
--
Thank you for your time,
Justin Terry
jbernau commented
Thanks for the quick reply!
I uninstalled/reinstalled. Problem persists.
I noticed the recent issue. That fix addressed the sentence_delimiters regex. My change fixes the word_splitter and the stop_word_regex, which seem to me to have a similar problem with Unicode.
jkterry1 commented
Awesome, you're right. Can you test your fix in Python 3.x and create a PR I can look at?
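A quick sanity check along these lines could cover the Python 3.x side. It is only a sketch under the assumption that the patterns are str (not bytes); in that case the `(?u)` prefix should compile and change nothing, since Unicode matching is already the default.

```python
import re

text = 'und daßselbe nochmal'

plain = re.compile(r'\W+')          # pattern before the change
prefixed = re.compile(r'(?u)\W+')   # pattern after the change

plain_words = plain.split(text)
prefixed_words = prefixed.split(text)

# On Python 3, compiled str patterns carry re.UNICODE by default.
has_unicode_flag = bool(plain.flags & re.UNICODE)

print(plain_words == prefixed_words, has_unicode_flag)
```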