nltk/nltk

SnowballStemmer: how to avoid transliteration?

satyrmipt opened this issue · 1 comment

Please look at the code below. Is there a way to keep the substring "sheet" from being transliterated to "шеет" in the second case?

Code:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer(language='russian')
stemmer.stem(""), stemmer.stem("русский текст"), stemmer.stem("english text")

Output:
('', '<шеет>русский текст</шеет>', 'english text')

Oh, I forgot the formatting, so my question came out wrong. Let's try again:

Code:

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer(language='russian')
stemmer.stem("<sheet>"), stemmer.stem("<sheet>русский текст</sheet>"), stemmer.stem("<sheet>english text</sheet>")

Output:

('<sheet>', '<шеет>русский текст</шеет>', '<sheet>english text</sheet>')

Question:
How can I avoid the "sheet" -> "шеет" transliteration?
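
A possible workaround, offered as a sketch rather than a confirmed fix: Snowball stemmers are designed for single words, and the Russian one appears to map the whole input through a Latin-letter representation and back whenever the string contains Cyrillic, which would explain why Latin letters in a mixed string come back as Cyrillic. Assuming that is the cause, one option is to pass only the Cyrillic runs to the stemmer and leave everything else untouched. The helper name `stem_cyrillic_only` and its `stem_word` parameter are hypothetical, not part of NLTK.

```python
import re

# Runs of characters from the basic Cyrillic Unicode block.
# Assumption: this range is enough for the Russian text being processed.
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")


def stem_cyrillic_only(text, stem_word):
    """Apply stem_word to each Cyrillic run in text; copy the rest verbatim.

    stem_word is any callable taking one word and returning its stem,
    for example the bound method SnowballStemmer(language='russian').stem.
    """
    return CYRILLIC_RUN.sub(lambda match: stem_word(match.group(0)), text)
```

With NLTK installed, `stem_cyrillic_only("<sheet>русский текст</sheet>", SnowballStemmer(language='russian').stem)` should then stem "русский" and "текст" one at a time, while the `<sheet>` tags never reach the stemmer and so cannot be transliterated.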