nlpub/pymystem3

Slow lemmatization on Windows

QtRoS opened this issue · 3 comments

QtRoS commented

This code in Jupyter notebook:

%%time
from pymystem3 import Mystem
m = Mystem()
text = "Красивая мама красиво мыла раму"
for i in range(10):
    lemmas = m.lemmatize(text)

Output: Wall time: 9.27 s
Every iteration takes almost 1 second. This is dramatically different from the results I observe on Ubuntu.
My configuration: Windows 7 x64, i7 4770 3.6 GHz, 16 GB RAM
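
To see whether the time is spread evenly across calls or concentrated in the first one, each call can be timed separately. This is only a minimal sketch: `time_each_call` is a hypothetical helper, not part of pymystem3, and the pymystem3 usage is left commented out because it needs the package and the mystem binary.

```python
import time


def time_each_call(func, arg, repeats=10):
    """Return the wall-clock duration, in seconds, of each func(arg) call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return durations


# Hypothetical usage with pymystem3:
# from pymystem3 import Mystem
# m = Mystem()
# for seconds in time_each_call(m.lemmatize, "Красивая мама красиво мыла раму"):
#     print(f"{seconds:.3f} s")
```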

Hello,

This seems to be related to #11. Unfortunately, I am not sure how to make it faster on Windows due to lack of knowledge of this platform. Any patches that would improve it are welcome.

qiray commented

Well, I found a hacky workaround that improves performance when lemmatizing long texts. The file mystem.py contains this function:

    def analyze(self, text):
        """
        Make morphology analysis for a text.
        :param  text:   text to analyze
        :type   text:   str
        :returns:       result of morphology analysis.
        :rtype:         dict
        """

        result = []
        for line in text.splitlines():
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result

We can change it as follows for the case where _PIPELINE_MODE is not set:

if not _PIPELINE_MODE:
    def analyze(self, text):
        """
        Make morphology analysis for a text.
        :param  text:   text to analyze
        :type   text:   str
        :returns:       result of morphology analysis.
        :rtype:         dict
        """

        result = []
        span = 2000
        lines = text.splitlines()
        lines = [" ".join(lines[i:i+span]) for i in range(0, len(lines), span)]

        for line in lines:
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result

This change dramatically increases performance on Windows. Unfortunately, I don't know whether it is safe to send such long "lines" to Mystem, but it works for me.
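
The same batching idea can probably be applied from user code without patching mystem.py. The sketch below is an assumption-laden illustration, not the library's API: `batch_lines` and `lemmatize_batched` are hypothetical helpers, and `lemmatize` is any callable that takes a string and returns a list, such as `Mystem().lemmatize`.

```python
def batch_lines(text, span=2000):
    """Join every `span` consecutive lines of `text` into one space-separated string."""
    lines = text.splitlines()
    return [" ".join(lines[i:i + span]) for i in range(0, len(lines), span)]


def lemmatize_batched(lemmatize, text, span=2000):
    """Call `lemmatize` once per batch of lines instead of once per line."""
    result = []
    for chunk in batch_lines(text, span):
        result.extend(lemmatize(chunk))
    return result


# Hypothetical usage with pymystem3 (needs the package and the mystem binary):
# from pymystem3 import Mystem
# m = Mystem()
# lemmas = lemmatize_batched(m.lemmatize, long_text)
```

Note that joining lines with spaces discards the original line boundaries, so the output may not match the line-by-line result exactly.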

Hi, is this library still being maintained? I think the slowdown on Windows is a critical problem. Why not apply, for example, @qiray's solution?