nlpub/pymystem3

Slow lemmatization on Windows

qiray opened this issue · 2 comments

qiray commented

Well, I found a hacky workaround that improves performance when lemmatizing long texts. I added a comment in #14, but I think it's better to create a new issue for this.

In the file mystem.py we have this function:

    def analyze(self, text):
        """
        Make morphology analysis for a text.
        :param  text:   text to analyze
        :type   text:   str
        :returns:       result of morphology analysis.
        :rtype:         dict
        """

        result = []
        for line in text.splitlines():
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result

We can change it as follows for the case where _PIPELINE_MODE is off:

    if not _PIPELINE_MODE:
        def analyze(self, text):
            """
            Make morphology analysis for a text.
            :param  text:   text to analyze
            :type   text:   str
            :returns:       result of morphology analysis.
            :rtype:         dict
            """

            result = []
            # Number of input lines to join into a single Mystem request.
            span = 2000
            lines = text.splitlines()
            # Merge every `span` lines into one space-separated string.
            lines = [" ".join(lines[i:i+span]) for i in range(0, len(lines), span)]

            for line in lines:
                try:
                    result.extend(self._analyze_impl(line))
                except broken_pipe:
                    # Restart the Mystem process and retry the batch once.
                    self.close()
                    self.start()
                    result.extend(self._analyze_impl(line))
            return result

This change dramatically improves performance on Windows. Unfortunately, I don't know whether it's safe to send such long "lines" to Mystem, but it works for me.
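For anyone who wants to try the batching step in isolation, here is a minimal sketch of just the line-joining part. The helper name `batch_lines` is made up for this illustration and is not part of pymystem3; it only reproduces the list comprehension from the patch above.

```python
def batch_lines(text, span=2000):
    """Join every `span` lines of `text` into one space-separated string.

    Hypothetical helper for illustration; not part of pymystem3.
    """
    lines = text.splitlines()
    return [" ".join(lines[i:i + span]) for i in range(0, len(lines), span)]


# Five input lines with span=2 become three batches,
# so the analyzer would be called three times instead of five.
print(batch_lines("a\nb\nc\nd\ne", span=2))  # ['a b', 'c d', 'e']
```

Note that the original line boundaries are replaced with spaces, so code that depends on where the input lines ended may see different results.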

IMO the real way to fix this (besides running mystem in WSL or switching to Linux) is to do one of the following:

  1. Use named pipes on Windows (I experimented with this without success 😔)
  2. Use the new pseudoconsole API (I haven't experimented with this at all)

Right now, I decided to run mystem + pymystem3 in WSL (where pipes actually work) and communicate with the Windows side using ZeroMQ over TCP. This adds some overhead, but works fast enough for me.
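The actual bridge code is not shown in this thread. As a rough sketch of the same request/reply idea, the example below uses only the standard library's `socket` and `socketserver` modules in place of ZeroMQ, so it runs without extra dependencies. Everything in it is an assumption made for illustration: `fake_lemmatize` stands in for a real pymystem3 call, and the one-JSON-value-per-line protocol, `LemmaHandler`, `request_lemmas`, and `demo` are invented names.

```python
import json
import socket
import socketserver
import threading


def fake_lemmatize(text):
    # Stand-in for a real lemmatizer call on the WSL side (assumption).
    return text.lower().split()


class LemmaHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol (assumed): one JSON string per line in,
        # one JSON list per line out.
        for raw in self.rfile:
            text = json.loads(raw)
            reply = json.dumps(fake_lemmatize(text)) + "\n"
            self.wfile.write(reply.encode("utf-8"))


def request_lemmas(host, port, text):
    # Client side: send one request, read one reply.
    with socket.create_connection((host, port)) as conn:
        conn.sendall((json.dumps(text) + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())


def demo(text):
    # Run server and client in one process on an OS-chosen free port.
    server = socketserver.TCPServer(("127.0.0.1", 0), LemmaHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return request_lemmas("127.0.0.1", port, text)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo("Hello World"))
```

In a real setup the server part would run inside WSL next to mystem and the client part on the Windows side; the single-process `demo` is only there to make the sketch easy to run.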

OK. Thanks for the investigation. I will close this for now.