Slow lemmatization on Windows
qiray opened this issue · 2 comments
Well, I found a hacky workaround that noticeably increases performance when lemmatizing long texts. I added a comment in #14, but I think it's better to create a new issue for this.
So in `mystem.py` we have the function:
```python
def analyze(self, text):
    """
    Make morphology analysis for a text.

    :param text: text to analyze
    :type text: str
    :returns: result of morphology analysis.
    :rtype: dict
    """
    result = []
    for line in text.splitlines():
        try:
            result.extend(self._analyze_impl(line))
        except broken_pipe:
            self.close()
            self.start()
            result.extend(self._analyze_impl(line))
    return result
```
We can change it as follows for the non-`_PIPELINE_MODE` case:
```python
if not _PIPELINE_MODE:
    def analyze(self, text):
        """
        Make morphology analysis for a text.

        :param text: text to analyze
        :type text: str
        :returns: result of morphology analysis.
        :rtype: dict
        """
        result = []
        span = 2000
        lines = text.splitlines()
        lines = [" ".join(lines[i:i+span]) for i in range(0, len(lines), span)]
        for line in lines:
            try:
                result.extend(self._analyze_impl(line))
            except broken_pipe:
                self.close()
                self.start()
                result.extend(self._analyze_impl(line))
        return result
```
This change dramatically increases performance on Windows. Unfortunately, I don't know whether it's fine to send such long "lines" to Mystem, but it works for me.
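The core of the trick is that joining every `span` input lines into one long line turns thousands of per-line pipe round-trips into a handful of batched ones. A minimal sketch of just the batching step, with the hypothetical helper name `chunk_lines` (not part of pymystem3):

```python
def chunk_lines(lines, span=2000):
    """Join every `span` consecutive lines into one long line.

    Each chunk then costs mystem a single pipe round-trip instead
    of `span` of them, which is what makes the loop above fast.
    """
    return [" ".join(lines[i:i + span]) for i in range(0, len(lines), span)]


# Example: 5 lines with span=2 produce 3 chunks.
print(chunk_lines(["a", "b", "c", "d", "e"], span=2))  # ['a b', 'c d', 'e']
```

Note that replacing newlines with spaces may, in principle, change how Mystem segments the text across the original line boundaries, so results are not guaranteed to be byte-identical to the per-line version.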
IMO the real way to fix this (besides running mystem in WSL or switching to Linux) is to
- Use named pipes on Windows (I've experimented with this with no success 😔)
- Use the new pseudoconsole API (I didn't experiment with this at all)
Right now, I decided to run mystem + pymystem3 in WSL (where pipes actually work) and communicate with the Windows side using ZeroMQ over TCP. This adds some overhead, but works fast enough for me.
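The bridge idea above can be sketched without ZeroMQ at all, using only Python's stdlib `socket` module: a one-shot server (standing in for the WSL side) and a client (the Windows side) exchange text over TCP. Everything here is hypothetical illustration, assuming the names `fake_lemmatize`, `start_server`, and `lemmatize_remote`; `fake_lemmatize` just lowercases words and stands in for the real pymystem3 call:

```python
import socket
import threading


def fake_lemmatize(text):
    # Stand-in for the real pymystem3 lemmatizer running inside WSL.
    return " ".join(word.lower() for word in text.split())


def start_server(host="127.0.0.1"):
    """Bind a listening socket and serve exactly one request in a thread."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0: let the OS pick a free port
    srv.listen(1)

    def handle():
        conn, _ = srv.accept()
        with conn:
            # Read until the client closes its write side (EOF).
            parts = []
            while True:
                part = conn.recv(65536)
                if not part:
                    break
                parts.append(part)
            text = b"".join(parts).decode("utf-8")
            conn.sendall(fake_lemmatize(text).encode("utf-8"))
        srv.close()

    thread = threading.Thread(target=handle)
    thread.start()
    return srv.getsockname()[1], thread


def lemmatize_remote(text, host, port):
    """Client side: ship the text over TCP and read the reply back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((host, port))
        cli.sendall(text.encode("utf-8"))
        cli.shutdown(socket.SHUT_WR)  # signal end-of-request
        parts = []
        while True:
            part = cli.recv(65536)
            if not part:
                break
            parts.append(part)
        return b"".join(parts).decode("utf-8")


port, thread = start_server()
result = lemmatize_remote("Мама МЫЛА раму", "127.0.0.1", port)
thread.join()
print(result)  # мама мыла раму
```

ZeroMQ's REQ/REP sockets make the same pattern shorter (no manual framing or EOF handling), but this shows the shape of the round-trip and its overhead: one TCP exchange per batch of text, which is cheap compared to per-line pipe latency on Windows.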
OK, thanks for the investigation. I will close this for now, then.