Error when trying to use `nlp.pipe` with `n_process` > 1
DayalStrub opened this issue · 3 comments
Intro
I am getting `TypeError: can not serialize 'BaseTextRank' object` when trying to use spaCy's multiprocessing in `nlp.pipe` with a `textrank` pipeline component.
Sorry if this is a known/expected feature/limitation - I couldn't find anything by searching the repo. I generally find (spaCy's) multiprocessing a bit temperamental anyhow, but this seems to just not work.
PS. thanks for all the great work on the package!
Environment
Ubuntu 18.X (AWS DL AMI), Python 3.8 (via conda/mamba), pytextrank installed via pip, through conda - do let me know if you need more info.
Reproducible example - hopefully
import spacy
import pytextrank
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);
txt = """
The Old Testament of the King James Bible
The First Book of Moses: Called Genesis
1:1 In the beginning God created the heaven and the earth.
1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.
1:3 And God said, Let there be light: and there was light.
1:4 And God saw the light, that it was good: and God divided the light
from the darkness.
1:5 And God called the light Day, and the darkness he called Night.
And the evening and the morning were the first day.
...
"""
data = []
for i in range(50):
    data.append((txt, {"doc_id": i}))
keys = []
# NOTE: n_process=-1 throws the error, then hangs; it works with n_process=1
for doc, context in nlp.pipe(data, as_tuples=True, n_process=-1):
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    keys.append(out)
# pd.DataFrame(keys).head()
keys
Thank you @DayalStrub -
This is good. I don't recall that we've had any cases using the multi-processor option in spaCy previously.
To confirm, when running `Language.pipe()` with `n_process` set to anything other than its default value of 1,
import pytextrank
import spacy
import en_core_web_sm
txt = """To return to my trees. This, as you know, is something that I do often. But sometimes, I even surprise myself with how powerful the pull of trees can be. Take this latest tree. I walked out onto this huge expanse of hard sand and then headed directly across to where there was this amazing old fir tree whose growth seems to have split the sandstone, its top is blown off, and its roots getting salted with every winter storm. I could not easily capture its grandness in one image so I pieced a few together and relied mostly on a short video for painting references. After all the little plein air paintings, this is my first studio painting from Hornby Island. Well, let’s see what we have shall we?"""
nlp = en_core_web_sm.load()
nlp.add_pipe("textrank", last=True);
doc = nlp(txt)
data = [
    (txt, {"doc_id": i})
    for i in range(5)
]
## `n_process=-1` throws exception
## `n_process=1` works
for doc, context in nlp.pipe(data, as_tuples=True, n_process=1):
    out = {"doc_id": context["doc_id"], "keyphrases": [(phr.text, phr.rank) for phr in doc._.phrases]}
    print(out)
Then `pytextrank` causes an exception to be thrown:
Process Process-1:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in _apply_pipes
sender.send([doc.to_bytes() for doc in docs])
File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/language.py", line 2007, in <listcomp>
sender.send([doc.to_bytes() for doc in docs])
File "spacy/tokens/doc.pyx", line 1237, in spacy.tokens.doc.Doc.to_bytes
File "spacy/tokens/doc.pyx", line 1296, in spacy.tokens.doc.Doc.to_dict
File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/spacy/util.py", line 1134, in to_dict
serialized[key] = getter()
File "spacy/tokens/doc.pyx", line 1293, in spacy.tokens.doc.Doc.to_dict.lambda18
File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
return msgpack.dumps(data, use_bin_type=True)
File "/Users/paco/src/pytextrank/venv/lib/python3.7/site-packages/srsly/msgpack/__init__.py", line 55, in packb
return Packer(**kwargs).pack(o)
File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'BaseTextRank' object
So we need to make the `pytextrank` base class, and its per-algorithm subclasses, serializable.
This would also be needed if we ever wanted to run distributed, say on a Ray cluster.
This appears to be happening in several cases in spaCy, and some of the GH issues point to using srsly (https://github.com/explosion/srsly) to resolve serialization issues.
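To illustrate the failure mode: the traceback shows the error surfacing inside `Doc.to_bytes()` while msgpack packs the data. That suggests a component object ended up among the data serialized with each `Doc`. Below is a minimal sketch of the general idea, not the actual fix. It uses the standard library's `json` as a stand-in for msgpack, since both reject arbitrary Python objects with a `TypeError`. `FakeTextRank`, `to_plain` and `try_serialize` are hypothetical names made up for this example, and the phrase scores are invented.

```python
import json


class FakeTextRank:
    """Hypothetical stand-in for a component object such as BaseTextRank."""

    def __init__(self, phrases):
        self.phrases = phrases

    def to_plain(self):
        # Reduce the object to built-in types that a serializer can handle.
        return [{"text": text, "rank": rank} for text, rank in self.phrases]


def try_serialize(user_data):
    """Return the serialized string, or the error message if serialization fails."""
    try:
        return json.dumps(user_data)
    except TypeError as err:
        return f"TypeError: {err}"


# Invented example values, only for illustration.
component = FakeTextRank([("old fir tree", 0.12), ("hard sand", 0.08)])

# Storing the object itself fails, analogous to the msgpack error in the traceback.
failed = try_serialize({"textrank": component})

# Storing only plain data (lists, dicts, strings, floats) round-trips cleanly.
ok = try_serialize({"textrank": component.to_plain()})
restored = json.loads(ok)

print(failed)
print(restored["textrank"][0]["text"])
```

If the same pattern applies here, one option would be for the component to attach only plain results to each `Doc` rather than a reference to itself. Whether that fits how `pytextrank` currently exposes `doc._.phrases` is an open question.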
Any update on this bug?
Happy to help if needed.