Memory Error for CREC parser
For some dates, a memory error occurs when parsing that day's Congressional Record files:
Traceback (most recent call last):
  File "/mnt/capitolwords/capitolweb/parser/management/commands/run_crec_parser.py", line 85, in handle
    es_doc = crec.to_es_doc()
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 383, in to_es_doc
    segments=self.segments,
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 324, in segments
    sents = (sent.string for sent in self.textacy_text.spacy_doc.sents)
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 35, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/mnt/capitolwords/capitolweb/parser/crec_parser.py", line 231, in textacy_text
    return textacy.Doc(SPACY_NLP(text))
  File "/usr/local/lib/python3.5/dist-packages/spacy/language.py", line 341, in __call__
    doc = proc(doc)
  File "nn_parser.pyx", line 337, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 400, in spacy.syntax.nn_parser.Parser.parse_batch
  File "nn_parser.pyx", line 725, in spacy.syntax.nn_parser.Parser.get_batch_model
  File "nn_parser.pyx", line 84, in spacy.syntax.nn_parser.precompute_hiddens.__init__
  File "/usr/local/lib/python3.5/dist-packages/spacy/_ml.py", line 148, in begin_update
    self.W.reshape((self.nF*self.nO*self.nP, self.nI)).T)
MemoryError
Can you post the command-line arguments this failed on, or any dates this bug occurs for?
Sure thing. This fails for a handful of dates so far. Among them: 2016-09-13 and 2016-09-12. The command:
python3 manage.py run_crec_parser --start_date=2016-09-11 --end_date=2016-09-13
I wasn't able to reproduce this on my laptop, but that has 16 GB of memory, so it's possible that the days that trigger this error just have a larger-than-normal amount of text to process. So I would first try running this on a machine with more RAM, if you haven't already done so. Alternatively, you can try running it with an older version of spaCy (pip install "spacy<2.0", with the quotes so the shell doesn't treat < as a redirect), as this may be related to an issue in the newer version (nothing we're doing in the Capitol Words code requires any newer features).
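If the cause really is the amount of text in a single day's record, a third option might be to hand the text to the NLP pipeline in smaller pieces. The snippet below is only a rough sketch of that idea, not something crec_parser.py does today: chunk_text and its max_chars default are hypothetical names and values made up for illustration, and whether chunking actually avoids the MemoryError is untested.

```python
def chunk_text(text, max_chars=100000):
    """Split text into chunks of at most max_chars characters.

    Breaks on blank lines (paragraph boundaries) where possible; a single
    paragraph longer than max_chars is hard-split. Hypothetical helper,
    not part of the capitolwords code.
    """
    chunks = []
    current = []       # paragraphs collected for the chunk being built
    current_len = 0    # length of "\n\n".join(current)
    for para in text.split("\n\n"):
        # Hard-split oversized paragraphs; keep empty paragraphs as "".
        pieces = [para[i:i + max_chars]
                  for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            # Account for the "\n\n" separator if the chunk is non-empty.
            added = len(piece) + (2 if current else 0)
            if current and current_len + added > max_chars:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
                added = len(piece)
            current.append(piece)
            current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Possible use inside the parser (assumption, untested against the real code):
#     docs = [SPACY_NLP(chunk) for chunk in chunk_text(text)]
#     sents = (sent.string for doc in docs for sent in doc.sents)
```

Sentences that straddle a hard split would be cut in two, so this trades a little segmentation accuracy for a lower peak input size per call.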
@will-horning Ok, thanks! I'll try both of those options.