explosion/sense2vec

Object too large error in preprocessing script

ahalterman opened this issue · 4 comments

I've been getting a "bytes object is too large" error when processing a large-ish number of documents with the 01_parse.py script. Creating several smaller doc_bin objects resolves the issue (rough sketch after the traceback). Full error:

ahalt@xxxxxxxx:~/sense2vec$ python sense2vec/scripts/01_parse.py hindu_complete.txt docbins en_core_web_sm -n 10
ℹ Using spaCy model en_core_web_sm
Preprocessing text...
Docs: 267103 [1:00:38, 73.42/s]
✔ Processed 267103 docs
Traceback (most recent call last):
  File "sense2vec/scripts/01_parse.py", line 47, in <module>
    plac.call(main)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "sense2vec/scripts/01_parse.py", line 39, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/spacy/tokens/_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
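
Roughly, the workaround looks like this: flush the DocBin to a new .spacy file every N docs instead of serializing everything at once. The 100,000-doc chunk size, the part_N.spacy naming, and the attrs list are just what I used, not anything the script requires:

import spacy
from pathlib import Path
from spacy.tokens import DocBin

def parse_in_chunks(in_file, out_dir, model="en_core_web_sm", n_process=10,
                    chunk_size=100_000):
    # Sketch of the workaround: serialize many small DocBins instead of one
    # giant one. chunk_size and the part_N.spacy naming are arbitrary.
    nlp = spacy.load(model)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = (line.strip() for line in open(in_file, encoding="utf8") if line.strip())
    doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
    part = 0
    for i, doc in enumerate(nlp.pipe(texts, n_process=n_process), start=1):
        doc_bin.add(doc)
        if i % chunk_size == 0:  # flush before the serialized blob gets too big
            (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())
            doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
            part += 1
    if len(doc_bin) > 0:  # write out whatever is left over
        (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())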

If you end up splitting the output files in the 01_parse.py script, you can easily run the preprocessing script over each of them using GNU parallel:

find docbins/ -name '*.spacy' | parallel --jobs 10 python sense2vec/scripts/02_preprocess.py {} s2v_format/ en_core_web_sm 

I had the same problem; see the error message below. After some additional preprocessing, however, I no longer get the "bytes object is too large" ValueError. The steps were: (1) remove duplicate sentences, (2) strip trailing whitespace, (3) drop sentences longer than 2,520 characters, and (4) drop sentences shorter than 11 characters (rough sketch after the traceback). Together these cut my dataset by about 74%, from 7,487,357 sentences to 1,978,295, so I can't say which step actually fixed the problem.

~/sense2vec$ python scripts/01_parse.py ../corpus2.txt ../corpus_parsed2 en_core_web_lg --n 14
✔ Created output directory ../corpus_parsed2
ℹ Using spaCy model en_core_web_lg
Preprocessing text...
Docs: 7487357 [57:44, 2161.41/s]
✔ Processed 7487357 docs
Traceback (most recent call last):
  File "01_parse.py", line 45, in <module>
    plac.call(main)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "01_parse.py", line 37, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\spacy\tokens\_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
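
In case it helps, the cleanup amounts to something like this (assumes one sentence per line; the function name is made up and the 11/2,520 character cut-offs are just the values I used):

def clean_sentences(in_path, out_path, min_len=11, max_len=2520):
    # Sketch of the four steps: dedupe, strip whitespace, and drop sentences
    # outside the [min_len, max_len] character range. Keeps every unique
    # sentence in memory, which is fine for a few million short sentences.
    seen = set()
    with open(in_path, encoding="utf8") as fin, open(out_path, "w", encoding="utf8") as fout:
        for line in fin:
            sent = line.strip()                        # (2) strip surrounding whitespace
            if not (min_len <= len(sent) <= max_len):  # (3) + (4) drop too long / too short
                continue
            if sent in seen:                           # (1) drop exact duplicates
                continue
            seen.add(sent)
            fout.write(sent + "\n")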

How big is each of your documents? Is each one a sentence, or is it more like a news article? Mine are around 500 words / 3,000-4,000 characters, so if yours are sentence-length that could be what keeps you under the limit. (That would also explain why you're getting over 2,000 docs/second with 14 processes while I'm getting about 75/second with 10.)

In general, though, it's not ideal to have to trim the corpus just to stay under the serialization limit. I'm about to train vectors on a much larger corpus of text, so I'll see how the splitting solution in #103 works.
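
For what it's worth, the limit appears to be in msgpack itself rather than RAM: a single bytes object in the serialized payload can't exceed 2**32 - 1 bytes (about 4 GiB), which is why one huge DocBin blows up while several smaller ones are fine. A contrived reproduction (warning: it allocates roughly 4 GiB just to trigger the error):

import srsly

# msgpack's bin format stores the payload length in 32 bits, so anything over
# 2**32 - 1 bytes can't be packed; this is the same ValueError that
# doc_bin.to_bytes() hits on a very large DocBin.
too_big = bytes(2**32)                    # one byte past the representable maximum
srsly.msgpack_dumps({"tokens": too_big})  # ValueError: bytes object is too large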

How big is each of your documents? Is each one a sentence, or is it more like a news article?

Each of my documents is a single sentence, about 120 characters on average, so I agree with your assessment.