Decoding from file-like object reads too much data (C extension only)
vtermanis opened this issue · 3 comments
Problem
When decoding from a file-like object (using `load()` via the C extension), up to BUFFER_FP_SIZE bytes are buffered when `read()`ing from it. This means data is likely to be consumed past the end of the encoded ubjson block, leaving it unavailable either for decoding additional ubjson blocks or for other purposes.
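The over-read can be illustrated with a plain BytesIO stream (the byte values and the BUFFER_FP_SIZE figure here are placeholders for illustration, not real UBJSON or the library's actual constant):

```python
from io import BytesIO

BUFFER_FP_SIZE = 512  # assumed value, for illustration only

# Two adjacent encoded blocks in one stream (placeholder bytes, not real UBJSON)
stream = BytesIO(b'<doc1>' + b'<doc2>')

# A buffering decoder requests up to BUFFER_FP_SIZE bytes at once, pulling in
# doc2's bytes even though only doc1 is being decoded...
chunk = stream.read(BUFFER_FP_SIZE)
assert b'<doc2>' in chunk

# ...so nothing is left in the stream for a second load() call:
assert stream.read() == b''
```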
Potential solutions
- `seek()` back to recover unused read data
  - File-like objects should not need to support `seek()`
- Don't buffer at all and only read as much data as needed (often single bytes)
  - Underlying file-like object might have its own buffering already
  - Pure-Python version already does this
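The second option can be sketched as a thin wrapper whose `read()` never consumes more than the decoder asks for (the class and names here are illustrative, not the library's API):

```python
from io import BytesIO

class ExactReader:
    """Sketch: request exactly the needed bytes, leaving the rest untouched."""

    def __init__(self, fp):
        self._fp = fp

    def read(self, size):
        # Loop because read() on e.g. a socket may return fewer bytes than asked
        parts = []
        remaining = size
        while remaining > 0:
            chunk = self._fp.read(remaining)
            if not chunk:
                break  # EOF
            parts.append(chunk)
            remaining -= len(chunk)
        return b''.join(parts)

# Reading a one-byte type marker leaves the rest of the stream available:
reader = ExactReader(BytesIO(b'Sabc'))
assert reader.read(1) == b'S'
assert reader._fp.read() == b'abc'
```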
Test case
```python
import ubjson
from io import BytesIO

# Only applies to C extension
assert ubjson.EXTENSION_ENABLED

sample_input = 'something to encode'
output = BytesIO()

# Produce output with multiple serialised ubjson "documents"
for _ in range(10):
    ubjson.dump(sample_input, output)

# Decode all of the documents
output.seek(0)
for i in range(10):
    print(i)
    assert sample_input == ubjson.load(output)
```
Expected result
All 10 documents are decoded.
Actual result
0
1
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
ubjson.decoder.DecoderException: Insufficient input (Type marker)
Initial solution - remove buffering completely
Overview
Read the exact requested byte count from the file-like object's `read()` method. This results in significantly more calls to said method (e.g. each type marker results in a one-byte read) and leads to considerable performance degradation.
Potential alternative
If the file-like object is `seekable()` (e.g. a file on the local filesystem), enable buffering; otherwise (e.g. a socket) use no buffering.
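`seekable()` (available on Python 3 `io` objects) already distinguishes the two cases; a quick check, using a socketpair to stand in for a network stream:

```python
import socket
from io import BytesIO

# An in-memory (or on-disk) stream can be rewound, so buffering is safe:
assert BytesIO().seekable() is True

# A socket's file wrapper cannot, so it would get the unbuffered path:
a, b = socket.socketpair()
sock_file = a.makefile('rb')
assert sock_file.seekable() is False
sock_file.close()
a.close()
b.close()
```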
General testing
- full suite with added test for issue
- coverage_test.sh with compiled extension
- pympler reference leak test (see bottom of test.py)
Performance
Medium-size varied-content decoding
Steps
```shell
python3 -mubjson fromjson test/data/CouchDB4k.compact.json /tmp/CouchDB4k.compact.ubjson
python -mtimeit --n 50000 \
    -s "
from io import BytesIO
from ubjson import load, __version__
print(__version__)
raw = BytesIO()
with open('/tmp/CouchDB4k.compact.ubjson', 'rb') as f:
    raw.write(f.read())
" \
    'raw.seek(0); load(raw)'
```
Python3 output
Before
0.13.0
0.13.0
0.13.0
50000 loops, best of 3: 23.4 usec per loop
After
0.14.0
0.14.0
0.14.0
50000 loops, best of 3: 73.8 usec per loop
Python2 output
Before
0.13.0
0.13.0
0.13.0
50000 loops, best of 3: 25.5 usec per loop
After
0.14.0
0.14.0
0.14.0
50000 loops, best of 3: 92.5 usec per loop
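For context, the relative slowdown implied by the medium-size timings above (computed here from the reported figures):

```python
# usec-per-loop figures from the timeit runs above
py3_before, py3_after = 23.4, 73.8
py2_before, py2_after = 25.5, 92.5

# Removing buffering makes this benchmark roughly 3-3.5x slower
print(round(py3_after / py3_before, 1))  # 3.2
print(round(py2_after / py2_before, 1))  # 3.6
```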
Large file decoding (62MB with small fields)
```shell
python3 -mtimeit -r1 -n1 -s "
from ubjson import load, __version__
print(__version__)
" "
with open('DEFRA.uk_air.ubj', 'rb') as f:
    load(f, intern_object_keys=True)
"
```
Python3 output
Before
0.13.0
1 loops, best of 1: 2.08 sec per loop
After
0.14.0
1 loops, best of 1: 5.53 sec per loop
#11 addresses the performance concerns by using three methods for reading input from:
- Fixed single-dimension byte sequence (as before)
- Buffered from a `seek()`-able file-like object (as before)
- Unbuffered from a file-like object (new)
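A rough Python sketch of that three-way dispatch (the function and return names are assumptions; the actual selection happens inside the C extension):

```python
from io import BytesIO

def select_read_method(obj):
    # Hypothetical dispatch mirroring the three input paths listed above
    if isinstance(obj, (bytes, bytearray)):
        return 'bytes'       # fixed single-dimension byte sequence
    if getattr(obj, 'seekable', lambda: False)():
        return 'buffered'    # seek()-able: unused buffered bytes recoverable
    return 'unbuffered'      # e.g. socket: read exactly what is needed

assert select_read_method(b'\x00') == 'bytes'
assert select_read_method(BytesIO()) == 'buffered'
assert select_read_method(object()) == 'unbuffered'
```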