Decoding from file-like object reads too much data (C extension only)
vtermanis opened this issue · 3 comments
Problem
When decoding from a file-like object (using `load()` via the C extension), up to BUFFER_FP_SIZE bytes are buffered when `read()`ing from it. This means data is likely to be consumed past the end of the encoded ubjson block, leaving it unavailable either for decoding additional ubjson blocks or for other purposes.
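The over-read can be illustrated with a plain BytesIO stream (the byte values and the BUFFER_FP_SIZE figure here are placeholders for illustration, not real UBJSON or the library's actual constant):

```python
from io import BytesIO

BUFFER_FP_SIZE = 512  # assumed value, for illustration only

# Two adjacent encoded blocks in one stream (placeholder bytes, not real UBJSON)
stream = BytesIO(b'<doc1>' + b'<doc2>')

# A buffering decoder requests up to BUFFER_FP_SIZE bytes at once, pulling in
# doc2's bytes even though only doc1 is being decoded...
chunk = stream.read(BUFFER_FP_SIZE)
assert b'<doc2>' in chunk

# ...so nothing is left in the stream for a second load() call:
assert stream.read() == b''
```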
Potential solutions
- `seek()` back to recover unused read data
  - File-like objects should not need to support `seek()`
- Don't buffer at all and only read as much data as needed (often single bytes)
  - Underlying file-like object might have its own buffering already
  - Pure-Python version already does this
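The second option can be sketched as a thin wrapper whose `read()` never consumes more than the decoder asks for (the class and names here are illustrative, not the library's API):

```python
from io import BytesIO

class ExactReader:
    """Sketch: request exactly the needed bytes, leaving the rest untouched."""

    def __init__(self, fp):
        self._fp = fp

    def read(self, size):
        # Loop because read() on e.g. a socket may return fewer bytes than asked
        parts = []
        remaining = size
        while remaining > 0:
            chunk = self._fp.read(remaining)
            if not chunk:
                break  # EOF
            parts.append(chunk)
            remaining -= len(chunk)
        return b''.join(parts)

# Reading a one-byte type marker leaves the rest of the stream available:
reader = ExactReader(BytesIO(b'Sabc'))
assert reader.read(1) == b'S'
assert reader._fp.read() == b'abc'
```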
Test case
```python
import ubjson
from io import BytesIO

# Only applies to C extension
assert ubjson.EXTENSION_ENABLED

sample_input = 'something to encode'
output = BytesIO()

# Produce output with multiple serialised ubjson "documents"
for _ in range(10):
    ubjson.dump(sample_input, output)

# Decode all of the documents
output.seek(0)
for i in range(10):
    print(i)
    assert sample_input == ubjson.load(output)
```
Expected result
All 10 documents are decoded.
Actual result
0
1
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
ubjson.decoder.DecoderException: Insufficient input (Type marker)
Initial solution - remove buffering completely
Overview
Read the exact requested byte count from the file-like object's `read()` method. This results in significantly more calls to said method (e.g. each type marker results in a one-byte read) and leads to considerable performance degradation.
Potential alternative
If the file-like object is `seekable()` (e.g. a file on the local filesystem), enable buffering; otherwise (e.g. a socket) use no buffering.
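`seekable()` (available on Python 3 `io` objects) already distinguishes the two cases; a quick check, using a socketpair to stand in for a network stream:

```python
import socket
from io import BytesIO

# An in-memory (or on-disk) stream can be rewound, so buffering is safe:
assert BytesIO().seekable() is True

# A socket's file wrapper cannot, so it would get the unbuffered path:
a, b = socket.socketpair()
sock_file = a.makefile('rb')
assert sock_file.seekable() is False
sock_file.close()
a.close()
b.close()
```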
General testing
- full suite with added test for issue
- coverage_test.sh with compiled extension
- pympler reference leak test (see bottom of test.py)
Performance
Medium-size varied-content decoding
Steps
```shell
python3 -mubjson fromjson test/data/CouchDB4k.compact.json /tmp/CouchDB4k.compact.ubjson
python -mtimeit --n 50000 \
    -s "
from io import BytesIO
from ubjson import load, __version__
print(__version__)
raw = BytesIO()
with open('/tmp/CouchDB4k.compact.ubjson', 'rb') as f:
    raw.write(f.read())
" \
    'raw.seek(0); load(raw)'
```
Python3 output
Before
0.13.0
0.13.0
0.13.0
50000 loops, best of 3: 23.4 usec per loop
After
0.14.0
0.14.0
0.14.0
50000 loops, best of 3: 73.8 usec per loop
Python2 output
Before
0.13.0
0.13.0
0.13.0
50000 loops, best of 3: 25.5 usec per loop
After
0.14.0
0.14.0
0.14.0
50000 loops, best of 3: 92.5 usec per loop
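For context, the relative slowdown implied by the medium-size timings above (computed here from the reported figures):

```python
# usec-per-loop figures from the timeit runs above
py3_before, py3_after = 23.4, 73.8
py2_before, py2_after = 25.5, 92.5

# Removing buffering makes this benchmark roughly 3-3.5x slower
print(round(py3_after / py3_before, 1))  # 3.2
print(round(py2_after / py2_before, 1))  # 3.6
```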
Large file decoding (62MB with small fields)
```shell
python3 -mtimeit -r1 -n1 -s "
from ubjson import load, __version__
print(__version__)
" "
with open('DEFRA.uk_air.ubj', 'rb') as f:
    load(f, intern_object_keys=True)
"
```
Python3 output
Before
0.13.0
1 loops, best of 1: 2.08 sec per loop
After
0.14.0
1 loops, best of 1: 5.53 sec per loop
#11 addresses the performance concerns by using three methods for reading input from:
- Fixed single-dimension byte sequence (as before)
- Buffered from a `seek()`-able file-like object (as before)
- Unbuffered from a file-like object (new)
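A rough Python sketch of that three-way dispatch (the function and return names are assumptions; the actual selection happens inside the C extension):

```python
from io import BytesIO

def select_read_method(obj):
    # Hypothetical dispatch mirroring the three input paths listed above
    if isinstance(obj, (bytes, bytearray)):
        return 'bytes'       # fixed single-dimension byte sequence
    if getattr(obj, 'seekable', lambda: False)():
        return 'buffered'    # seek()-able: unused buffered bytes recoverable
    return 'unbuffered'      # e.g. socket: read exactly what is needed

assert select_read_method(b'\x00') == 'bytes'
assert select_read_method(BytesIO()) == 'buffered'
assert select_read_method(object()) == 'unbuffered'
```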