TeskaLabs/cysimdjson

Reusing parser modifies previous results

ROpdebee opened this issue · 4 comments

Installed via PyPI

from cysimdjson import JSONParser
parser = JSONParser()
result = parser.parse(b'{"hello": "world"}')
print(result['hello'])  # "world"
new_result = parser.parse(b'{"hello": "universe"}')
print(result['hello'])  # "universe"

This only happens when reusing a previous parser instance. I'm not sure whether this is by design, but if it is, it should probably be explicitly documented to avoid confusion.

It also becomes especially iffy when mixing different types:

result = parser.parse(b'{"hello": "world"}')
print(type(result))  # JSONObject
print(list(result.keys()))  # ['hello']
new_result = parser.parse(b'[1,2,3]')
print(type(result))  # JSONObject
print(type(new_result))  # JSONArray
print(list(result.keys()))  # ['hello', 'hello', 'hello']

So if it is by design, it might be worthwhile to somehow invalidate the previous reference when a new document is parsed.

Thanks for reporting, we will look at that.

Ok, after some digging, this is actually correct behaviour - it is linked with the SIMDJSON requirement for the lifecycle of the document, see the remark at https://simdjson.org/api/0.9.1/classsimdjson_1_1dom_1_1parser.html#ab3e5bbb1974a1932aead90ad63883a23

I will try to provide some kind of indication of this as per your suggestion.

Note that the parser holds the allocated memory. By reusing the parser from document to document, we reuse the allocated memory. The net result is to avoid memory reallocation, an expensive process. On some systems, it is more expensive to allocate memory than to parse JSON !

You can see one approach to stopping this in pysimdjson, https://github.com/TkTech/pysimdjson/blob/master/simdjson/csimdjson.pyx#L437. It's likely not our final fix, since it's a bit buggy on pypy whose garbage collector might wait a long, long time to cleanup things pointing into the Parser's document even if it's been explicitly deleted. In @ROpdebee's example above, it would have raised a RuntimeError instead of a potential segfault.