[Bug]: json_repair incredibly slow with from_file
Closed this issue · 4 comments
Version of the library
0.25.3
Describe the bug
Parsing a file with from_file is significantly slower than parsing it with .loads, due to the StringFileWrapper abstraction. Namely, seeking one byte at a time is incredibly slow.
cProfile and strace show the time is spent in:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 11079787    1.396    0.000   27.749    0.000 json_repair.py:619(get_char_at)
 11080226    3.403    0.000   26.355    0.000 json_repair.py:36(__getitem__)
This makes sense, since the library reads the file one byte at a time.
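To illustrate the mechanism, here is a simplified sketch (the class and method names are illustrative, not the library's actual code) of a wrapper that satisfies string-style indexing by seeking and reading a single byte per access; a parser that indexes such a wrapper once per character pays a seek/read pair for each of the millions of get_char_at calls in the profile above.

```python
# Illustrative sketch only, not json_repair's actual StringFileWrapper:
# a wrapper that exposes a file as if it were an indexable string.
class OneByteFileWrapper:
    def __init__(self, fd):
        self.fd = fd  # an open file object

    def __getitem__(self, index: int) -> str:
        # Every character access becomes a seek() plus a read(1),
        # i.e. two I/O operations per character the parser looks at.
        self.fd.seek(index)
        return self.fd.read(1)
```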
How to reproduce
With the following JSON file:
{
  "type": "TPM2_ALG_ECC",
  "nameAlg":"TPM2_ALG_SHA256",
  "srk_template": "system,restricted,decrypt,0x81000001",
  "srk_description": "Storage root key SRK",
  "srk_persistent": 0,
  "ek_template": "system,restricted,decrypt",
  "ek_description": "Endorsement key EK",
  "ecc_signing_scheme": {
    "scheme":"TPM2_ALG_ECDSA",
    "details":{
      "hashAlg":"TPM2_ALG_SHA256"
    },
  },
  "sym_mode":"TPM2_ALG_CFB",
  "sym_parameters": {
    "algorithm":"TPM2_ALG_AES",
    "keyBits":"128",
    "mode":"TPM2_ALG_CFB"
  },
  "sym_block_size": 16,
  "pcr_selection": [
    { "hash": "TPM2_ALG_SHA1",
      "pcrSelect": [ ],
    },
    { "hash": "TPM2_ALG_SHA256",
      "pcrSelect": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 ]
    }
  ],
  "curveID": "TPM2_ECC_NIST_P256",
  "ek_policy": {
    "description": "Endorsement hierarchy used for policy secret.",
    "policy":[
      {
        "type":"POLICYSECRET",
        "objectName": "4000000b",
      }
    ]
  }
}
Calling json_repair.from_file('/tmp/json2.json') hangs for a long time, while json_repair.loads(open('/tmp/json2.json').read()) finishes instantly.
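A quick way to quantify the gap, using only the two calls already shown above (the path is the example file from this report):

```python
import time

import json_repair

path = '/tmp/json2.json'

start = time.perf_counter()
json_repair.from_file(path)
print('from_file:', time.perf_counter() - start, 'seconds')

start = time.perf_counter()
json_repair.loads(open(path).read())
print('loads:', time.perf_counter() - start, 'seconds')
```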
Expected behavior
from_file should be as fast as .loads(open(file).read())
Hi! Thanks a lot for the report; finally someone is using this feature.
I will need to find some time to sit down and think about it.
The basic idea is to avoid loading the entire file into memory, because you don't know how big it is. Maybe I am overthinking it.
Nonetheless, the library should probably load a large chunk of the file rather than one byte at a time.
Feel free to think along with me and/or push a PR.
Thanks for the quick response!
I think it's fine to buffer the first n bytes of the file (e.g. 1 MB) and fall back to loading it in chunks (of 1 MB each) if it's larger than that arbitrary threshold.
Here is a draft patch to indicate what I had in mind: #67. If you think it's a good direction, I'm happy to test it more thoroughly and run all the tooling.
I'm not sure this is the right approach: it's probably still much faster to check whether the file is small and, if it's under a size where streaming/memory is a concern, just read it whole.
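For illustration, a hypothetical size check along these lines (the 1 MB threshold is the one floated above; the constant and helper names are made up):

```python
import os

ONE_MB = 1024 * 1024  # the arbitrary threshold suggested above

# Hypothetical helper, not part of the library: decide up front whether the
# file is small enough to read whole, or large enough to need chunked access.
def should_read_whole_file(path, threshold=ONE_MB):
    return os.path.getsize(path) <= threshold
```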
Thanks for the PR, the direction seems correct; I haven't had the time to review it in detail.
One thing I didn't see, and that I think is necessary, is freeing the buffer to save memory (say you keep at most 2 chunks).
Otherwise this is exactly the same as loading the entire file into memory. The good news is that the library looks back very little, so keeping just 2 chunks should be more than enough.
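A minimal sketch of that idea, assuming an ASCII file so byte offsets match character offsets (the names and the 1 MB chunk size are illustrative, not the PR's actual code):

```python
CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as suggested above
MAX_CHUNKS = 2            # keep at most two chunks resident


class ChunkedFileWrapper:
    """Illustrative only: expose a file as an indexable sequence of characters,
    reading it in large chunks and freeing old chunks to bound memory use."""

    def __init__(self, path):
        self.fd = open(path, 'rb')
        self.chunks = {}  # chunk index -> raw bytes of that chunk

    def __getitem__(self, index: int) -> str:
        chunk_idx, offset = divmod(index, CHUNK_SIZE)
        if chunk_idx not in self.chunks:
            # One seek and one large read per chunk instead of per character.
            self.fd.seek(chunk_idx * CHUNK_SIZE)
            self.chunks[chunk_idx] = self.fd.read(CHUNK_SIZE)
            # Drop the oldest chunk so at most MAX_CHUNKS stay in memory;
            # the parser only looks back a little, so this is enough.
            while len(self.chunks) > MAX_CHUNKS:
                self.chunks.pop(min(self.chunks))
        return chr(self.chunks[chunk_idx][offset])
```

Evicting the lowest chunk index works because the parser moves forward and, as noted above, only looks back a little, so anything older than the previous chunk is never revisited.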