[Bug]: Failed repair on some quote cases
Closed · 1 comment
Robin-Dong commented
Version of the library
0.25.2
Describe the bug
As the cases below show, the inputs with IDs 1, 4, and 5 are not repaired correctly.
input: {"na"me": "Jack O"Sullivan", "id": "1"}
output: {"na": "e", "Jack O": "ullivan", "id": "1"}
------------
input: {"name": "Jack: The "OG" O"Sullivan"", "id": "2"}
output: {"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}
------------
input: {"name": "Jack: The "OG"", "surname": 'O'Sullivan', "id": "3"}
output: {"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}
------------
input: {"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}
output: {"test_str": {"1singlechar": "a\"", "a": "a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}
------------
input: {'name': 'Jack O'Sullivan, 'id': '5'}
output: {"name": "Jack O", "id": "5"}
------------
How to reproduce
from json_repair import repair_json

# Malformed inputs containing unescaped quotes inside keys and values.
req_jsons = [
    '{"na"me": "Jack O"Sullivan", "id": "1"}',
    '{"name": "Jack: The "OG" O"Sullivan"", "id": "2"}',
    '{"name": "Jack: The "OG"", "surname": \'O\'Sullivan\', "id": "3"}',
    '{"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}',
    "{'name': 'Jack O'Sullivan, 'id': '5'}",
]

for bad_json_string in req_jsons:
    # skip_json_loads=True skips the initial json.loads() validity check.
    good_json_string = repair_json(bad_json_string, skip_json_loads=True)
    print(f"input: {bad_json_string}\noutput: {good_json_string}")
    print("------------")
Expected behavior
input: {"na"me": "Jack O"Sullivan", "id": "1"}
output: {"na\"me": "Jack O\"Sullivan", "id": "1"}
------------
input: {"name": "Jack: The "OG" O"Sullivan"", "id": "2"}
output: {"name": "Jack: The \"OG\" O\"Sullivan\"", "id": "2"}
------------
input: {"name": "Jack: The "OG"", "surname": 'O'Sullivan', "id": "3"}
output: {"name": "Jack: The \"OG\"", "surname": "O'Sullivan", "id": "3"}
------------
input: {"test_str": {"1singlechar": "a""a""a", "2singlechars": "a"a"a"a"a"a"a"a"a"}, "id": "4"}
output: {"test_str": {"1singlechar": "a\"\"a\"\"a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}
------------
input: {'name': 'Jack O'Sullivan, 'id': '5'}
output: {"name": "Jack O'Sullivan", "id": "5"}
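To make the three failures easier to see, the actual and expected outputs can be compared with the standard library alone. The sketch below is illustrative and not part of the original report: `describe_mismatch` is a hypothetical helper, it only looks at top-level keys, and it assumes the expected key for case 1 is meant to be `na\"me`.

```python
import json

# (actual output, expected output) for the failing cases, copied from this report.
# Assumption: the expected key in case 1 is read as na\"me.
FAILING = {
    "1": (
        r'{"na": "e", "Jack O": "ullivan", "id": "1"}',
        r'{"na\"me": "Jack O\"Sullivan", "id": "1"}',
    ),
    "4": (
        r'{"test_str": {"1singlechar": "a\"", "a": "a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}',
        r'{"test_str": {"1singlechar": "a\"\"a\"\"a", "2singlechars": "a\"a\"a\"a\"a\"a\"a\"a\"a"}, "id": "4"}',
    ),
    "5": (
        r'''{"name": "Jack O", "id": "5"}''',
        r'''{"name": "Jack O'Sullivan", "id": "5"}''',
    ),
}


def describe_mismatch(actual: str, expected: str) -> dict:
    """Summarize how two JSON objects differ at the top level."""
    a, e = json.loads(actual), json.loads(expected)
    return {
        "unexpected_keys": sorted(set(a) - set(e)),
        "missing_keys": sorted(set(e) - set(a)),
        "changed_values": sorted(k for k in set(a) & set(e) if a[k] != e[k]),
    }


for case_id, (actual, expected) in FAILING.items():
    print(case_id, describe_mismatch(actual, expected))
```

In case 1 the unescaped quotes split one key/value pair into two, while in cases 4 and 5 the keys survive but the values are truncated or split.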
mangiucugna commented
Hi, those are all tricky cases that clash with other requirements (most notably the need to remove stray LLM comments from objects). Which LLM generated those?
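One way to picture the clash mentioned above is a toy lookahead rule: treat a double quote inside a string as the closing quote only when the next non-whitespace character is `,`, `:`, `}`, or `]`, and escape it otherwise. This is a sketch under that assumption only, not json_repair's actual algorithm; `escape_inner_quotes` is a hypothetical helper, and it ignores single-quoted strings (cases 3 and 5).

```python
import json

STRUCTURAL = set(",:}]")


def escape_inner_quotes(text: str) -> str:
    """Toy heuristic (an assumption, not json_repair's logic): a double quote
    inside a string closes it only if a structural character follows."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string and ch == "\\" and i + 1 < len(text):
            # Keep existing escape sequences untouched.
            out.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            if not in_string:
                in_string = True
                out.append(ch)
            else:
                rest = text[i + 1:].lstrip()
                if not rest or rest[0] in STRUCTURAL:
                    in_string = False  # looks like a real closing quote
                    out.append(ch)
                else:
                    out.append('\\"')  # looks like a quote inside the value
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(escape_inner_quotes('{"na"me": "Jack O"Sullivan", "id": "1"}'))
# → {"na\"me": "Jack O\"Sullivan", "id": "1"}

# The same rule breaks when stray text follows a complete value: the text and
# the next key are swallowed into the string, and the result no longer parses.
broken = escape_inner_quotes('{"a": "b" stray note, "c": "d"}')
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("still invalid:", exc.msg)
```

The rule produces the expected output for cases 1, 2, and 4, but it cannot tell an unescaped inner quote from a finished value followed by stray text, so a repair tool has to favor one reading over the other.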