mangiucugna/json_repair

JSON info gets cut off with misplaced brackets

nomenclature95 opened this issue · 1 comment

Describe the bug
Hi, I was testing the json_repair module for a personal project that extracts information from a medical text and asks an LLM to fill in a JSON object. The LLM I'm using is not great at returning well-formed JSON, so I wanted to see how this module might help. However, I noticed that when the malformed JSON contains objects with extra closing brackets, the parser stops there, assumes the JSON has ended, and cuts off the remaining information.

To Reproduce
The ill-formatted JSON string is:

{"claimant_info":{"name":"John Doe","gender":"male","dominant_hand":"right-handed","date_of_birth":"01/01/2000"},"employment_info":{"occupation":"bank clerk","hours_per_week":0,"was_at_workplace_at_time_of_accident":false,"absence_not_working":[{"type":"sleep disturbance and frequent headaches","duration":""}],"work_restrictions":[{"type":""}]},"past_medical_history":[{"disease_or_pathology":"High cholesterol","text_span":""}]},"recovery_time":[{"body_part":"chest, neck, and back","recovery_time_in_days":"3-4 weeks from 1st treatment date or 9 to 12 visits whichever comes first","text_span":""}]},"dates":{"accident_date":"3/20/2021","examination_date":"3/26/2021","next_examination_date":"04/09/2021","signing_date":"3/26/2021 4:54:17 PM"}}

My test code:

from json_repair import json_repair
import json

jsonString = """{"claimant_info":{"name":"John Doe","gender":"male","dominant_hand":"right-handed","date_of_birth":"01/01/2000"},"employment_info":{"occupation":"bank clerk","hours_per_week":0,"was_at_workplace_at_time_of_accident":false,"absence_not_working":[{"type":"sleep disturbance and frequent headaches","duration":""}],"work_restrictions":[{"type":""}]},"past_medical_history":[{"disease_or_pathology":"High cholesterol","text_span":""}]},"recovery_time":[{"body_part":"chest, neck, and back","recovery_time_in_days":"3-4 weeks from 1st treatment date or 9 to 12 visits whichever comes first","text_span":""}]},"dates":{"accident_date":"3/20/2021","examination_date":"3/26/2021","next_examination_date":"04/09/2021","signing_date":"3/26/2021 4:54:17 PM"}}"""

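# Repair the malformed string into a Python object
repaired = json_repair.loads(jsonString)
# Pretty-print the repaired object and save it for inspection
output = json.dumps(repaired, indent=2)
with open("output.txt", "w") as f:
    f.write(output)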

Expected behavior
I would expect that if an extra closing bracket is followed by a comma, the parser infers that the bracket is misplaced and keeps parsing.
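The heuristic described above can be sketched as a pre-processing step that runs before any parser sees the text. This is only an illustration, not part of json_repair: the helper name `drop_premature_closers` is hypothetical, and the sketch assumes that a `}` which would close the top-level object is misplaced whenever the next non-whitespace character is a comma.

```python
import json


def drop_premature_closers(text: str) -> str:
    """Hypothetical helper: drop a '}' that would close the top-level
    object when a comma follows it. Not part of json_repair."""
    out = []
    depth = 0          # nesting depth of {} and [] outside string literals
    in_string = False  # True while scanning inside a "..." literal
    escaped = False    # True right after a backslash inside a string
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            # Assumption: a '}' that would end the document but is followed
            # by a comma is misplaced, so skip it and keep the depth as is.
            if ch == "}" and depth == 1 and text[i + 1:].lstrip().startswith(","):
                continue
            depth -= 1
        out.append(ch)
    return "".join(out)


broken = '{"field1":[{}]},"field2":[{}]}'
fixed = drop_premature_closers(broken)
print(fixed)              # {"field1":[{}],"field2":[{}]}
print(json.loads(fixed))  # {'field1': [{}], 'field2': [{}]}
```

The sketch only helps when the rest of the text really is a continuation of the same object. For a string such as `{"field1":[{}]},"field2"` it removes a bracket that was arguably correct and leaves an incomplete document behind, so it trades one failure mode for another.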

Desktop (please complete the following information):

  • OS: Windows 11
  • Python Kernel: 3.11.1
  • IDE: VSCode

I am afraid that an LL parser can't fix that. The problem is that the first part of that string is completely valid JSON, and when reading from the leftmost token there is no way to know that more, possibly valid, JSON follows.

For example, it's impossible to distinguish between the following two cases:
{"field1":[{}]},"field2":[{}]} => {"field1":[{}],"field2":[{}]}
{"field1":[{}]},"field2" => {"field1":[{}]}

I also checked jsonlint.com, and that tool likewise suggests that the string should end (EOF) at that point, just as json_repair does.

Unfortunately, this is simply a limitation of a parser that reads from the leftmost token.
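The ambiguity between the two cases above can be reproduced with Python's standard `json` module, which is used here only as a stand-in for a left-to-right parser and says nothing about json_repair's internals. `JSONDecoder.raw_decode` parses the first complete value and reports where it stopped. The helper name `first_value_and_rest` is made up for this sketch.

```python
import json

decoder = json.JSONDecoder()

case_misplaced = '{"field1":[{}]},"field2":[{}]}'  # bracket arguably misplaced
case_trailing = '{"field1":[{}]},"field2"'         # trailing garbage


def first_value_and_rest(text):
    """Parse the first complete JSON value and return it with the leftover text."""
    value, end = decoder.raw_decode(text)
    return value, text[end:]


value_a, rest_a = first_value_and_rest(case_misplaced)
value_b, rest_b = first_value_and_rest(case_trailing)

# Both inputs yield the same complete object before the parser stops.
print(value_a == value_b)  # True
# Only the leftover text differs, and the parser has already finished by then.
print(repr(rest_a))        # ',"field2":[{}]}'
print(repr(rest_b))        # ',"field2"'
```

In both cases the parser has a complete, valid object after the first 15 characters. Deciding whether the leftover text is a continuation or junk would require looking ahead and guessing intent.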