mangiucugna/json_repair

Issue with parsing when there is leading text

lujasmine opened this issue · 2 comments

Describe the bug
Issue with parsing when there is leading text

To Reproduce

import json_repair

json_repair.loads("Based on the information extracted, here is the filled JSON output: ```json { 'a': 'b' } ```")
# returns the input string unchanged instead of the parsed object

Expected behavior
It should return {'a': 'b'}

I've noticed that the repair works well with trailing text, e.g.

json_repair.loads("```json { 'a': 'b' } ``` This output reflects the information given in the input.")
# returns {'a': 'b'} as expected

Hi @lujasmine thanks for reporting this issue.
Your issue prompted a deeper look into how the library handles stray characters. I am releasing 0.16.0 with better handling of those cases, which will fix your case.

However, I don't think this is the right way to handle it on your side. Most LLMs are trained to emit those tokens precisely so that you can isolate the JSON part of the message with string manipulation.
I have also seen people simply remove anything before and after the {} if they expect an object.

Up to you, of course, but the more you can clean on your side, the cleaner the final JSON.
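
For readers who want to try the cleanup described above, here is a minimal sketch using only the standard library. The helper name `extract_json_candidate` and the regular expression are illustrative assumptions, not part of json_repair. It first looks for a fenced block, then falls back to keeping everything between the first `{` and the last `}`.

```python
import re

# Hypothetical pattern: a ```json ... ``` (or bare ``` ... ```) fenced block.
# DOTALL lets the payload span several lines.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def extract_json_candidate(text: str) -> str:
    """Return the part of `text` most likely to hold the JSON payload."""
    match = FENCE_RE.search(text)
    if match:
        # Keep only what sits between the opening and closing fence.
        return match.group(1)
    # Fallback when an object is expected: first "{" through last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        return text[start : end + 1]
    # Nothing recognisable; hand back the trimmed input.
    return text.strip()
```

The returned string can then be passed to `json_repair.loads`, so the library only has to repair the payload itself rather than guess where it starts. The brace fallback assumes the payload is a single object; an array would need the same treatment with `[` and `]`.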

That makes sense, I will make sure to do more cleaning on my side!

Thank you so much @mangiucugna!