macr0dev/Audiobooks.bundle

New parser using json.loads aborts on some pages.

dethrophes opened this issue · 12 comments

The json parser doesn't seem to cope with some of the pages.
e.g.
https://www.audible.com/pd/B00IZOP8CI

Invalid \escape: line 11 column 263 (char 443)

Specifically, it seems to be the `\)` in the following.

, "description": "<p>Do you know why…</p><p>a mortgage is literally a death pledge? …why guns have girls’ names? …why salt is related to soldier?</p> You’re about to find out…<?-‘mä-lä-ji-kän) is:</p><p>*Witty (wi-te\): Full of clever humor</p><p>*Erudite (er-?-dit): Showing knowledge</p><p>*Ribald (ri-b?ld): Crude, offensive</p><p><i>The Etymologicon</i> is a ce strange underpinnings of the English language. It explains: How you get from “gruntled” to “disgruntled”; why you are absolutely right to believe that your meager salary barely covers “main of coffee shops in the world (hint: Seattle) connects to whaling in Nantucket; and what precisely the Rolling Stones have to do with gardening. </p>"
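For reference, the failure is easy to reproduce with a trimmed sample (the string below is illustrative, not the full page content):

```python
import json

# Illustrative only: a JSON snippet containing the stray "\)" from the
# description above. "\)" is not a valid JSON escape sequence, so the
# parser aborts with an "Invalid \escape" error.
bad = '{"description": "Witty (wi-te\\): Full of clever humor"}'

try:
    json.loads(bad)
except ValueError as e:  # json.JSONDecodeError subclasses ValueError
    print(e)
```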

what a weird book description. Boy they've really snarfed things up with these new pages though.

Jeez, well - I suppose we could try to sanitize it before it gets to the json parser. But by the time we do that, we might as well go back to using the regex and search functions.

Tests are looking good with the changes below. I just moved the data over to a regular variable and replaced the \ with nothing. If this gets out of hand, we might need a sub for sanitizing different things. Gonna commit this later.

if date is None:
    for r in html.xpath('//script[contains(@type, "application/ld+json")]'):
        page_content = r.text_content()
        page_content = page_content.replace('\\', '')
        json_data = json_decode(page_content)

Commit is done and tested on my production box.

9a54621

Well, it was a short-lived victory anyway. It fixed the book you found it in, but I'm starting to see problems with other books, both with and without the fix I came up with.

OK, did some digging and found the other problem and made a few more changes.

  1. I've switched from just removing the backslash to escaping it with another backslash.

  2. I've discovered that some of the descriptions have a hidden '\n' in them, so I'm just straight removing it.

I took the source from the pages and ran them through a JSON validator. Those newlines were definitely breaking things, they just aren't in every page. Escaping the backslash makes the validator happy, so that seems to be good. Honestly I think the backslash appearing is a fluke from non-sanitized text making its way into their database. I don't think they're actually trying to escape anything with that backslash. So we'll just have to watch for backslash problems in the future and address them another way if they come up.
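In code, the two changes above amount to something like this sketch (the `sanitize` helper name is mine, not from the commit):

```python
import json

def sanitize(page_content):
    # Double every backslash so json sees a literal backslash rather than
    # the start of an (invalid) escape sequence...
    page_content = page_content.replace('\\', '\\\\')
    # ...and strip the hidden raw newlines that break some descriptions.
    return page_content.replace('\n', '')

raw = '{"description": "Witty (wi-te\\): clever\nhumor"}'
data = json.loads(sanitize(raw))
```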

.... well. I stumbled across some books that actually have escaped characters. Looks like I'm gonna have to write something to check for backslashes that aren't escaping a legit character that needs it, and then escape just those. Which puts us back to not eliminating NOR escaping all backslashes, and looking for specific instances of invalid data....

OK. I've resigned myself, for the moment, to just removing that special case of backslash-paren for that one book. The newline is still removed, but I'm out of ideas at the moment on how to further sanitize the data before handing it to json.... gonna have to mull on this one.

Oh, and I added ratings from Audible into Plex in a separate commit this morning.

OK. Finally built a regex that will remove any backslash UNLESS it is immediately followed by a character that JSON needs to have escaped. That should at least put the 'escaping' issue to bed. Who knows what crazy characters will pop up next.
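The idea is roughly a negative lookahead, along these lines (a sketch only; the actual pattern is in the commit):

```python
import re

# Sketch: remove a backslash only when it is NOT followed by one of the
# characters JSON allows to be escaped (", \, /, b, f, n, r, t, u).
INVALID_ESCAPE = re.compile(r'\\(?!["\\/bfnrtu])')

def strip_bad_escapes(page_content):
    return INVALID_ESCAPE.sub('', page_content)
```

So the stray `\)` gets cleaned up, while a legitimate `\"` is left alone.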

@dethrophes, want to throw in an opinion on this one?

0f5db94#diff-3ee84c02e62336e4581b8f124526c78b

might as well just compile it once.

Is that supposed to take care of removing the newlines as well, and remove the need for this line?

page_content = page_content.replace('\n', '')

It seems to work just as well as mine for removing erroneous escapes, but the newline slips by it.

Cleaning up old issues. This one is long resolved. Closing.