read_pdf fails on specific pdf locally, not through hosted api

Question

Ianpwest opened this issue 6 months ago · 5 comments

PDF in question:
JTR.pdf

This API call works great:
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This local call fails:
llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"](http://localhost:5010/api/parseDocument?renderFormat=all%22)
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest Docker build. Other PDFs work fine. Is there a way to get a better error message? I am currently receiving: KeyError: 'return_dict'

I noticed there are other open issues about this error, but I did not find any matching this case, where the same PDF parses through the hosted API but not through the local one.

I appreciate your time and any insight. Thanks!

Answer 1 · 2024-04-16T07:29:56.000Z

Hey, @Ianpwest did you manage to solve this?

Answer 2 · 2024-04-16T12:50:40.000Z

@wolfassi123 No, and there were also some other parsing issues with different character sets. The library is promising but seemingly under-supported. There has been no movement on my tickets.

Answer 3 · 2024-04-17T16:52:28.000Z

Hello @Ianpwest, @wolfassi123, I have fixed the issue, and it seems to be working with the sample PDF provided here. Can you pull from the main branch of nlm-ingestor and verify?

Answer 4 · 2024-07-16T11:34:49.000Z

Switching to the Docker image for nlm-ingestor in this comment worked for me.

Answer 5 · 2024-07-19T23:02:32.000Z

llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"]

If you look closely, you'll see that the " and the [ are swapped in the local call.
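A quick check like the one below would catch this kind of slip before the request is sent. It is only a sketch using the standard library: `find_url_problems` is a made-up helper, not part of llmsherpa, and the rule that a `renderFormat` value should be purely alphanumeric is my assumption, not something the server documents.

```python
from urllib.parse import urlsplit, parse_qs


def find_url_problems(api_url):
    """Return a list of human-readable problems found in api_url (empty if none).

    Hypothetical helper for illustration; not part of llmsherpa.
    """
    try:
        parts = urlsplit(api_url)
    except ValueError as exc:
        return [f"URL could not be parsed: {exc}"]

    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append(
            f"unexpected scheme {parts.scheme!r}; URL should start with http:// or https://"
        )
    if not parts.netloc:
        problems.append("missing host")
    # Assumption: a valid renderFormat value contains only letters and digits.
    for value in parse_qs(parts.query).get("renderFormat", []):
        if not value.isalnum():
            problems.append(f"renderFormat has stray characters: {value!r}")
    return problems


if __name__ == "__main__":
    # The local URL exactly as pasted in the original post.
    pasted = '[http://localhost:5010/api/parseDocument?renderFormat=all"]'
    for problem in find_url_problems(pasted):
        print(problem)
```

Running it against the pasted local URL reports the stray characters, while the hosted URL from the first snippet passes cleanly.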

The error handling of a nonexistent renderFormat could be better.
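For anyone who wants a clearer message on the client side in the meantime, here is a rough sketch of what friendlier handling could look like. The only thing confirmed in this thread is that the client looks up a 'return_dict' key; the nested 'result' and 'blocks' keys, the `extract_blocks` helper, and the `ParseResponseError` type are assumptions for illustration and are not llmsherpa's actual code.

```python
import json


class ParseResponseError(RuntimeError):
    """Hypothetical error type for illustration; not part of llmsherpa."""


def extract_blocks(raw_body):
    """Return the parsed blocks, or raise an error that shows what the server sent.

    raw_body is the response body as str or bytes.
    """
    try:
        payload = json.loads(raw_body)
    except ValueError as exc:  # covers JSONDecodeError and bad encodings
        raise ParseResponseError(
            f"server did not return JSON: {raw_body[:200]!r}"
        ) from exc

    if not isinstance(payload, dict) or "return_dict" not in payload:
        # Show a slice of the payload instead of a bare KeyError.
        raise ParseResponseError(
            f"no 'return_dict' in server response; got: {str(payload)[:200]}"
        )

    try:
        # Assumed layout of a successful response.
        return payload["return_dict"]["result"]["blocks"]
    except (KeyError, TypeError) as exc:
        raise ParseResponseError(
            f"unexpected 'return_dict' layout: {str(payload['return_dict'])[:200]}"
        ) from exc
```

With something like this, a failed parse shows whatever the server actually returned rather than just KeyError: 'return_dict'.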