nlmatics/llmsherpa

read_pdf fails on specific pdf locally, not through hosted api

Ianpwest opened this issue · 5 comments

PDF in question:
JTR.pdf

This api call works great
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This local call fails
llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"](http://localhost:5010/api/parseDocument?renderFormat=all%22)
pdf_url = "JTR.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

The local version is running from the latest docker build. Other pdfs work fine. Is there a way to get a better error message? Currently receiving: KeyError: 'return_dict'
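One way to get a clearer message than a bare KeyError is to inspect the server's raw reply before indexing into it. The sketch below is only an illustration: `extract_blocks` is a hypothetical helper, not part of llmsherpa, and the `return_dict` -> `result` -> `blocks` layout is an assumption based on the key named in the error.

```python
import json


def extract_blocks(raw_body: str) -> list:
    """Return the parsed blocks, or raise an error that quotes the server's reply.

    Hypothetical helper; the response layout is assumed, not confirmed.
    """
    snippet = raw_body[:200]  # keep the error message short
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"server did not return JSON: {snippet!r}") from exc
    if not isinstance(payload, dict) or "return_dict" not in payload:
        # This is the case that otherwise shows up as KeyError: 'return_dict'.
        raise ValueError(f"response has no 'return_dict': {snippet!r}")
    # Assumed layout: {"return_dict": {"result": {"blocks": [...]}}}
    return payload["return_dict"].get("result", {}).get("blocks", [])
```

Here `raw_body` stands for whatever text the local server sends back for the failing PDF. Quoting the start of that text in the exception usually shows whether the server reported its own error or returned something unexpected.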

I noticed there are other issues open around this error but did not find any matching this case where it works on one and not the other.

I appreciate your time and any insight. Thanks!

Hey, @Ianpwest did you manage to solve this?

@wolfassi123 No, there were also some other parsing issues with different character sets. The library is promising but seemingly under-supported. No movement on my tickets.

Hello @Ianpwest, @wolfassi123, I have fixed the issue and it seems to be working with the sample PDF provided here. Can you pull from the main branch of nlm-ingestor and verify?

Switching to the docker image for nlm-ingestor in this comment worked for me.

llmsherpa_api_url = "[http://localhost:5010/api/parseDocument?renderFormat=all"]

If you look closely, you'll see that the " and the [ are in swapped order in the local call.

The error handling of a nonexistent renderFormat could be better.
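A client-side check along these lines would catch both problems before the request is sent. This is a minimal sketch: `check_parser_url` is a hypothetical helper, and `known_formats` defaults to `("all",)` only because that is the single value shown in this thread; the server may accept others.

```python
from urllib.parse import parse_qs, urlsplit


def check_parser_url(url: str, known_formats=("all",)) -> str:
    """Return url unchanged if it looks usable, otherwise raise ValueError.

    Hypothetical helper; known_formats is an assumption, pass in the real list.
    """
    try:
        parts = urlsplit(url)
    except ValueError as exc:
        raise ValueError(f"malformed URL {url!r}: {exc}") from exc
    # A stray leading character such as "[" leaves the scheme empty.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"URL must start with http:// or https://, got {url!r}")
    # A trailing quote (or %22) ends up inside the renderFormat value.
    for fmt in parse_qs(parts.query).get("renderFormat", []):
        if fmt not in known_formats:
            raise ValueError(
                f"unexpected renderFormat {fmt!r}; expected one of {known_formats}"
            )
    return url
```

Passing the result straight to the reader, as in `LayoutPDFReader(check_parser_url(llmsherpa_api_url))`, would turn a misplaced bracket or quote into an immediate, readable error.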