Bug in API function: Incorrect behavior with repeated sections.

Question

Opened this issue 5 months ago · 8 comments

The issue arises when extracting HTML content from a document with the .to_html() method after reading a PDF:

doc = pdf_reader.read_pdf(pdf_url)
doc.to_html()

When iterating through the sections, the loop processes both the parent and child sections, so the same content appears more than once in the HTML output.

Here is the relevant code:

    def to_html(self):
        """
        Returns html for the document by iterating through all the sections
        """
        html_str = "<html>"
        for section in self.sections():
            html_str = html_str + section.to_html(include_children=True, recurse=True)
        html_str = html_str + "</html>"
        return html_str
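To make the suspected mechanism concrete, here is a minimal sketch using a toy tree rather than the real llmsherpa classes. It assumes the section list contains nested sections as well as top-level ones; `Section`, `all_sections`, and `render` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Hypothetical stand-in for a section node; not the llmsherpa class."""
    title: str
    children: List["Section"] = field(default_factory=list)

    def to_html(self) -> str:
        # Render this section followed by all of its descendants.
        return f"<h1>{self.title}</h1>" + "".join(c.to_html() for c in self.children)


def all_sections(roots: List[Section]) -> List[Section]:
    """Collect every section at every depth (parents and children alike)."""
    found = []
    for node in roots:
        found.append(node)
        found.extend(all_sections(node.children))
    return found


def render(sections: List[Section]) -> str:
    # Same shape as the to_html() loop above: render each listed section recursively.
    return "<html>" + "".join(s.to_html() for s in sections) + "</html>"


top_level = [Section("Intro", [Section("Scope")]), Section("Methods")]

duplicated = render(all_sections(top_level))  # "Scope" is rendered twice
deduplicated = render(top_level)              # each section is rendered once
```

If the list being iterated already includes child sections, every child is emitted once by its parent's recursive render and once again on its own, which would match the repetition described here.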

Answer 1 · 2024-01-26T14:13:10.000Z

This should not happen, since we only go through the first level of sections, where each section is distinct, and then traverse each section all the way to the end. Can you give an example?

Answer 2 · 2024-01-27T11:25:16.000Z

from IPython.display import HTML  # needed for HTML(...) when run in a notebook
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = 'https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all&useNewIndentParser=true&applyOcr=yes'
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf')
HTML(doc.to_html())

In the output, sections repeat.

Answer 3 · 2024-01-27T12:24:17.000Z

I started my own server following the instructions, but when reading

doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf')

I received the following error:

347, in render_json table_block["left"], KeyError: 'left'

Any insights on how to address this would be greatly appreciated.

Answer 4 · 2024-02-02T05:29:06.000Z

@ansukla Hi, is there any update? Thanks.

Answer 5 · 2024-02-19T07:33:49.000Z

I am facing the same issue of repeated sections. I had to post-process the HTML and truncate it to avoid the repetition, but that approach is not very efficient. It would be better to get an exact HTML extraction with no repetition directly from llmsherpa, to avoid unnecessary problems in production.
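As an alternative to truncating the output afterwards, one possible workaround is to render only the first-level blocks from outside the library. This is a sketch under assumptions, not a confirmed fix: it assumes the document exposes `root_node.children` holding the top-level blocks, and that each block accepts `to_html(include_children=True, recurse=True)` as in the code quoted in the question. `render_top_level_html` and `_StubBlock` are hypothetical names; the stub exists only so the example runs without a server.

```python
from types import SimpleNamespace


def render_top_level_html(doc) -> str:
    """Render each first-level block once, letting it recurse into its children.

    Assumption: doc.root_node.children lists only the top-level blocks.
    """
    parts = ["<html>"]
    for block in doc.root_node.children:
        parts.append(block.to_html(include_children=True, recurse=True))
    parts.append("</html>")
    return "".join(parts)


class _StubBlock:
    """Hypothetical stand-in for a parsed block, used only for this demo."""

    def __init__(self, html, children=()):
        self.html = html
        self.children = list(children)

    def to_html(self, include_children=False, recurse=False):
        out = self.html
        if include_children:
            # Children include their own children only when recurse is set.
            out += "".join(
                c.to_html(include_children=recurse, recurse=recurse)
                for c in self.children
            )
        return out


stub_doc = SimpleNamespace(
    root_node=_StubBlock("", [
        _StubBlock("<h1>A</h1>", [_StubBlock("<p>a</p>")]),
        _StubBlock("<h1>B</h1>"),
    ])
)
result = render_top_level_html(stub_doc)
```

With a real document, the call would be `render_top_level_html(doc)`; whether the attribute names match the installed llmsherpa version should be checked first.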

Answer 6 · 2024-04-22T23:02:33.000Z

The same is happening to me.
Both to_text and to_html repeat sections in the output.

Answer 7 · 2024-04-25T09:21:55.000Z

I'm facing the same issue with Document.to_text(). I posted my findings and solution in #73.

Answer 8 · 2024-05-20T05:48:35.000Z

> I started my own server given the instructions but when reading doc = pdf_reader.read_pdf('https://classic.clinicaltrials.gov/ProvidedDocs/96/NCT01593696/Prot_SAP_000.pdf') I received the following error: 347, in render_json table_block["left"], KeyError: 'left'
>
> Any insights on how to address this would be greatly appreciated.

This seems to be the same issue as the one reported in nlmatics/nlm-ingestor#18.