nlmatics/llmsherpa

Not able to get all the subsection names inside a section

Amy-raj opened this issue · 2 comments

Hi, I am using the attached PDF for testing. There is no whitespace between each subsection title and its content, and the parser is not able to extract all the subsection titles within a section. I tried a different PDF that does have whitespace there, and it worked well. Could you please advise how to extract a specific subsection title along with its corresponding content?
RWXcE3.pdf.pdf

Hi Amy-raj,

The sections seem to parse quite well. You can get the first-level sections by traversing the children of the root, and then get the next level of sections by traversing the children of each section. Hope this helps.
2023-12-20_08-26-58
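
For reference, here is a minimal sketch of the two-level traversal described above. It is not taken from the library: `Node` is a hypothetical stand-in so the snippet runs on its own, and the subsection titles in the sample tree are made up for illustration. It assumes each parsed node exposes `tag`, `title`, and `children`, and that section nodes carry the tag `"header"`; if that matches your parsed document, passing its root node to `section_outline` should work the same way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed layout node (not the library's class)."""
    tag: str                      # assumed: "header" marks a section, "para" a paragraph
    title: str = ""
    children: List["Node"] = field(default_factory=list)


def child_sections(node: Node) -> List[Node]:
    """Return only the direct children of `node` that are sections."""
    return [child for child in node.children if child.tag == "header"]


def section_outline(root: Node) -> Dict[str, List[str]]:
    """Map each first-level section title to the titles of its subsections."""
    return {
        section.title: [sub.title for sub in child_sections(section)]
        for section in child_sections(root)
    }


# Tiny hand-built tree, for illustration only (subsection titles are invented).
root = Node("root", children=[
    Node("header", "TERM AND TERMINATION", [
        Node("header", "Term"),
        Node("para"),
        Node("header", "Termination for Cause"),
    ]),
    Node("header", "CONFIDENTIALITY", [Node("para")]),
])

outline = section_outline(root)
for title, subsections in outline.items():
    print(title, "->", subsections)
```

If a subsection title is missing from the outline, it may have been parsed as a paragraph rather than a section, so printing the `tag` of every child of the affected section is a quick way to check.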

Hi, I am using the code below to extract subsections, and it is not able to extract all of them. For example, for the "TERM AND TERMINATION" section, it extracts only 4 subsections, whereas 6 are present. I am seeing this issue with many sections in the PDF. You can check the code and output in the image below.
IMG_2093