CrossRef/pdfextract

Section Breakdown

netconstructor opened this issue · 1 comments

I am trying to wrap my head around this utility, but it seems unable to determine sections within a PDF. For example, let's say that one is looking to extract information from documents that contain blocks of text/paragraphs, where each of these content blocks has a title. These sections could be defined by larger text titling the section, which might be in upper case, might be italic, might be underlined... or any combination of those elements.

So, what I am looking for is a way to get this utility to detect such a pattern, return the content of the document, and annotate each of these sections with corresponding tree pattern markers.

How would one go about this?

kjw commented

This tool attempts to detect the flow of text along a page. To do this, it tries to understand the columns within a page and discards any blocks of text that appear not to follow the bounds of the detected columns (to avoid including text from cut-outs, figures, figure descriptions, etc., in the page flow).
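To make the column idea concrete, here is a minimal Python sketch of that kind of filter. It is not the tool's actual code: the `Block` type, the `page_flow` function, the tolerance value, and the assumption that column bounds have already been detected are all hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A block of text with its horizontal extent on the page (hypothetical type)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def in_column(block, columns, tolerance=5.0):
    """Return True if the block fits inside any (left, right) column, within a tolerance."""
    return any(
        block.x0 >= left - tolerance and block.x1 <= right + tolerance
        for left, right in columns
    )


def page_flow(blocks, columns, tolerance=5.0):
    """Keep only blocks that follow the column bounds, preserving their order."""
    return [b for b in blocks if in_column(b, columns, tolerance)]


# Assumed example: a two-column page with one caption that straddles both columns.
columns = [(50.0, 290.0), (310.0, 550.0)]
blocks = [
    Block("left paragraph", 50.0, 288.0),
    Block("figure caption", 200.0, 400.0),
    Block("right paragraph", 312.0, 550.0),
]
flow = page_flow(blocks, columns)
```

In this example the caption spans both columns, so it is dropped from the flow while the two paragraphs are kept.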

The tool will then try to pick out what look like headers, and use these to delineate the page flow into sections. This part of the tool is not well developed; it was still quite error-prone when I stopped work on this tool a few years ago.
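As a rough illustration of header-based sectioning, the following Python sketch treats a short line as a header if it is set noticeably larger than the body text or is entirely upper case. This is an assumption-laden example, not the tool's implementation: the `Line` type, the 1.15 size ratio, the 80-character limit, and the function names are all hypothetical, and italic or underlined titles are not handled.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """One line of page flow with its font size (hypothetical type)."""
    text: str
    font_size: float


def looks_like_header(line, body_size, size_ratio=1.15, max_len=80):
    """Heuristic: short, and either larger than body text or all upper case."""
    text = line.text.strip()
    if not text or len(text) > max_len:
        return False
    letters = [c for c in text if c.isalpha()]
    is_upper = bool(letters) and all(c.isupper() for c in letters)
    is_large = line.font_size >= body_size * size_ratio
    return is_large or is_upper


def split_into_sections(lines):
    """Group lines into {"title", "body"} dicts, starting a new one at each header."""
    if not lines:
        return []
    # Assume the median font size is the body text size.
    body_size = median(l.font_size for l in lines)
    sections = []
    current = {"title": None, "body": []}
    for line in lines:
        if looks_like_header(line, body_size):
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": line.text.strip(), "body": []}
        else:
            current["body"].append(line.text)
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


lines = [
    Line("INTRODUCTION", 10.0),
    Line("Some text here.", 10.0),
    Line("More text.", 10.0),
    Line("Methods", 14.0),
    Line("We did things.", 10.0),
]
sections = split_into_sections(lines)
```

A heuristic like this will misfire on things such as short upper-case acronyms on their own line, which is one reason header detection tends to be error-prone.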

To extract certain types of objects (say, sections), you will need to include the relevant command-line option (--sections, etc.).