FileStruct is a high-level Python library that aims to extract the overall structure of documents, particularly PDFs, based on visual information such as size, color and font.
As clever human beings, we are able to detect titles, subtitles, and paragraphs using the visual appearence of the document. A big text in red most certainly represent a title (or subtitle). Using these heuristics, we are able to structure a document : This paragraph belongs to this section. The same method is used by this package to provide an automated, while realistic way to structure a document. The method is described bellow :
- Text and style extraction : We rely on lower level librairies (like PyMuPDF) for the extraction of the text and style information, and the ordering of each block of text.
- Tree creation : A tree is created, in which each block of text is a node of the tree. A child of a node in the tree is a subsection of a section in the document.
- Data exportation : The data can be exported in JSON format.
For now, filestruct can only read formats that are supported by PyMuPDF. This includes pdf, epub, xps, mobi, fb2, cbz and svg. I plan to add more file formats in the future.
Install FileStruct using pip :
pip install filestruct
Bellow, a basic usage for a PDF document :
from filestruct.document import PDFDocument
doc = Document("PATH_TO_YOUR_FILE.pdf")
data = doc.to_json() # Export the tree into json format
print(data)
print(doc)