A Docker-powered service for extracting Table of Contents information from PDF documents
This project aims to extract Table of Contents (TOC) information from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of identifying and structuring the document's TOC.
The pdf-document-layout-analysis service is available here:
https://github.com/huridocs/pdf-document-layout-analysis
Start the service:
# With GPU support
make start
# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu
Get the segments from a PDF:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070
To stop the server:
make stop
- Docker Desktop 4.25.0
- 4 GB RAM
- 6 GB GPU memory (optional; without a GPU, the service runs on the CPU)
As mentioned in the Quick Start, you can call the service like this:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070
To get results faster, at the cost of slightly lower quality, send the request to the /fast endpoint instead:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5070/fast
For more information about models, check this link.
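The same requests can also be sent from a script. The following is a minimal sketch using only the Python standard library, assuming the service listens on localhost:5070 and accepts the PDF in a multipart form field named `file`, as in the curl commands above. The helper names `build_multipart` and `extract_toc` are hypothetical and not part of this project, and the sketch assumes the response body is JSON.

```python
import json
import urllib.request
import uuid
from pathlib import Path

# Assumption: same host and port as in the curl commands above.
BASE_URL = "http://localhost:5070"


def build_multipart(field_name: str, filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode a single file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def extract_toc(pdf_path: str, fast: bool = False):
    """POST the PDF to the service and return the decoded JSON response.

    fast=True targets the /fast endpoint (quicker, slightly lower quality).
    """
    url = f"{BASE_URL}/fast" if fast else f"{BASE_URL}/"
    path = Path(pdf_path)
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the service answers with a JSON document.
        return json.load(response)
```

With the service running, `extract_toc("/PATH/TO/PDF/pdf_name.pdf")` would mirror the first curl command, and `extract_toc("/PATH/TO/PDF/pdf_name.pdf", fast=True)` the second.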
When the process is done, the output will include a list of TOCItem elements, and every TOCItem element will have this information:
{
"indentation": Level of indentation
"label": Content of the respective item
"selectionRectangles": List of rectangles for the respective item
}
And every selectionRectangle item will include this information:
{
"left": Left position of the rectangle
"top": Top position of the rectangle
"width": Width of the rectangle
"height": Height of the rectangle
"page": Page number to which the rectangle belongs
}
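To illustrate how the structure described above might be consumed, here is a small sketch that turns a list of TOCItem elements into an indented text outline. It assumes that `indentation` is an integer level starting at 0 and that the first rectangle of an item gives its page; neither detail is confirmed here. The function name `format_outline` and the sample values are made up for illustration.

```python
def format_outline(toc_items: list[dict], indent: str = "  ") -> str:
    """Render TOCItem dictionaries as an indented outline, one item per line."""
    lines = []
    for item in toc_items:
        rectangles = item.get("selectionRectangles", [])
        # Assumption: the first rectangle marks the page where the item appears.
        page = rectangles[0]["page"] if rectangles else "?"
        prefix = indent * int(item["indentation"])
        lines.append(f"{prefix}{item['label']} (page {page})")
    return "\n".join(lines)


# Illustrative data only; not actual output of the service.
sample_items = [
    {
        "indentation": 0,
        "label": "Introduction",
        "selectionRectangles": [
            {"left": 72, "top": 90, "width": 200, "height": 14, "page": 1}
        ],
    },
    {
        "indentation": 1,
        "label": "Background",
        "selectionRectangles": [
            {"left": 72, "top": 120, "width": 180, "height": 12, "page": 2}
        ],
    },
]

outline = format_outline(sample_items)
```

Printing `outline` shows each label on its own line, indented by its level and followed by its page number.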
To stop the server, run:
make stop