Articles-extractor is a project designed to extract figures and tables, along with their page numbers and bounding boxes, from PDF documents. The extracted data is presented in a tabular format using Streamlit for easy visualization.
The project is under active development, with a focus on addressing the following key issues:
- Handling Multi-Block Pages: Enhancements are being made to handle pages with multiple blocks of content, ensuring accurate extraction of figures and tables.
- Improving Layout Model and OCR Accuracy: Ongoing efforts are dedicated to improving the accuracy of the layout model and Optical Character Recognition (OCR) for precise figure and table extraction.
The project comprises the following components:
- PDF Extraction: Utilizes a PDF parsing module to extract content, bounding boxes, and page numbers from PDF documents.
- Layout Model: Employs a layout model to identify figures and tables based on their formatting and layout.
- OCR (Optical Character Recognition): Applies OCR to extract captions and labels associated with figures and tables.
- Streamlit Web Application: Displays extracted figures and tables, along with corresponding page numbers and bounding boxes, in tabular format using Streamlit, providing an interactive user interface.
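The output these components produce (figures and tables with page numbers and bounding boxes, shown as a table) can be sketched with a small record type. This is a minimal illustration rather than the project's actual code: the `ExtractedItem` class, its field names, the `to_rows` helper, and the `(x0, y0, x1, y1)` box convention are all assumptions.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class ExtractedItem:
    """One figure or table found in a PDF (hypothetical structure)."""
    kind: str                                # "figure" or "table"
    page: int                                # page number (1-based is assumed)
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed convention
    caption: str = ""                        # caption text recovered by OCR, if any


def to_rows(items: List[ExtractedItem]) -> List[dict]:
    """Flatten items into row dicts, sorted by page and then by top edge."""
    ordered = sorted(items, key=lambda it: (it.page, it.bbox[1]))
    rows = []
    for it in ordered:
        row = asdict(it)
        # Split the bounding box into one column per coordinate.
        x0, y0, x1, y1 = row.pop("bbox")
        row.update({"x0": x0, "y0": y0, "x1": x1, "y1": y1})
        rows.append(row)
    return rows


# Made-up sample values, for illustration only.
items = [
    ExtractedItem("table", 3, (50.0, 400.0, 550.0, 700.0), "Table 1"),
    ExtractedItem("figure", 1, (72.0, 100.0, 540.0, 380.0), "Figure 1"),
]
rows = to_rows(items)
```

A list of row dictionaries like this can be turned into a table with `pandas.DataFrame(rows)` and displayed with Streamlit's `st.dataframe`.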
Ensure you have the following installed:
- Python 3.7 or higher
- Libraries: pandas, numpy, streamlit, Deepdoctection, Detectron2, CGBoost, joblib, NLTK, XGBoost
- Download and install Python 3.7 or a higher version from Python's official website.
- Follow the official Docker installation guide to install Docker.
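Before installing, it can help to check which prerequisites are already present. The following is a rough sketch, not part of the project: the import names in `REQUIRED` are assumed from the library list above, and CGBoost is left out because its import name is not known.

```python
import importlib.util
import sys
from typing import Iterable, List

# Assumed import names for the libraries listed above (CGBoost omitted).
REQUIRED = ["pandas", "numpy", "streamlit", "deepdoctection", "detectron2",
            "joblib", "nltk", "xgboost"]


def python_ok(minimum=(3, 7)) -> bool:
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found as importable top-level modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing packages:", missing_packages(REQUIRED) or "none")
```

Any name reported as missing should be installed by the `pip install -r requirements.txt` step, provided it is listed in that file.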
- Open a terminal or command prompt.
- Run the following command to clone the repository:
    git clone https://github.com/your-username/Articles-extractor.git
- Navigate to the project directory:
    cd Articles-extractor
- Run the following command to install the required libraries:
    pip install -r requirements.txt
- Run the following command to build the Docker image:
    docker-compose build
- After the build is complete, use the following command to run the application:
    docker-compose up
- Access the application in your web browser at http://localhost:8501.
- Upload your PDF documents to the application.
- The application will display the extracted figures and tables in tabular form, along with their respective page numbers and bounding boxes.
Make sure the files uploaded are in the same folder as the application.
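The same-folder requirement can be checked up front. The helper below is a hypothetical sketch (the name `find_local_pdf` and its behavior are assumptions, not the application's own code): it looks for a PDF with the given name directly inside the application folder.

```python
from pathlib import Path
from typing import Optional


def find_local_pdf(filename: str, app_dir: Path) -> Optional[Path]:
    """Return the PDF's path if it sits directly in app_dir, else None."""
    # Keep only the file name so any directory part in the input is ignored.
    candidate = app_dir / Path(filename).name
    if candidate.suffix.lower() == ".pdf" and candidate.is_file():
        return candidate
    return None
```

For example, `find_local_pdf("paper.pdf", Path.cwd())` returns the path only if `paper.pdf` exists in the current working directory.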
Soulala Achraf | Email: achrafs758@gmail.com
This project is licensed under the MIT License - see the LICENSE file for details.