Articles-extractor

Description:

Articles-extractor is a project designed to extract figures and tables, along with their page numbers and bounding boxes, from PDF documents. The extracted data is presented in a tabular format using Streamlit for easy visualization.

Current Development Status:

The project is under active development, with a focus on addressing the following key issues:

Handling Multi-Block Pages: Enhancements are being made to handle pages with multiple blocks of content, ensuring accurate extraction of figures and tables.
Improving Layout Model and OCR Accuracy: Ongoing efforts are dedicated to improving the accuracy of the layout model and Optical Character Recognition (OCR) for precise figure and table extraction.

Project Structure:

The project comprises the following components:

PDF Extraction: Utilizes a PDF parsing module to extract content, bounding boxes, and page numbers from PDF documents.
Layout Model: Employs a layout model to identify figures and tables based on their formatting and layout.
OCR (Optical Character Recognition): Applies OCR to extract captions and labels associated with figures and tables.
Streamlit Web Application: Displays extracted figures and tables, along with corresponding page numbers and bounding boxes, in tabular format using Streamlit, providing an interactive user interface.
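
As a rough illustration of how these components could hand data to the Streamlit table, the sketch below models one extracted figure or table as a record and flattens a list of records into rows. The names `ExtractedItem` and `to_rows`, the `(x0, y0, x1, y1)` bounding-box order, and the top-left coordinate origin are assumptions made for this example and are not taken from the project's code.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class ExtractedItem:
    """One figure or table found in a PDF (hypothetical record type)."""
    kind: str                                  # "figure" or "table"
    page: int                                  # 1-based page number
    bbox: Tuple[float, float, float, float]    # assumed order: (x0, y0, x1, y1)
    caption: str = ""                          # caption text from OCR, if any


def to_rows(items: List[ExtractedItem]) -> List[dict]:
    """Flatten items into dict rows, sorted by page and then by top edge (y0)."""
    ordered = sorted(items, key=lambda it: (it.page, it.bbox[1]))
    rows = []
    for it in ordered:
        row = asdict(it)
        # Split the bounding box into one column per coordinate.
        x0, y0, x1, y1 = row.pop("bbox")
        row.update({"x0": x0, "y0": y0, "x1": x1, "y1": y1})
        rows.append(row)
    return rows


if __name__ == "__main__":
    demo = [
        ExtractedItem("table", 2, (10.0, 50.0, 200.0, 120.0), "Table 1"),
        ExtractedItem("figure", 1, (0.0, 0.0, 100.0, 80.0)),
    ]
    for row in to_rows(demo):
        print(row)
```

A list of dict rows like this can be passed to `pandas.DataFrame(rows)`, which is one plausible way to feed a tabular view.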

Requirements:

Ensure you have the following installed:

Python 3.7 or higher
Libraries: pandas, numpy, streamlit, deepdoctection, Detectron2, CGBoost, joblib, NLTK, XGBoost
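
A quick way to check these requirements before running the project is sketched below. It uses only the standard library. The import names in `MODULES` are the usual ones for the listed packages and are an assumption here; CGBoost is left out because its import name is not known from this document. The helper names `python_ok` and `missing_modules` are hypothetical.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 7)

# Assumed import names for the libraries listed above (CGBoost omitted).
MODULES = [
    "pandas", "numpy", "streamlit", "deepdoctection",
    "detectron2", "joblib", "nltk", "xgboost",
]


def python_ok(version=None) -> bool:
    """Return True if the given (or running) Python is at least MIN_PYTHON."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    if not python_ok():
        print("Python 3.7 or higher is required.")
    missing = missing_modules(MODULES)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All checked libraries are importable.")
```

The check only confirms that each module can be located; it does not verify versions or that the import succeeds.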

Installation Guide:

Install Python:

Download and install Python 3.7 or a higher version from Python's official website.

Install Docker (Optional):

Follow the official Docker installation guide to install Docker.

Clone the Repository:

Open a terminal or command prompt.

Run the following command to clone the repository:

```
git clone https://github.com/your-username/Articles-extractor.git
```

Install Project Dependencies:

Navigate to the project directory:
```
cd Articles-extractor
```
Run the following command to install the required libraries:
```
pip install -r requirements.txt
```

Usage with Docker:

Run the following command to build the Docker image:
```
docker-compose build
```
After the build is complete, use the following command to run the application:
```
docker-compose up
```
Access the application in your web browser at http://localhost:8501.
Upload your PDF documents to the application.
The application displays the extracted figures and tables in tabular form, along with their page numbers and bounding boxes.

Note:

Make sure the files you upload are located in the same folder as the application.

Author:

Soulala Achraf | Email: achrafs758@gmail.com

License:

This project is licensed under the MIT License - see the LICENSE file for details.
