This project provides a REST API to extract structured text from PDF files, including:
- Text Content
- Font Size
- Font Color (HEX)
- Bold & Italic Styles
- Text Position (X, Y, Width, Height)
- Page Dimensions
The API processes PDFs using Node.js (Express.js) and calls a Python script (pdfminer.six) to extract detailed font information.
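To illustrate the Python side, here is a minimal sketch that uses pdfminer.six's extract_pages to collect text, position, font size, color, and style for each line. It is not the project's actual extractPdf.py. The helper names (to_hex, style_flags, extract), the mapping of the bounding box onto left/top/end_left/end_top, and the bold/italic detection by font name are all assumptions.

```python
def to_hex(color):
    """Convert a gray value or an RGB tuple of 0..1 floats to #RRGGBB."""
    if color is None:
        return "#000000"
    if isinstance(color, (int, float)):
        color = (color, color, color)
    if len(color) == 1:
        color = (color[0],) * 3
    if len(color) != 3:
        # Other color spaces (e.g. CMYK) are not handled in this sketch.
        return "#000000"
    try:
        return "#" + "".join(
            f"{max(0, min(255, round(c * 255))):02X}" for c in color
        )
    except TypeError:
        # Non-numeric components (e.g. pattern names) fall back to black.
        return "#000000"


def style_flags(fontname):
    """Guess bold/italic from the font name (a heuristic, not a guarantee)."""
    name = fontname.lower()
    return {
        "is_bold": "bold" in name,
        "is_italic": "italic" in name or "oblique" in name,
    }


def extract(path):
    """Return per-page line data for the PDF at `path`."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    pages = {}
    for page_index, page in enumerate(extract_pages(path)):
        items = {}
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # Use the first character as representative of the line's font.
                first = next((c for c in line if isinstance(c, LTChar)), None)
                if first is None:
                    continue
                x0, y0, x1, y1 = line.bbox
                items[str(len(items) + 1)] = {
                    "text": line.get_text().strip(),
                    # Assumed mapping of the bounding box onto the API fields.
                    "left": x0, "top": y0, "end_left": x1, "end_top": y1,
                    "font_size": round(first.size, 2),
                    "font_color": to_hex(first.graphicstate.ncolor),
                    **style_flags(first.fontname),
                }
        pages[str(page_index)] = {
            "height": page.height,
            "width": page.width,
            "data": items,
        }
    return pages
```

Calling extract("sample.pdf") inside the activated virtual environment should return a dictionary shaped like the data field of the API response.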
git clone https://github.com/EL-Mehdiri/PDF-Text-Extraction-API.git pdf-extractor-api
cd pdf-extractor-api
npm install
Ensure you have Python 3 installed. Then create a virtual environment and install dependencies:
python3 -m venv venv # Create virtual environment
source venv/bin/activate # Activate (Linux/macOS)
venv\Scripts\activate # Activate (Windows)
pip install -r requirements.txt # Install dependencies
node src/app.js
Endpoint: POST /api/pdf/upload
curl -X POST -F "pdfFile=@sample.pdf" http://localhost:5100/api/pdf/upload
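The same upload can be done from Python with only the standard library. This is a minimal sketch: build_multipart and upload_pdf are hypothetical helper names, and the URL and the pdfFile field name are taken from the curl command above.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, file_bytes, content_type="application/pdf"):
    """Build a single-file multipart/form-data body; return (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(path, url="http://localhost:5100/api/pdf/upload"):
    """POST the PDF at `path` to the upload endpoint and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("pdfFile", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, upload_pdf("sample.pdf") sends the same request as the curl command.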
{
"success": true,
"data": {
"0": {
"height": 841.89,
"width": 595.276,
"data": {
"1": {
"text": "Retirement Plan",
"left": 50,
"top": 700,
"end_left": 250,
"end_top": 730,
"font_size": 14,
"font_color": "#000000",
"is_bold": true,
"is_italic": false
}
}
}
}
}
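The response nests text items under page keys. A client will often want a flat list with width and height instead of end coordinates. The sketch below assumes that end_left and end_top are the opposite corner of the bounding box, so width and height are simple differences. iter_text_items is a hypothetical helper name.

```python
def iter_text_items(response):
    """Yield one flat record per text item in an upload response."""
    if not response.get("success"):
        raise ValueError("extraction failed")
    for page_key, page in response["data"].items():
        for item in page["data"].values():
            yield {
                "page": int(page_key),
                "text": item["text"],
                # Assumption: end_* fields are the far corner of the box.
                "width": item["end_left"] - item["left"],
                "height": item["end_top"] - item["top"],
                "bold": item["is_bold"],
            }


# The sample response shown above, as a Python dictionary.
SAMPLE = {
    "success": True,
    "data": {
        "0": {
            "height": 841.89,
            "width": 595.276,
            "data": {
                "1": {
                    "text": "Retirement Plan",
                    "left": 50, "top": 700, "end_left": 250, "end_top": 730,
                    "font_size": 14, "font_color": "#000000",
                    "is_bold": True, "is_italic": False,
                }
            },
        }
    },
}

bold_texts = [item["text"] for item in iter_text_items(SAMPLE) if item["bold"]]
```

For the sample response, bold_texts contains the single string "Retirement Plan", and that item has a width of 200 and a height of 30.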
pdf-extractor-api/
├── src/
│ ├── controllers/
│ │ ├── pdfController.js
│ ├── services/
│ │ ├── pdfService.js
│ ├── routes/
│ │ ├── pdfRoutes.js
│ ├── middleware/
│ │ ├── multerMiddleware.js
│ ├── scripts/
│ │ ├── extractPdf.py
│ ├── app.js
├── uploads/
├── package.json
├── requirements.txt
├── .gitignore
└── README.md
Create a .env file in the project root:
PORT=5100
npm install -g nodemon
nodemon src/app.js
node src/app.js &
Error: Cannot find module 'dotenv'
npm install dotenv
Error: Python script not found
Ensure extractPdf.py is in src/scripts/ and the dependencies listed in requirements.txt are installed.
pip install -r requirements.txt
Error: Module 'pdfminer.six' not found
pip install pdfminer.six
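The two Python-related errors above can be checked before starting the server. The following is a minimal sketch, assuming the directory layout shown earlier. check_setup is a hypothetical helper and not part of the project. Run it with the virtual environment activated so it inspects the same interpreter the server uses.

```python
import importlib.util
from pathlib import Path


def check_setup(project_root="."):
    """Return a list of human-readable setup problems (empty if none found)."""
    problems = []
    script = Path(project_root) / "src" / "scripts" / "extractPdf.py"
    if not script.is_file():
        problems.append(f"Python script not found: {script}")
    # The pdfminer.six package is imported under the module name "pdfminer".
    if importlib.util.find_spec("pdfminer") is None:
        problems.append("pdfminer.six is not installed; run: pip install pdfminer.six")
    return problems
```

Calling check_setup() from the project root and printing each returned message gives a quick pre-flight check.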
This project is licensed under the MIT License.