This project is a document classification system designed to identify and categorize documents into pre-defined categories. The classifier supports Invoices, Bank Statements, and Driver's Licenses. It leverages a basic Logistic Regression machine learning model and a Flask-based REST API to handle file uploads, extract text, and classify documents.
Video Demo: Document Classifier Demo
- Flask API: Provides endpoints to classify uploaded files.
- Text Extraction: Uses PyMuPDF, Tesseract OCR, and other libraries to extract text from PDFs, images, DOCX, and XLSX files (a sketch of how this dispatch can look follows the feature list).
- Classification Model: A Logistic Regression model trained with TF-IDF features.
- Logging: Implements structured logging for debugging and auditing purposes.
- File Types Supported:
  - PDFs (`.pdf`)
  - Images (`.jpg`, `.png`)
  - Word Documents (`.docx`)
  - Excel Spreadsheets (`.xlsx`)
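The project's actual extraction code lives in its `file_io` module (not reproduced in this README). Purely as a sketch, extraction could dispatch on file extension as below, assuming PyMuPDF (`fitz`) and pytesseract for PDFs and images as described above, and python-docx and openpyxl for Office files (the latter two are assumptions); the `extract_text` name is illustrative.

```python
# Hypothetical sketch of a file_io-style text extractor; the real module may differ.
from pathlib import Path

import fitz                     # PyMuPDF
import pytesseract
from PIL import Image
from docx import Document       # python-docx
from openpyxl import load_workbook


def extract_text(path: str) -> str:
    """Extract raw text from a supported file, dispatching on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix in {".jpg", ".png"}:
        return pytesseract.image_to_string(Image.open(path))
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".xlsx":
        rows = load_workbook(path, read_only=True).active.iter_rows(values_only=True)
        return "\n".join(" ".join(str(c) for c in row if c is not None) for row in rows)
    raise ValueError(f"Unsupported file type: {suffix}")
```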
Prerequisites
- Install Python 3.9+ (this project was developed with Python 3.12.3).
- Install dependencies: `pip install -r requirements.txt`
- Ensure Tesseract OCR is installed on your system:
  - Ubuntu: `sudo apt-get install tesseract-ocr`
  - MacOS: `brew install tesseract`
  - Windows: Download the installer from the official website.
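To confirm the Tesseract binary is visible to Python before starting the server, a quick sanity check with pytesseract (assuming pytesseract is among the pinned dependencies) looks like:

```python
import pytesseract

# Raises pytesseract.TesseractNotFoundError if the tesseract binary is not on PATH.
print(pytesseract.get_tesseract_version())
```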
- Run the Flask app:
  - Start the Flask server: `python -m src.app`
  - Access the API at `http://127.0.0.1:5000/classify_file`
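The actual `src/app.py` is not reproduced here; as a rough sketch, the `/classify_file` endpoint plausibly does something like the following. The `extract_text` helper and the model/vectorizer file names are assumptions standing in for the project's real code, not its confirmed API.

```python
# Hypothetical sketch of the /classify_file endpoint; the real src/app.py may differ.
import tempfile
from pathlib import Path

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed artifact names; train_model.py saves a model and vectorizer to ./src/models/.
MODEL = joblib.load("./src/models/model.joblib")
VECTORIZER = joblib.load("./src/models/vectorizer.joblib")


@app.route("/classify_file", methods=["POST"])
def classify_file():
    uploaded = request.files.get("file")
    if uploaded is None:
        return jsonify({"error": "no file provided"}), 400

    # Save the upload to a temporary file so extraction code can open it by path.
    suffix = Path(uploaded.filename or "").suffix
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        uploaded.save(tmp.name)
        text = extract_text(tmp.name)  # hypothetical helper from the file_io module

    label = MODEL.predict(VECTORIZER.transform([text]))[0]
    return jsonify({"file_class": label})


if __name__ == "__main__":
    app.run(debug=True)
```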
API Endpoint
- POST /classify_file

Headers
- `Content-Type: multipart/form-data`

Request Body
- `file`: (file) The file to classify

```bash
curl -X POST http://127.0.0.1:5000/classify_file \
  -F "file=@./examples/sample_invoice.pdf"
```

Response

```json
{
  "file_class": "invoices"
}
```
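The same request can be made from Python, assuming the `requests` package is available (it is not necessarily part of this project's requirements):

```python
import requests

# Upload a sample document to the running Flask server and print the predicted class.
with open("./examples/sample_invoice.pdf", "rb") as fh:
    response = requests.post(
        "http://127.0.0.1:5000/classify_file",
        files={"file": fh},
    )

response.raise_for_status()
print(response.json())  # e.g. {"file_class": "invoices"}
```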
- Run the tests: `pytest`
- Run Unit Tests: `pytest tests/unit`
- Run E2E Tests: `pytest tests/e2e`
- Run Tests with markers: `pytest -m slow` or `pytest -m fast`
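The `slow` and `fast` markers used above have to be registered with pytest to avoid unknown-marker warnings. One way to do that (whether the project registers them in `conftest.py` or an ini file is an assumption) is:

```python
# conftest.py (sketch): register the custom markers used by `pytest -m slow` / `pytest -m fast`.
def pytest_configure(config):
    config.addinivalue_line("markers", "slow: long-running tests (e.g. end-to-end)")
    config.addinivalue_line("markers", "fast: quick unit tests")
```

An individual test then opts in with `@pytest.mark.slow` or `@pytest.mark.fast`.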
This project uses GitHub Actions for Continuous Integration.
The CI pipeline runs the following steps:
- Install dependencies.
- Run tests using pytest.
Triggering CI
CI is triggered automatically on every push or pull request to the repository.
Project Logs
Logs are saved in the `logs/` directory:
- `app.log`: Logs related to the Flask API.
- `classifier.log`: Logs for the document classification process.
- `file_io.log`: Logs for text extraction.
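The logging configuration itself is not shown in this README; a minimal sketch of how one of these loggers could be set up (the file name comes from the list above, the formatter details are assumed) is:

```python
# Sketch of a module logger writing to logs/classifier.log; the real setup may differ.
import logging
from pathlib import Path

Path("logs").mkdir(exist_ok=True)

logger = logging.getLogger("classifier")
logger.setLevel(logging.INFO)

handler = logging.FileHandler("logs/classifier.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
)
logger.addHandler(handler)

logger.info("Predicted class: %s", "invoices")
```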
While the project demonstrates a functional document classification pipeline, there are several limitations and areas for potential enhancement:
- Synthetic Data: The model is trained predominantly on synthetic data, which may not reflect real-world document diversity. This could lead to poor performance on actual data due to the lack of variability in text patterns, layouts, and noise.
- Recommended Actions:
- Train the model on a dataset containing real-world examples of invoices, bank statements, driver's licenses, and various other types of documents in formats beyond `.pdf`, `.docx`, `.jpg`/`.png`, and `.xlsx`.
- Ensure a diverse dataset to minimize bias and account for various formats, languages, and noise levels.
- Evaluate the model for bias and overfitting by examining metrics like the confusion matrix and precision/recall for each class.
- Use feature importance analysis to determine whether critical document elements (e.g., headers, specific keywords) are being captured effectively.
- Explore pre-trained models or Large Language Models (LLMs) (e.g., GPT, Llama) for feature extraction and classification to improve performance.
- Test on documents from various industries and regions to ensure the model's generalizability.
- Unvalidated Performance: The API and text extraction modules have not been tested for handling large files (e.g., PDFs with hundreds of pages or complex Excel sheets). Memory and performance bottlenecks might occur.
- Recommended Actions:
- Implement file size limits for the API to avoid server crashes.
- Optimize text extraction methods to process large files efficiently by:
- Streaming file content.
- Limiting the number of pages or rows processed.
- Using parallelized text extraction.
- Include MIME type validation for uploaded files to ensure that only supported file types are processed.
- Lack of Strict Validation: The API does not validate the structure of incoming requests beyond checking file types.
- Recommended Actions:
- Use libraries like Pydantic to enforce strict input validation and ensure robust payload handling.
- Define and validate schemas for uploaded files, including expected formats, sizes, and metadata.
- Happy Path Focus: Current tests primarily cover successful cases (e.g., correctly formatted files).
- Recommended Actions:
- Write additional unit tests to cover edge cases and failure scenarios, such as:
- Unsupported file formats.
- Corrupted or empty files.
- Files with noise, overlapping text, or unusual layouts.
- Include integration and end-to-end tests to validate the pipeline holistically.
- Ensure tests include large-scale inputs and unusual document formats.
- Missing Database Integration: The project does not store metadata or logs in a database for auditing, tracking, or analysis.
- Recommended Actions:
- Integrate a database (e.g., SQLite, PostgreSQL) to store:
- Uploaded file metadata (e.g., name, size, type).
- Classification results.
- Logs for auditing and debugging.
- Use an ORM (Object-Relational Mapping) library like SQLAlchemy to interact with the database.
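As a sketch of that database recommendation (table and column names below are illustrative, not part of the project):

```python
# Illustrative SQLAlchemy model for storing upload metadata and classification results.
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ClassificationRecord(Base):
    __tablename__ = "classification_records"

    id = Column(Integer, primary_key=True)
    file_name = Column(String, nullable=False)
    file_type = Column(String)
    file_size_bytes = Column(Integer)
    predicted_class = Column(String)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))


# SQLite keeps the example self-contained; PostgreSQL would only change the URL.
engine = create_engine("sqlite:///classifications.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(ClassificationRecord(file_name="sample_invoice.pdf",
                                     file_type="pdf",
                                     file_size_bytes=24_512,
                                     predicted_class="invoices"))
    session.commit()
```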
This appendix provides a step-by-step guide on labeling data, training the classification model, and generating synthetic data for this project.
The `label_data.py` script is used to label and preprocess files for training the classifier. It processes files in a directory structure where each subdirectory corresponds to a class label.
- Organize Files:
  - Create a directory structure like the following:

    ```
    training_data/
    ├── bank_statements/
    │   ├── file1.pdf
    │   ├── file2.jpg
    ├── invoices/
    │   ├── file3.pdf
    │   ├── file4.png
    ├── drivers_licenses/
    │   ├── file5.pdf
    │   ├── file6.jpg
    ```

  - Each subdirectory name (e.g., `bank_statements`, `invoices`, `drivers_licenses`) is treated as the label for the files it contains.
- Run the Script:
  - Use the `label_data.py` script to process and label the data:

    ```bash
    python src/label_data.py
    ```

  - By default, the script will:
    - Extract text from files using the `file_io` module.
    - Preprocess the text for training.
    - Save the labeled dataset to a file named `dataset.json`.
- Verify the Dataset:
  - The output dataset will look like this:

    ```json
    [
      ["Sample text from bank statement file", "bank_statements"],
      ["Sample text from invoice file", "invoices"],
      ["Sample text from driver's license file", "drivers_licenses"]
    ]
    ```
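The actual `label_data.py` is not reproduced here; based on the behavior described above, its core loop plausibly resembles the sketch below. The `extract_text` helper is an assumption standing in for the project's `file_io` module (the same hypothetical name used in the extraction sketch earlier in this README).

```python
# Hypothetical sketch of label_data.py's core loop; the real script may differ.
import json
from pathlib import Path


def build_dataset(root: str = "training_data", output: str = "dataset.json") -> None:
    """Walk label subdirectories, extract text from each file, and save (text, label) pairs."""
    dataset = []
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        label = label_dir.name  # e.g. "bank_statements", "invoices", "drivers_licenses"
        for file_path in sorted(label_dir.iterdir()):
            text = extract_text(str(file_path))  # hypothetical file_io helper
            dataset.append([text, label])
    Path(output).write_text(json.dumps(dataset, indent=2))
```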
The `train_model.py` script is used to train the document classifier.
- Prepare the Dataset:
  - Ensure `dataset.json` is present in the root directory. This file is generated by the `label_data.py` script.
- Run the Training Script:
  - Use the following command to train the model:

    ```bash
    python src/train_model.py
    ```

  - The script will:
    - Load and preprocess the dataset.
    - Split the data into training, validation, and test sets.
    - Train a `LogisticRegression` model using TF-IDF features.
    - Save the trained model and vectorizer to the `./src/models/` directory.
- Check the Training Results:
  - The script outputs metrics such as precision, recall, and F1-score for the validation and test sets.
  - Example output:

    ```
    Validation Set Results:
                      precision    recall  f1-score   support
    bank_statements        0.98      1.00      0.99       102
    drivers_licenses       1.00      1.00      1.00        95
    invoices               0.97      0.98      0.98       104

    Test Set Results:
                      precision    recall  f1-score   support
    bank_statements        0.99      1.00      0.99       103
    drivers_licenses       1.00      0.99      0.99        95
    invoices               0.98      0.97      0.98       103
    ```
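The actual `train_model.py` is not shown here; a minimal sketch of the pipeline it describes (TF-IDF features, `LogisticRegression`, train/validation/test split, artifacts saved to `./src/models/`) could look like the following. The split ratios and the artifact file names are assumptions.

```python
# Sketch of a TF-IDF + LogisticRegression training run mirroring the steps described above.
import json
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

texts, labels = zip(*json.loads(Path("dataset.json").read_text()))

# 70/15/15 split into train, validation, and test sets (ratios assumed).
x_train, x_tmp, y_train, y_tmp = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
x_val, x_test, y_val, y_test = train_test_split(
    x_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(x_train), y_train)

print("Validation Set Results:")
print(classification_report(y_val, model.predict(vectorizer.transform(x_val))))
print("Test Set Results:")
print(classification_report(y_test, model.predict(vectorizer.transform(x_test))))

Path("./src/models").mkdir(parents=True, exist_ok=True)
joblib.dump(model, "./src/models/model.joblib")
joblib.dump(vectorizer, "./src/models/vectorizer.joblib")
```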
The project provides scripts to generate synthetic data for training and testing purposes.
- Invoices:
  - Use the `synthetic_invoices.py` script:

    ```bash
    python src/data_gen/synthetic_invoices.py
    ```

  - This script generates synthetic invoices in PDF, DOCX, and XLSX formats. Adjust the `count` parameter to generate the desired number of files.
- Bank Statements:
  - Use the `synthetic_bank_statements.py` script:

    ```bash
    python src/data_gen/synthetic_bank_statements.py
    ```

  - This script generates synthetic bank statements in PDF, DOCX, and XLSX formats.
- Driver's Licenses:
  - Use the `synthetic_drivers_licenses.py` script:

    ```bash
    python src/data_gen/synthetic_drivers_licenses.py
    ```

  - This script generates driver's licenses in JPG format with random details such as name, address, and license number.
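The generator scripts themselves are not reproduced in this README. Purely as an illustration of the idea, a synthetic invoice PDF could be produced with Faker and ReportLab; the library choice, field names, and output paths below are assumptions, not necessarily what the project's `data_gen` scripts use.

```python
# Illustrative synthetic-invoice generator; the project's data_gen scripts may use other libraries.
from faker import Faker
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

fake = Faker()


def generate_invoice(path: str) -> None:
    """Write a one-page PDF with randomized invoice-like fields."""
    pdf = canvas.Canvas(path, pagesize=letter)
    pdf.drawString(72, 750, "INVOICE")
    pdf.drawString(72, 720, f"Invoice #: {fake.random_int(1000, 9999)}")
    pdf.drawString(72, 700, f"Billed to: {fake.company()}")
    pdf.drawString(72, 680, f"Date: {fake.date()}")
    pdf.drawString(72, 660, f"Amount due: ${fake.pyfloat(left_digits=4, right_digits=2, positive=True)}")
    pdf.save()


for i in range(5):  # adjustable, mirroring the scripts' `count` parameter
    generate_invoice(f"synthetic_invoice_{i}.pdf")
```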
Thank you! 🚀