This project implements a system for matching and categorizing invoices based on their content and structure. It extracts text from PDF invoices, preprocesses the text, extracts relevant features, and calculates similarity scores to identify the most similar invoice from a database.
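The similarity-scoring step described above can be sketched as follows. This is a minimal, dependency-free illustration and not the project's actual implementation: it assumes the invoice text has already been extracted from the PDFs, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `most_similar` are hypothetical. It uses TF-IDF weighting with cosine similarity, which is one common choice for this kind of matching; scikit-learn provides `TfidfVectorizer` and `cosine_similarity` for the same purpose.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, database):
    """Return (name, score) of the database entry closest to query_text.

    `database` is a hypothetical mapping of invoice name -> extracted text.
    """
    if not database:
        raise ValueError("database must contain at least one invoice")
    names = list(database)
    vectors = tfidf_vectors([query_text] + [database[n] for n in names])
    query_vec, db_vecs = vectors[0], vectors[1:]
    scores = [cosine(query_vec, vec) for vec in db_vecs]
    best = max(range(len(names)), key=scores.__getitem__)
    return names[best], scores[best]
```

Calling `most_similar(new_invoice_text, {"invoice_a": text_a, "invoice_b": text_b})` returns the name of the closest stored invoice together with a score between 0 and 1, where higher means more similar.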
Ensure you have Python 3.x installed. You will also need the following Python libraries:
pdfplumber
nltk
scikit-learn
These can be installed using the requirements.txt file.
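After installing, a quick check like the one below can confirm that the libraries are importable. It is only a sketch: the helper name `missing_modules` is hypothetical, and the only project-specific detail it relies on is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Import names for the libraries listed above (scikit-learn -> sklearn).
    required = ["pdfplumber", "nltk", "sklearn"]
    missing = missing_modules(required)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```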
- Clone the Repository:
git clone <repository_url>
cd document_similarity_matching