Document Similarity Matching

Overview

This project implements a system for matching and categorizing invoices based on their content and structure. It extracts text from PDF invoices, preprocesses the text, extracts relevant features, and calculates similarity scores to identify the most similar invoice from a database.
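The pipeline described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the tiny stop-word set, and the plain term-frequency cosine similarity are all assumptions. A fuller version would presumably use nltk for preprocessing and scikit-learn for feature extraction. Only the `extract_text` helper needs pdfplumber, which is imported lazily so the rest runs with the standard library alone.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word set for illustration; nltk ships a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "on", "is"}


def extract_text(pdf_path):
    """Read the text of every page in a PDF using pdfplumber."""
    import pdfplumber  # imported here so the other helpers work without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_text, database):
    """Return (name, score) for the database entry closest to query_text.

    `database` maps an invoice name to its already-extracted text.
    """
    query_tokens = preprocess(query_text)
    scores = {
        name: cosine_similarity(query_tokens, preprocess(text))
        for name, text in database.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In use, the text of each stored invoice would come from `extract_text`, and `most_similar` would then be called with the text of the incoming invoice. Note that `most_similar` raises a `ValueError` if the database is empty.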

Project Structure

Requirements

Ensure you have Python 3.x installed. You will also need the following Python libraries:

  • pdfplumber
  • nltk
  • scikit-learn

These can be installed from the requirements.txt file by running pip install -r requirements.txt.

Installation

  1. Clone the Repository:

    git clone <repository_url>
    cd document_similarity_matching