Doc Matching
Similar Document Template Matching Algorithm

Doc Matching - A Similar Document Template Matching ML Model to detect fraudulent documents for insurance claims
Pre - Smart India Hackathon '23 VJTI - Team TechnoSrats


📝Description

Fraudulent transactions and invoices are a serious problem in the financial services and insurance industries; KPMG reported over a billion dollars in losses due to fraudulent transactions. Thousands of man-hours are lost each year to tedious manual checking of invoices and documents to confirm their validity. Standard information common to most insurance-related documents also has to be extracted. With the advent of advanced computer vision and object detection models, automating these laborious tasks has become possible.

The key features are:
  • Detect and extract standardised fields of information from important documents, such as:
    • Invoice number
    • Total amount
    • Personal details of the claimant
  • Check for common markers of fake invoices, such as:
    • Minor changes to invoice details, such as a different logo colour, date of issue or claimant name
    • Grammatical errors
    • Changed positions of tables or of the service provider details on the invoice
  • Flag fraudulent and suspicious documents in red and amber, respectively, on the Dashboard
  • Detect and group patterns in existing and new documents, and present the related templates and patterns visually in a clustering chart
  • Problem Statement ID: SIH1441
  • Problem Statement Title: Similar Document Template Matching Algorithm from Bajaj Finserv Health Ltd
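
The layout check and the red/amber flagging listed above can be illustrated with a small sketch. This is not the project's model: it assumes each document has already been reduced to named bounding boxes (for example by the detection model), and the `Box` type, the region names and both thresholds are hypothetical values chosen for illustration. It compares a document's regions against a known template using intersection over union (IoU) and maps the result to a flag.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """Axis-aligned region of a page, in page-relative coordinates (0 to 1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def layout_similarity(template: Dict[str, Box], document: Dict[str, Box]) -> float:
    """Mean IoU over the template's regions; a region missing from the document scores 0."""
    if not template:
        return 1.0
    scores = [
        iou(box, document[name]) if name in document else 0.0
        for name, box in template.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical thresholds, not taken from the project.
AMBER_BELOW = 0.9  # below this the document is treated as suspicious
RED_BELOW = 0.6    # below this the document is treated as fraudulent


def flag(similarity: float) -> str:
    """Map a layout similarity score to a dashboard colour."""
    if similarity < RED_BELOW:
        return "red"
    if similarity < AMBER_BELOW:
        return "amber"
    return "ok"


if __name__ == "__main__":
    template = {"logo": Box(0.05, 0.05, 0.25, 0.15), "table": Box(0.1, 0.4, 0.9, 0.8)}
    # Same invoice, but the table has been shifted down the page.
    shifted = dict(template, table=Box(0.1, 0.5, 0.9, 0.9))
    print(flag(layout_similarity(template, shifted)))  # prints: amber
```

A mean over regions keeps the score easy to interpret, but it also lets one badly displaced region be averaged away; a stricter variant could flag on the minimum IoU instead.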

Flowcharts

[Two flowchart images: project management process infographics]

🔗Links

Assets

Backend (Hasura and Render)

🤖Tech-Stack

Web Development

  • Next.js
  • Material UI

Database

  • PostgreSQL (using Supabase)

APIs

  • Hasura GraphQL API (over the Postgres DB)
  • FastAPI (for the model)

Machine Learning

  • TensorFlow (for the deep-learning-based bounding box model)
  • scikit-learn (for NLP-based Named Entity Recognition)

🛠Project Setup

For the web-app

  1. Clone the GitHub repo
    $ git clone https://github.com/saRvaGnyA/similar-doc-matching.git
    
  2. Enter the client directory and install all the required dependencies. Before proceeding, ensure that any globally installed packages such as the React CLI, Tailwind CLI, PostCSS CLI or ESLint are uninstalled.
    $ cd client
    $ yarn install
    
  3. Setup the .env file for storing the environment variables. A demo file for this is as follows:
    NEXT_PUBLIC_HASURA_ADMIN_SECRET = your hasura admin key
    NEXT_PUBLIC_SUPABASE_ANON_KEY = your supabase anon key
    NEXT_PUBLIC_SUPABASE_URL = your supabase public url
    
  4. If you are working on Visual Studio Code or WebStorm, it'd be convenient to install the extensions for Prettier and ESLint.
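
A quick way to confirm that the .env file from step 3 is complete is to check it for the three variables listed there. The following is a minimal sketch using only the Python standard library. The `client/.env` path is an assumption based on the directory used in step 2, and the parser handles only simple `KEY = value` lines, not the full dotenv syntax.

```python
from pathlib import Path
from typing import Dict, List

# The three variables named in the demo .env file above.
REQUIRED_KEYS = (
    "NEXT_PUBLIC_HASURA_ADMIN_SECRET",
    "NEXT_PUBLIC_SUPABASE_ANON_KEY",
    "NEXT_PUBLIC_SUPABASE_URL",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str) -> List[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path("client/.env")  # assumed location, relative to the repo root
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(text)
    print("missing:", ", ".join(missing) if missing else "none")
```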

For the model

  1. Clone the GitHub repo
    $ git clone https://github.com/saRvaGnyA/similar-doc-matching.git
    
  2. Create a virtual environment from the Anaconda command prompt (install conda first if it is not already installed), then switch to that environment. Let's say the environment is named test.
    $ conda create -n test python=3.8 anaconda
    $ conda activate test
    
  3. Locate requirements.txt and install the packages.
    $ pip install -r requirements.txt
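
After the install finishes, a short check can confirm that the environment looks as expected. The sketch below is a suggestion rather than part of the repo. The module names in `ASSUMED_MODULES` are inferred from the Tech-Stack section and are not read from requirements.txt, so adjust them to match the actual file.

```python
import importlib.util
import sys
from typing import Dict, Iterable, Tuple

# Assumed import names for TensorFlow, scikit-learn and FastAPI.
ASSUMED_MODULES = ("tensorflow", "sklearn", "fastapi")


def python_version_ok(expected: Tuple[int, int] = (3, 8)) -> bool:
    """True when the interpreter matches the version used to create the env."""
    return tuple(sys.version_info[:2]) == expected


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Report, per top-level module name, whether it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python 3.8:", python_version_ok())
    for name, found in check_modules(ASSUMED_MODULES).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.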
    

For the FastAPI

  1. Locate the main.py and utils.py files. The FastAPI packages are already installed by step 3 of the previous section.

💻Usage

Once the required setup and installation are complete, you can start developing and running the project.

For the web-app

  1. Go to the client directory and run the dev script to start the development server
    $ npm run dev
    
    Before pushing any commit, make sure to run the lint script and fix any linting errors
    $ npm run lint
    
    If you get an ESLint, Tailwind or PostCSS version conflict error, add the following line to the .env file in the client directory:
    SKIP_PREFLIGHT_CHECK = true
    

For the model and for the FastAPI

  1. Navigate to the Model directory. The models for the project are stored in the gesture_model.tflite file.

  2. Open the Anaconda command prompt and switch to the virtual environment that you created (for example, test).

    $ conda activate test
    
  3. To initiate the server, type the following in the command prompt

    $ python main.py
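
To confirm that the server actually started, you can test whether it accepts TCP connections. The sketch below uses only the standard library. The host and port are assumptions (127.0.0.1:8000 is a common default for FastAPI apps served with uvicorn), so check main.py for the values it really uses.

```python
import socket


def server_is_up(host: str = "127.0.0.1", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    The default host and port are assumptions, not values taken from main.py.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_up())
```

This only shows that a process is listening on the port; it does not verify that the model loaded or that any endpoint responds correctly.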
    

👩‍💻Team Members