Tool for extracting data from files.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Pytheract


Optical character recognition using Tesseract.

Introduction

An application that extracts meaningful data from files of any type.

Usage

For end users.

An environment for end users is currently being set up.


Flow

  • Upload a file using the frontend.
  • Tesseract extracts the text available in the uploaded file.
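
The extraction step in the flow above can be sketched as follows. This is a minimal illustration, not the repository's actual code: it assumes the `pytesseract` wrapper and Pillow are installed alongside Tesseract, and the names `extract_text` and `ocr` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional


def _tesseract_ocr(image_path: Path) -> str:
    """Run Tesseract on one image via pytesseract (assumed to be installed)."""
    # Imported lazily so this module still loads without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def extract_text(image_path: str, ocr: Optional[Callable[[Path], str]] = None) -> str:
    """Return the stripped text found in an uploaded image file.

    `ocr` is a hypothetical hook that lets callers swap in another engine;
    by default the Tesseract-backed helper above is used.
    """
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    engine = ocr or _tesseract_ocr
    return engine(path).strip()
```

Calling `extract_text("scan.png")` on an uploaded image would return the recognised text; the file name here is only an example.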

Installation

For developers.

Prerequisites

The application has a number of dependencies. Kindly ensure you have the following installed on your machine:

  • Python
  • Python packages (complete details provided below)
  • MongoDB
  • MongoDB Compass (optional; alternatives are available)
  • Tesseract
  • Git


Running the Application

1. Install Python if it is not installed already. Add it to the environment variables and check the version.
    C:\Users\username> python
    Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
2. Install MongoDB if it is not installed already.
3. Install MongoDB Compass (client).
4. Go to the MongoDB bin folder and run the server. It will be available on port 27017.
    C:\Program Files\MongoDB\Server\4.4\bin> mongod
5. Open Compass and connect to the database.
    mongodb://localhost:27017
6. Install Tesseract.
7. Clone the repository.
    git clone https://github.com/SandeepBalachandran/Pytheract.git
8. Change into the cloned repository.
    cd Pytheract
9. If you are using Pipenv, set up the virtual environment and start it as follows:
    pipenv install
    pipenv shell
10. Run Flask. It will be available on port 5000.
    set FLASK_APP=app.py
    set FLASK_ENV=development
    flask run
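
To confirm the setup described above, a small check script can help. The sketch below uses only the Python standard library; the executable names in `REQUIRED_TOOLS` are assumptions about a typical installation, and the helper names are hypothetical rather than part of the repository.

```python
import shutil
import socket

# Executable names are assumptions; adjust them if your installation differs.
REQUIRED_TOOLS = ("python", "git", "tesseract", "mongod")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    # Ports taken from the steps above: MongoDB on 27017, Flask on 5000.
    print("MongoDB on 27017:", port_is_open("localhost", 27017))
    print("Flask on 5000:", port_is_open("localhost", 5000))
```

Run it after starting `mongod` and `flask run`; a missing tool usually means the program is not installed or not on PATH.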

Features

  • Extract text from PDF files.
  • Extract text from zip files that contain both images and PDF files.
  • Show the webcam in the UI.
  • Capture an image and extract text from the captured image.
  • Locate specific content using regex, e.g. email addresses and phone numbers.
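
The regex feature above could look roughly like this. The patterns are deliberately simple illustrations and are not taken from the repository; real-world email and phone formats vary more than these expressions cover, and the name `find_contacts` is hypothetical.

```python
import re

# Illustrative patterns only; they favour readability over full coverage.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contacts(text: str) -> dict:
    """Return email addresses and phone numbers found in OCR output."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "phones": PHONE_PATTERN.findall(text),
    }


if __name__ == "__main__":
    sample = "Contact: jane.doe@example.com or +1 (555) 123-4567."
    print(find_contacts(sample))
```

Because OCR output is often noisy, it may be worth normalising whitespace before matching and validating the hits afterwards.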

Contribute

Please check the Contributing Guidelines before contributing.