SandeepBalachandran/Pytheract

Tool for extracting data from files.

Python, GPL-3.0

Pytheract

Optical character recognition using Tesseract

Table of contents

Introduction
Usage
Installation
Features to include (Help needed)
Contribute

Introduction

An application that extracts meaningful data from files of any type.

Usage

For end users.

An environment for end users is still being set up.

Flow

Upload a file through the frontend.
Tesseract extracts the text found in the uploaded file.
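
The extraction step in this flow can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the `tesseract` command-line tool is installed and on the PATH, and the function names `build_tesseract_command` and `extract_text` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call (hypothetical helper).

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a file.
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    """Run Tesseract on an image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    result = subprocess.run(
        build_tesseract_command(path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout.strip()
```

Calling `extract_text("scan.png")` would return the text of a hypothetical image `scan.png`. Tesseract reads image input, so a PDF would most likely need to be converted to images before this step.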

Installation

For developers.

Prerequisites

The application has a number of dependencies. Kindly ensure you have the following installed on your machine:

Python
- Official download.
Tesseract
MongoDB
- Official download.
MongoDB Compass
- Official download.
Git
- Official download.

Running the Application

1. Install Python if it is not already installed. Add it to the PATH environment variable and check the version.
```
  C:\Users\username> python
  Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
```
1. Install MongoDB if it is not already installed.
2. Install MongoDB Compass (the client).
3. Go to the MongoDB bin folder and start the server:
```
C:\Program Files\MongoDB\Server\4.4\bin> mongod
```
The server will be available on port 27017.
1. Open Compass and connect to the database:
```
  mongodb://localhost:27017
```
1. Install Tesseract
2. Clone the repository
```
git clone https://github.com/SandeepBalachandran/Pytheract.git
```
1. Change into the cloned repository
```
cd Pytheract
```
1. If you are using Pipenv, set up the virtual environment and start it as follows:
```
pipenv install
pipenv shell
```
1. Run Flask
```
set FLASK_APP=app.py
set FLASK_ENV=development
flask run 
```
The application will be available on port 5000.
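
Once the steps above are done, a small pre-flight script can confirm that everything is in place. The sketch below is not part of the repository. It uses only the standard library, the names `is_port_open` and `check_setup` are hypothetical, and the minimum Python version of 3.8 is an assumption taken from the version shown in step 1.

```python
import shutil
import socket
import sys


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_setup(host="127.0.0.1"):
    """Collect a pass/fail result for each prerequisite."""
    return {
        # Assumed minimum, based on the 3.8.5 interpreter shown in step 1
        "python >= 3.8": sys.version_info >= (3, 8),
        "tesseract on PATH": shutil.which("tesseract") is not None,
        "git on PATH": shutil.which("git") is not None,
        "MongoDB on port 27017": is_port_open(host, 27017),
        "Flask on port 5000": is_port_open(host, 5000),
    }


if __name__ == "__main__":
    for name, ok in check_setup().items():
        print(f"{'OK  ' if ok else 'FAIL'}  {name}")
```

The two port checks only pass while `mongod` and `flask run` are actually running, so a `FAIL` there may simply mean the server has not been started yet.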

Features to include (Help needed)

Extract text from PDF files.
Extract text from zip files that contain both images and PDF files.
Show the webcam feed in the UI.
Capture an image and extract text from it.
Use regular expressions to locate specific content, e.g. email addresses and phone numbers.
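
As a starting point for the regex feature, here is a minimal sketch of how email addresses and phone numbers might be located in extracted text. It is only an illustration: the patterns are deliberately simple assumptions that catch common formats rather than every valid one, and the name `find_contacts` is hypothetical.

```python
import re

# Simplified patterns (assumptions): they cover common formats only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional "+", then at least nine digits/separators, starting and ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_contacts(text):
    """Return the email addresses and phone numbers found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": [match.strip() for match in PHONE_RE.findall(text)],
    }


sample = "Contact: jane.doe@example.com or call +1 555-123-4567."
print(find_contacts(sample))
# {'emails': ['jane.doe@example.com'], 'phones': ['+1 555-123-4567']}
```

OCR output is often noisy, so real patterns would probably need to tolerate stray spaces and misread characters.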

Contribute

Please check the Contributing Guidelines before contributing.