An application that extracts meaningful data from any type of file.
For end users.
Setting up an environment for end users is currently in progress.
- Upload a file using the frontend.
- Tesseract will extract the text available in the uploaded file.
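The extraction step can be pictured with a minimal sketch that calls the Tesseract command-line tool directly. This is an illustration only, not the application's actual code: the helper names `build_tesseract_command` and `extract_text` and the file name in the usage comment are hypothetical, and the sketch assumes the `tesseract` executable is on the PATH and that the uploaded file is an image.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call; 'stdout' as the output base prints the text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout.strip()


# Usage (hypothetical file name):
# print(extract_text("uploaded_scan.png"))
```

Keeping the command construction separate from the subprocess call makes it possible to inspect or log the exact command without running Tesseract.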
For developers.
The application has a number of dependencies. Kindly ensure you have the following installed on your machine:
- Python
- Python packages (Complete details provided below)
- Mongo
- MongoDB Compass (optional; alternatives available)
- Tesseract
- Git
- Python
- Tesseract
- Mongo
- Compass
- Git
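Before going through the installation steps, it may help to see which of the command-line dependencies are already available. The following is a small sketch under stated assumptions: the executable names in `REQUIRED_TOOLS` are guesses that can differ per platform or installer (for example `python3` instead of `python`), and `find_missing_tools` is a hypothetical helper, not part of the project.

```python
import shutil

# Executable names are assumptions; they can differ per platform or install.
REQUIRED_TOOLS = ["python", "mongod", "tesseract", "git"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

A tool reported as missing may still be installed but not added to the PATH, which is common for MongoDB and Tesseract on Windows.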
- Install Python if it is not already installed. Add it to the PATH environment variable and check the version.
C:\Users\username> python
Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
- Install MongoDB if it is not already installed.
- Install MongoDB Compass (client).
- Go to the MongoDB bin folder and run the server:
C:\Program Files\MongoDB\Server\4.4\bin> mongod
It will be available on port 27017.
- Open Compass and connect to the database:
mongodb://localhost:27017
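If Compass cannot connect, a quick check of whether anything is listening on the port can narrow the problem down. The sketch below uses only the Python standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection only shows that some process is listening on the port, not that it is MongoDB.

```python
import socket


def is_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on localhost:27017")
    else:
        print("Nothing is listening on localhost:27017; is mongod running?")
```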
- Install Tesseract.
- Clone the repository:
git clone https://github.com/SandeepBalachandran/Pytheract.git
- Change into the cloned repository:
cd Pytheract
- If you are using Pipenv, set up the virtual environment and install the dependencies as follows:
pipenv install
- Run Flask
set FLASK_APP=app.py
set FLASK_ENV=development
flask run
It will be available on port 5000.
- Extract text from PDF files.
- Extract text from zip files containing both images and PDF files.
- Show the webcam feed in the UI.
- Capture an image and extract text from the captured image.
- Use regex to locate specific content, e.g. email addresses and phone numbers.
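The regex-based lookup in the last item could look roughly like the sketch below. The patterns and the `locate_contacts` helper are illustrative assumptions, not the project's implementation, and they are deliberately simple: real-world email and phone formats vary widely, so both false positives and misses are to be expected.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def locate_contacts(text):
    """Return the email addresses and phone-like numbers found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


# Made-up sample text standing in for OCR output.
sample = "Contact: jane.doe@example.com, phone +1 555-010-9999."
print(locate_contacts(sample))
```

Since OCR output often contains misread characters, patterns like these would likely need to be loosened or followed by a validation step in practice.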
Please check the Contributing Guidelines before contributing.