Optical character recognition (OCR) is a branch of machine vision that recognizes written letters and characters and reproduces them digitally for later use. This opens up many possibilities for the banking industry, including security solutions and document digitization. In this project, we parsed the transactions in bank statements from PDF files into Excel files using the Camelot library. We also captured other important information from the documents using the TensorFlow Object Detection API and Google Tesseract.
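As a rough illustration of the table-parsing step, the sketch below reads every table Camelot detects in a text-based PDF and writes them to one Excel file. It is not the project's pdf_extract_table.py: the function names, the 'output' folder, and the choice of the "stream" flavor are assumptions made for this example, and the Excel export relies on an Excel writer such as openpyxl being installed.

```python
from pathlib import Path


def excel_path_for(pdf_path, out_dir="output"):
    """Return the .xlsx path for a PDF, e.g. statements/jan.pdf -> output/jan.xlsx."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xlsx")


def pdf_tables_to_excel(pdf_path, out_dir="output", flavor="stream"):
    """Extract the tables Camelot finds in a text-based PDF and export them to Excel."""
    import camelot  # imported here so the path helper works without Camelot installed

    # flavor="stream" infers columns from whitespace; "lattice" expects ruled tables.
    # Which one suits a given bank's statement layout is an assumption to verify.
    tables = camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor)

    out_path = excel_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    tables.export(str(out_path), f="excel")
    return tables.n, out_path
```

Calling pdf_tables_to_excel("statement.pdf") (a hypothetical file name) returns the number of tables found and the path of the Excel file. Inspecting tables[0].parsing_report inside the function is a quick way to judge whether the chosen flavor fits a statement layout.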
- Install Python 3.7
- We used CUDA 11.2 with TensorFlow 2.5.0; other TensorFlow versions may need a different CUDA version
- Please make sure you install the CUDA and cuDNN versions that match both your machine and your TensorFlow version
- Refer to this table for confirmation: https://www.tensorflow.org/install/source#gpu
- git clone https://github.com/tzutalin/labelImg.git
- git clone https://github.com/tensorflow/models.git
- conda install cudatoolkit
- pip install -r 'requirements.txt'
- Obtain 'test_cimb' from the repo owner
- Install poppler from here: https://blog.alivate.com.au/poppler-windows/
- Add the poppler bin location to PATH
- Install Ghostscript from here: https://blog.alivate.com.au/poppler-windows/
- Copy all files from '..\models\research\object_detection\protos' to '..\anaconda3\envs\<env_name>\Lib\site-packages\object_detection\protos'
- At line 67 of 'requirements.txt', change 'object-detection @ file:///../Bank_Statement_Digitization/train/models/research' to match the location of 'models/research' on your machine
- Copy setup.py from '..\models\research\object_detection\packages\tf2' to '..\models\research'
- Ensure the PDF files are all text-based (not scanned images)
- Download the Maybank model from Google Drive (https://drive.google.com/drive/folders/1eIoO2t0J5YVVkJJhK9Srbktz9N8kW_5A?usp=sharing) and put it in eagleye_ocr/bank_ocr
- Download Tesseract-OCR from Google Drive (https://drive.google.com/drive/folders/1xNLgTlmYXtCoQ6bMPjAEm9lyLBo6WuQR?usp=sharing) and put it in C:/Program Files/
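The setup steps above can be sanity-checked with a small pre-flight script. The sketch below looks for the poppler and Ghostscript executables on PATH and tests whether a PDF has a text layer by running poppler's pdftotext. The executable names, the 20-character threshold, and the function names are assumptions for illustration; a Ghostscript install that is found by other means may still work even if this check reports it missing.

```python
import shutil
import subprocess

# Candidate executable names per tool (assumed; adjust for your install).
REQUIRED_TOOLS = {
    "poppler": ["pdftoppm", "pdftotext"],
    "Ghostscript": ["gswin64c", "gswin32c", "gs"],
}


def find_missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools for which no candidate executable is found on PATH."""
    return [
        tool
        for tool, candidates in required.items()
        if not any(which(name) for name in candidates)
    ]


def run_pdftotext(pdf_path):
    """Return the text layer of a PDF using poppler's pdftotext ('-' means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def has_text_layer(pdf_path, min_chars=20, extract=run_pdftotext):
    """Treat a PDF as text-based if it yields at least min_chars non-space characters."""
    text = extract(pdf_path)
    return len("".join(text.split())) >= min_chars


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

A scanned statement typically yields little or no text from pdftotext, so has_text_layer returns False for it; such files would need OCR before table extraction.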
- Before running the scripts, ensure that all paths are correctly defined
- Run preprocess_images.py
- Run label_images.py
- Run augment_images.py
- Run split_dataset.py
- Run create_csv_file.py
- Run create_tf_records.py
- Run download_model.py
- Run configure_settings.py
- Run train.py
- Run valid.py
- Run tensorboard.py
- Run test.py
- Run pdf_extract_table.py
- Run extract_metadata.py
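The run order above could be scripted as in the sketch below, which runs each script in sequence and stops at the first failure. It assumes every script lives in the current directory and takes no command-line arguments, which the instructions do not confirm. Steps such as label_images.py and tensorboard.py may be interactive or long-running, so in practice you may prefer to run only a slice of the list.

```python
import subprocess
import sys

# Script order as listed above.
STEPS = [
    "preprocess_images.py",
    "label_images.py",
    "augment_images.py",
    "split_dataset.py",
    "create_csv_file.py",
    "create_tf_records.py",
    "download_model.py",
    "configure_settings.py",
    "train.py",
    "valid.py",
    "tensorboard.py",
    "test.py",
    "pdf_extract_table.py",
    "extract_metadata.py",
]


def run_pipeline(scripts, runner=subprocess.run, python=sys.executable):
    """Run each script with the current interpreter, in order."""
    for script in scripts:
        cmd = [python, script]
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError, so later steps never run on stale output.
        runner(cmd, check=True)


if __name__ == "__main__":
    # Example: run only the two extraction steps on an already trained model.
    run_pipeline(STEPS[-2:])
```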