OCR Image Parser: The script takes images and PDFs and extracts their text as JSON. The data folder contains an input folder and an output folder.
The script iterates over the files in data/input and processes each one according to its file type.
Image files (.jpeg or .png) are processed with OCR using Tesseract.
PDF files (.pdf) in the data/input folder are parsed using Camelot.
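
The dispatch on file type described above could look roughly like the sketch below. It is an illustration, not the script's actual code: the names `ocr_image`, `parse_pdf` and `parse_file` are hypothetical, the shape of the returned dictionaries is an assumption, and it assumes the `pytesseract`, `Pillow` and `camelot-py` packages are installed. Note that Camelot extracts tables from PDFs, so the PDF branch returns table rows rather than free text.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpeg", ".png"}  # the image extensions named above


def ocr_image(path):
    """Hypothetical helper: OCR one image with Tesseract via pytesseract."""
    # Imported lazily so the dispatch logic can run without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return {"text": pytesseract.image_to_string(img)}


def parse_pdf(path):
    """Hypothetical helper: extract tables from one PDF with Camelot."""
    import camelot

    tables = camelot.read_pdf(str(path))
    # Each Camelot table exposes its cells as a pandas DataFrame (.df).
    return {"tables": [table.df.values.tolist() for table in tables]}


def parse_file(path, image_parser=ocr_image, pdf_parser=parse_pdf):
    """Pick a parser from the file extension; return None if unsupported."""
    # Lowercasing the extension is an assumption, so ".PNG" is also accepted.
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return image_parser(path)
    if suffix == ".pdf":
        return pdf_parser(path)
    return None
```

Passing the parsers as arguments keeps the extension check separate from the heavy dependencies, which makes it easy to try the dispatch with stand-in functions.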
The output files are stored in the data/output folder as filename.json.
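
As a rough sketch of the output step, the snippet below maps an input file to its JSON file and writes the parsed result. The function names `output_path_for` and `write_result` are hypothetical, and it assumes that "filename" means the input name without its extension (so `scan.png` becomes `scan.json`); the actual script may name files differently.

```python
import json
from pathlib import Path


def output_path_for(input_path, output_dir="data/output"):
    """Map e.g. data/input/scan.png to data/output/scan.json (assumed naming)."""
    return Path(output_dir) / (Path(input_path).stem + ".json")


def write_result(input_path, result, output_dir="data/output"):
    """Write a parsed result as JSON and return the path that was written."""
    out = output_path_for(input_path, output_dir)
    out.parent.mkdir(parents=True, exist_ok=True)  # create data/output if missing
    with out.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII OCR text readable in the file.
        json.dump(result, f, ensure_ascii=False, indent=2)
    return out
```

Under this naming assumption, an image and a PDF that share a base name would write to the same JSON file, so a real implementation may need to keep the original extension in the output name.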
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)