ocr-cpi-covid19
For more details, follow the series on my blog
(Coletando os dados da CPI - Parte I, "Collecting the CPI data - Part I")
Requirements
imagemagick
tesseract
brew (for macOS)
Node v14
npm or yarn
Install
npm install
Process files
PDF Images
Some PDFs contain only images, so they must be converted before being passed to OCR. To make that easier, I created the convert-pdf-images script.
STEP 1
Run node convert-pdf-images to generate PNGs from the PDF, after passing filelocation with the file name.
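As a rough illustration of what this step involves, the sketch below shells out to ImageMagick's convert command to turn a PDF into numbered PNG files. It is not the repository's convert-pdf-images script: the function names buildConvertArgs and convertPdfToPngs, the 300 DPI density, and the output naming pattern are all assumptions made for the example.

```javascript
// Sketch only: names, the 300 DPI density, and the output pattern are
// assumptions, not taken from the repository's convert-pdf-images script.
const { execFileSync } = require('child_process');
const path = require('path');

// Build the argument list for ImageMagick's `convert` command.
// `%03d` makes ImageMagick number the output files (one per page).
function buildConvertArgs(pdfPath, outDir) {
  const base = path.parse(pdfPath).name;
  return ['-density', '300', pdfPath, path.join(outDir, `${base}-%03d.png`)];
}

// Run the conversion. Requires ImageMagick to be installed and on PATH.
function convertPdfToPngs(pdfPath, outDir) {
  execFileSync('convert', buildConvertArgs(pdfPath, outDir), { stdio: 'inherit' });
}

// Example call (hypothetical paths):
// convertPdfToPngs('docs/report.pdf', 'out');
```

Keeping the argument builder separate from the call that runs the command makes it possible to check the generated command without having ImageMagick installed.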
STEP 2
Run node ocr.js after changing the JSON for filelist.
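For orientation, an OCR pass over a list of images could look like the sketch below, which calls the tesseract command-line tool once per file. This is not the repository's ocr.js: the shape of filelist (a plain array of image paths), the function names, and the choice of Portuguese ("por") as the language are assumptions for the example.

```javascript
// Sketch only: the filelist shape, the function names, and the "por"
// language default are assumptions, not taken from the repository's ocr.js.
const { execFileSync } = require('child_process');
const path = require('path');

// Build the argument list for the `tesseract` CLI:
//   tesseract <image> <output base> -l <language>
// Tesseract appends ".txt" to the output base by itself.
function buildTesseractArgs(imagePath, lang) {
  const parsed = path.parse(imagePath);
  const outputBase = path.join(parsed.dir, parsed.name);
  return [imagePath, outputBase, '-l', lang];
}

// Run OCR on every image in the list. Requires tesseract (and the language
// data for `lang`) to be installed and on PATH.
function ocrAll(filelist, lang = 'por') {
  for (const imagePath of filelist) {
    execFileSync('tesseract', buildTesseractArgs(imagePath, lang), { stdio: 'inherit' });
  }
}

// Example call (hypothetical paths):
// ocrAll(['out/report-000.png', 'out/report-001.png']);
```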