This repo automatically extracts text in a given directory for several file formats (pdf,ppt,docx,txt) and saves them in a JSON file. The JSON file has several layers, the first layer has an index as key (counter for the number of files). The second layer has two keys (title and data) referencing the name of the document and its content. The third layer has a single key either referencing the page nb, paragraph nb or slide nb.
To run this code start by installing the packages with the following command in your terminal :
python3 -m pip install -r requirements.txt
Then to launch the code in your current directory, save your file in the nested folder /data/data/ (this format is convenient for the project but unconvenient for other purposes, is bound to change) and launch with the following command in your terminal:
python3 main.py
You also need to install tesseract on your machine
On Linux
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libtesseract-dev
On Mac
brew install tesseract
On Windows
download binary from https://github.com/UB-Mannheim/tesseract/wiki.
Then add to your script:
pytesseract.pytesseract.tesseract_cmd = 'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'