Dog finder
A small project to find the word "dog" in PDFs with descriptions of medical experiments (on animals). The data comes from https://eagri.cz.
Dependencies
- Python 3.6+
- sips to convert PDF to PNG
- tesseract with Czech language.
- spacy
How to use it
The idea is:
- Convert PDF to PNG (because each PDF file contains a picture of the text rather than the text itself).
- Run tesseract and get the text from the image.
- Tokenize the text with spacy.
- Find the word pes (Czech for "dog").
- Write the results into analysis.csv.
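The last three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of analyse.py: the function names, the `file,count` column layout of the CSV, and the use of a blank Czech pipeline (`spacy.blank("cs")`, assuming the installed spaCy version ships Czech language data) are all assumptions. Note that an exact match on pes does not catch inflected forms such as psa or psi.

```python
import csv

TARGET = "pes"  # Czech for "dog"


def tokenize(text):
    """Split text into tokens with a blank Czech spaCy pipeline.

    Assumption: a blank pipeline (tokenizer only, no trained model)
    is enough for this task.
    """
    import spacy  # imported lazily so the helpers below work without spaCy

    nlp = spacy.blank("cs")
    return [token.text for token in nlp(text)]


def count_word(tokens, word=TARGET):
    """Count case-insensitive exact matches of `word` among the tokens."""
    return sum(1 for token in tokens if token.lower() == word)


def write_results(rows, out_path="analysis.csv"):
    """Write (file name, count) pairs to a CSV file.

    Assumption: the column layout is hypothetical; the real
    analysis.csv may look different.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "count"])
        writer.writerows(rows)
```

For one OCR output held in a string `text`, `count_word(tokenize(text))` gives the number of matches, and `write_results` collects such counts for all files.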
Get processed text
. ./convert.sh <folder_with_pdfs>
Calling tesseract directly worked better than the Python wrapper during experimentation.
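If the direct call is ever needed from Python instead of the shell script, one option is to run the tesseract binary through `subprocess` rather than through a wrapper library. The sketch below is an assumption about how that could look, not what convert.sh does; the function names are hypothetical. It relies on the tesseract command line accepting `stdout` as the output base and `-l ces` for the Czech language data.

```python
import subprocess


def tesseract_command(image_path, lang="ces"):
    """Build the command line: tesseract <image> stdout -l <lang>."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="ces"):
    """Run the tesseract binary and return the recognised text.

    Raises subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        tesseract_command(image_path, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",  # decode output as text; works on Python 3.6+
        check=True,
    )
    return result.stdout
```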
Analyse the data
python analyse.py