pstanisl/find-dogs

Dog finder

A small project to find the word "dog" in PDFs with descriptions of medical experiments (on animals). The data comes from https://eagri.cz.

Dependencies

Python 3.6+
sips (the image tool built into macOS) to convert PDF to PNG
tesseract with the Czech language data (ces)
spacy

How to use it

The idea is:

Convert each PDF to PNG (the PDF contains a picture of the page, not extractable text).
Run tesseract and get the text from the image.
Tokenize the text with spacy.
Find the word "pes" (Czech for "dog").
Write the results into analysis.csv.
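
The last three steps can be sketched roughly as follows. This is an illustration under assumptions, not the code in analyse.py: a plain regular expression stands in for the spacy tokenizer so the example has no dependencies, and the function names, the one-.txt-file-per-document layout, and the CSV columns (file, matches) are all hypothetical.

```python
import csv
import re
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Stand-in for spacy tokenization; \\w matches Czech letters with diacritics.
    """
    return re.findall(r"\w+", text.lower())


def count_matches(text, targets=("pes",)):
    """Count tokens equal to one of the target words.

    Only the exact form "pes" is matched by default; inflected forms
    (for example "psa" or "psi") would have to be added to targets.
    """
    wanted = set(targets)
    return sum(1 for token in tokenize(text) if token in wanted)


def write_analysis(text_dir, csv_path="analysis.csv"):
    """Scan every .txt file in text_dir and write one CSV row per file.

    The column names are an assumption made for this sketch.
    """
    rows = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        rows.append({"file": path.name, "matches": count_matches(text)})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "matches"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Comparing whole tokens rather than searching for the substring "pes" avoids false hits inside longer words.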

Get processed text

. ./convert.sh <folder_with_pdfs>

During experiments, calling tesseract directly worked better than using a Python wrapper.
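
For illustration, the conversion step could look roughly like this when driven from Python while still invoking both command-line tools directly. The contents of convert.sh are not reproduced here, so the function names, the output layout, and the choice of flags are assumptions; sips is only available on macOS.

```python
import subprocess
from pathlib import Path


def sips_command(pdf_path, png_path):
    """Build the sips call that renders a PDF as a PNG image."""
    return ["sips", "-s", "format", "png", str(pdf_path), "--out", str(png_path)]


def tesseract_command(png_path, output_base, lang="ces"):
    """Build the tesseract call; "ces" is the Czech language code.

    tesseract appends ".txt" to output_base on its own.
    """
    return ["tesseract", str(png_path), str(output_base), "-l", lang]


def convert_folder(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir to PNG and OCR it into out_dir.

    Hypothetical layout: <out_dir>/<name>.png and <out_dir>/<name>.txt.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        png = out / (pdf.stem + ".png")
        # check=True raises CalledProcessError if either tool fails.
        subprocess.run(sips_command(pdf, png), check=True)
        subprocess.run(tesseract_command(png, out / pdf.stem), check=True)
```

Building the argument lists in separate functions keeps the commands easy to inspect without running the tools.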

Analyse the data

python analyse.py