Dog finder

A small project to find the word "dog" in PDFs with descriptions of medical experiments (on animals). The data comes from https://eagri.cz.

Dependencies

How to use it

The idea is:

  1. Convert each PDF to PNG (because the PDF files contain a picture of the text, not the text itself).
  2. Run tesseract to get the text from the image.
  3. Tokenize the text with spacy.
  4. Find the word "pes" (Czech for "dog").
  5. Write the results into analysis.csv.
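
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the code in analyse.py: the regex tokenizer is a plain stand-in for spacy, and the function names, the `*.txt` input layout, and the CSV columns (`file`, `count`) are all assumptions made for the example.

```python
import csv
import re
from pathlib import Path

# Stand-in tokenizer (assumption): runs of word characters. Python's \w is
# Unicode-aware, so Czech diacritics stay inside tokens. The project itself
# uses spacy for this step.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into word tokens."""
    return TOKEN_RE.findall(text)


def count_word(text, word="pes"):
    """Count tokens equal to `word`, ignoring case."""
    target = word.casefold()
    return sum(1 for token in tokenize(text) if token.casefold() == target)


def analyse_folder(text_dir, out_csv="analysis.csv", word="pes"):
    """Count `word` in every .txt file in `text_dir` and write a CSV.

    The input layout and column names are hypothetical.
    """
    rows = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        rows.append({"file": path.name, "count": count_word(text, word)})
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "count"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Matching whole tokens rather than substrings avoids false hits inside longer words such as "pesticid". Exact matching does miss inflected forms of the word (for example "psa" or "psi"), so a real analysis may need a list of forms or a lemmatizer.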

Get processed text

. ./convert.sh <folder_with_pdfs>

During experiments, calling tesseract directly worked better than using the Python wrapper.
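
For illustration, calling the tesseract command-line tool directly from Python (instead of going through a wrapper library) might look like the sketch below. The contents of convert.sh are not shown here, so this is an assumption about one possible approach: the function names are hypothetical, and `ces` assumes that the Czech language data for tesseract is installed.

```python
import subprocess


def build_tesseract_command(png_path, lang="ces"):
    """Build the argument list for a direct tesseract call.

    Using "stdout" as the output base makes tesseract print the recognized
    text instead of writing it to a file. "ces" is the Czech language code
    (assumes the Czech traineddata is installed).
    """
    return ["tesseract", str(png_path), "stdout", "-l", lang]


def ocr_png(png_path, lang="ces"):
    """Run tesseract on one PNG and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(png_path, lang),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on failure
    )
    return result.stdout
```

Keeping the command construction in its own function makes it easy to inspect or log the exact call without running the OCR.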

Analyse the data

python analyse.py