Primary language: Python. License: Apache License 2.0 (Apache-2.0).

deepdive

init.sh

Use the init.sh script to quickly set up a Condor-ready submission area. To use it:

mkdir job_submission
cd job_submission
git clone https://github.com/iross/deepdive
# Create a shared/ folder that contains the URLs and the executable
cd deepdive
sh init.sh

This creates job_submission/ChtcRun with the additional job-creation scripts in place.

OCR

Dealing with different file types

We currently make the following basic assumptions about the different file types:

  • PDF: one article == one PDF, which has multiple pages
  • TIFF: one article == one folder, which contains multiple TIFF files, one TIFF == one page
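
The two assumptions above can be sketched as a small directory scan. This is only an illustration, not the project's own code: the function name list_articles, the flat input-directory layout, and the accepted suffixes (.pdf, .tif, .tiff) are all assumptions made for the example.

```python
from pathlib import Path

# Assumed suffixes; the README only says "PDF" and "TIFF".
PDF_SUFFIXES = {".pdf"}
TIFF_SUFFIXES = {".tif", ".tiff"}


def list_articles(root):
    """Map each article name to the ordered list of files that hold its pages.

    Hypothetical helper: a PDF file is one article, and a folder that
    contains TIFF files is one article with one TIFF per page.
    """
    articles = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_file() and entry.suffix.lower() in PDF_SUFFIXES:
            # One PDF == one article; its pages live inside the file.
            articles[entry.stem] = [entry]
        elif entry.is_dir():
            # One folder == one article; each TIFF inside is one page.
            pages = sorted(
                p for p in entry.iterdir()
                if p.is_file() and p.suffix.lower() in TIFF_SUFFIXES
            )
            if pages:
                articles[entry.name] = pages
    return articles
```

Page order here is plain filename order, so TIFF pages would need zero-padded names (001.tif, 002.tif, ...) to sort correctly. A PDF and a folder that share a name would collide in this sketch.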

Reference

http://tfischernet.wordpress.com/2008/11/26/searchable-pdfs-with-linux/

Known Issues

Let Cuneiform accept TIFF as its input

To accept TIFF input, Cuneiform has to be compiled with ImageMagick++.

The simplest solution is apt-get install libmagick++-dev libmagick++1. Otherwise, download ImageMagick and compile it first.

There is a bug where cmake cannot find ImageMagick after it has been compiled and installed (assuming it was configured with ./configure --prefix=$HOME/local). One workaround is to edit cuneiform-linux-1.1.0/builddir/CMakeCache.txt by hand (for example with vim) and set the following entries, adjusting the paths to your own install prefix:

//Path to the ImageMagick include dir.
ImageMagick_Magick++_INCLUDE_DIR:PATH=/u/z/h/zhaoyu/local/include/ImageMagick-6/

//Path to the ImageMagick Magick++ library.
ImageMagick_Magick++_LIBRARY:FILEPATH=/u/z/h/zhaoyu/local/lib/libMagick++-6.Q16.so
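
The manual edit above could also be scripted. The following is a minimal sketch, assuming the cache uses the usual KEY:TYPE=VALUE line format shown in the two entries; the helper name set_cache_entries is hypothetical, and the script only rewrites entries that already exist in the file.

```python
def set_cache_entries(cache_text, entries):
    """Rewrite the values of existing KEY:TYPE=VALUE lines in CMakeCache text.

    `entries` maps cache keys to new values. Returns the new text and the
    list of keys that were not found (those are left for manual handling,
    because their TYPE is not known here).
    """
    remaining = dict(entries)
    out = []
    for line in cache_text.splitlines():
        head, equals, _old_value = line.partition("=")
        name, colon, type_ = head.partition(":")
        if equals and colon and name in remaining:
            # Keep the original TYPE, replace only the value.
            line = f"{name}:{type_}={remaining.pop(name)}"
        out.append(line)
    return "\n".join(out) + "\n", list(remaining)
```

A possible use, assuming ImageMagick was installed under $HOME/local as described above: read builddir/CMakeCache.txt, call set_cache_entries with ImageMagick_Magick++_INCLUDE_DIR and ImageMagick_Magick++_LIBRARY pointing into that prefix, write the result back, and check that the returned list of missing keys is empty.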

USGS

  • If the target web page does not contain any links to files ending in .pdf, we simply abort the task, which simplifies our work
  • If the target is an HTML page rather than a PDF, we ignore it
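
The first rule can be sketched as a link filter. The project lists beautifulsoup4 as a requirement; to stay dependency-free, this illustration uses the standard library's html.parser instead, so it is a stand-in rather than the project's code. The function name pdf_links, the case-insensitive suffix match, and ignoring the query string are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links(html, base_url):
    """Return absolute URLs of links whose path ends in .pdf.

    An empty result means the page has no PDF links and the task
    would be aborted under the rule above.
    """
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Compare the URL path only, so "?download=1" does not hide a PDF.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            links.append(absolute)
    return links
```

A caller would abort when pdf_links returns an empty list. Links to HTML pages are never returned, which covers the common case of the second rule; a URL that ends in .pdf but actually serves HTML would still need a check on the downloaded content.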

Requirements

  • beautifulsoup4 is required
  • pypdf2 is used to check PDF integrity
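
For illustration only, here is a much weaker, dependency-free stand-in for an integrity check. It does not use pypdf2 and does not parse the file; it only looks for the %PDF- header near the start and the %%EOF marker near the end. The function name and the 1024-byte windows are assumptions made for this sketch.

```python
def looks_like_complete_pdf(data):
    """Cheap sanity check on downloaded bytes.

    True only if a PDF header appears near the start and an end-of-file
    marker appears near the end. A truncated download or an HTML error
    page saved as .pdf fails this check; a corrupt but complete-looking
    PDF can still pass it.
    """
    head = data[:1024]
    tail = data[-1024:]
    return b"%PDF-" in head and b"%%EOF" in tail
```

A file that passes this check may still be damaged internally, so a real parse (which is what pypdf2 is listed for) remains the stronger test.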

tools

This folder contains convenience scripts that are not part of deepdive itself.