deepdive
init.sh
Use the init.sh script to quickly set up a condor-ready submission area. To use,
mkdir job_submission
cd job_submission
git clone https://github.com/iross/deepdive
# CREATE shared/ folder that contains URLS, executable
cd deepdive
sh init.sh
It will create a job_submission/ChtcRun with additional job creation scripts in place.
OCR
Dealing with different file types
Currently, we made these basic assumptions when dealing with different files
- PDF: one article == one PDF, which has multiple pages
- TIFF: one article == one folder, which contains multiple TIFF files, one TIFF == one page
Reference
http://tfischernet.wordpress.com/2008/11/26/searchable-pdfs-with-linux/
Known Issues
Let Cuneiform accept TIFF as its input
You have to compile cuneiform with ImageMagick++
The simplest solution is apt-get install libmagick++-dev libmagick++1
Otherwiese you should download ImageMagick and compile it firstly
There is a bug that cmake could not find ImageMagick after the compilation and
installation. (Assuming compile it with ./configure --prefix=$HOME/local
)
One trick hack is to violently modify $vim cuneiform-linux-1.1.0/builddir/CMakeCache.txt
//Path to the ImageMagick include dir.
ImageMagick_Magick++_INCLUDE_DIR:PATH=/u/z/h/zhaoyu/local/include/ImageMagick-6/
//Path to the ImageMagick Magick++ library.
ImageMagick_Magick++_LIBRARY:FILEPATH=/u/z/h/zhaoyu/local/lib/libMagick++-6.Q16.so
USGS
- If the target web page does not contain any files end with .pdf, we simply abort this task which would simplify our work
- If the target web page is a HTML rather than a PDF, we ignore
Requirements
- beautifulsoup4 is a must
- pypdf2 is for checking the PDF integrity
tools
This folder contains some convenient scripts which do not belong to deep-dive