/papertiger

Paper Tiger, imported from Bitbucket repo (Mercurial)

Primary LanguagePascal

Paper Tiger
===========
Scanning, text recognition and archiving of paper documents... 
with GUI clients but from the command line if necessary 

The Paper Tiger code has a liberal MIT license. 
It uses various other open-source programs.

Functionality
=============
Architecture/functionality:
- scanning documents using sane into TIFF documents
- OCR/text recognition using Tesseract
- storage of documents as PDF file (image file+OCR text) e.g. on a Samba share
- index of documents+notes+full text in Firebird database
- server component written in FreePascal, so no X Windows required.
- command line control on server
- CGI REST server component
- Viewer/scanner GUI written in Lazarus+FreePascal
- Initial support for WIA/TWAIN on Windows to support scanning from desktop


Further possible refinements:
- support for other databases (sqlite, PostgreSQL, MS SQL Server)
- using image cleanup tools such as scantailor and unpaper
- write .deb install pacakge for easy installation on Debian servers
- batch import of images/pdfs

Architecture and development principles
- use other people's work if possible - the Unix way...
- if possible, build using modules: 
  e.g. allow use of multiple OCR engines etc
- store OCR text in the PDF, and store the image tiff.
  This enables external tools to work with the PDFs,
  use the PDFs in other applications etc.
- save all OCR text in database or file 
  (e.g. a Lucene index) in order to allow fast search across all documents
- this means synchronizing PDF text with the full text archive may be required
- develop towards a single point of control: 
  tigerservercore, which may speak multiple protocols, e.g. via plugins
- however, use standard methods of storing data (e.g. full text search 
  components), normalized database schema in order to allow programs/tools that 
	don't speak the protocols mentioned above to get data easily
- these 2 principles clash; the code will need to stabilize until it is wise to 
  directly try to access e.g. the database.
	Even then, breaking changes will not be avoided if e.g. cleanness of design 
	would be compromised

Compilation instructions
========================
FPC 2.7.1/trunk is preferred for the server/CGI programs.
At least FPC 2.6.2 fpweb does not accept the DELETE method.
For the client program, Lazarus trunk has been used for development.

1. Compile hgversion.pas, e.g.:
fpc hgversion.pas

2. Compile the program(s) you want
2.1 With Lazarus:
lazbuild tigercgi.lpi
lazbuild tigerclient.lpi
lazbuild tigerserver.lpi
2.2 With FreePascal:
- Run hgversion first to update the version info
fpc -dCGI tigercgi.lpr
fpc tigerserver.lpr

Installation instructions
=========================
- prerequisites: Linux/*nix (virtual) machine. Windows support may come later.
- prerequisites: have sane installed and configured for your scanner. E.g.:
  aptitude install sane-utils
- prerequisites: have tesseract installed and configured. E.g.:
  aptitude install tesseract-ocr tesseract-ocr-eng #for English language support
  Note: we need version 3 because of hOCR support needed for getting searchable 
	PDFs.
- prerequisites: have exactimage installed (for hocr2pdf), e.g.:
  aptitude install exactimage
- Tesseract must/can then be configured to output hocr, e.g.:
  check you have this file present (adjust config directory to your situation):
  cat /usr/local/share/tessdata/configs/hocr
  If not (again, adjust config file location to your situation):
  cat >> /usr/local/share/tessdata/configs/hocr << "EOF_DOCUMENT"
  tessedit_create_hocr 1
  EOF_DOCUMENT
- prerequisites: have pdftk installed (for concatenating pdfs), e.g.:
  aptitude install pdftk
- nice to have: have scantailor installed (for aligning/cleaning up the tiff 
  images before OCR).
  see installation notes below
	
  
Installing the command line server:
- copy hocrwrap.sh to server directory (e.g. /opt/tigerserver/)
- copy scanwrap.sh to server directory
- copy tigerserver to server directory
- go to the server directory and make files executable, e.g. (replace 
  directory with your own if necessary):
  cd /opt/tigerserver/
  chmod u+rx hocrwrap.sh
  chmod u+rx scanwrap.sh
  chmod u+rx tigerserver
- copy tigerserver.ini.template to tigerserver.ini and edit settings to match 
  your environment

Test by running ./tigerserver --help

Installing the cgi application:
- prerequisites: apache2 or another HTTP server that supports cgi
  aptitude install apache2
-	copy tigercgi to cgi directory (e.g. /usr/lib/cgi-bin). 
	Make sure the user Apache runs under may read and execute the file (e.g. 
	chmod ugo+rx tigercgi)
- copy hocrwrap.sh to cgi directory (e.g. /usr/lib/cgi-bin/)
- copy scanwrap.sh to cgi directory
- copy tigercgi to cgi directory
- copy tigerserver.ini.template to tigerserver.ini in the cgi directory and edit
  settings to match your environment
- go to the cgi directory and make files executable for the apache/www user, 
  e.g. (replace directory with your own if necessary):
  cd /usr/lib/cgi-bin/
	# replace user/groups below with correct user/group if needed, e.g. apache2
  chown www-data:www-data hocrwrap.sh 
  chown www-data:www-data scanwrap.sh
  chown www-data:www-data tigercgi
  chown www-data:www-data tigerserver.ini
	# make scripts executable:
  chmod u+rx hocrwrap.sh
  chmod u+rx scanwrap.sh
  chmod u+rx tigercgi
  chmod u+r  tigerserver.ini
  
Installing the client:
- prerequisites: *nix: imagemagick dev libraries installed: e.g. 
  aptitude install imagemagick
- prerequisites: Windows: imagemagick DLLs e.g. Q16 x86 or x64 (depending on 
  papertiger client bitness) version downloaded from 
	http://www.imagemagick.org/script/binary-releases.php in client directory or 
	in path
- compilation without imagemagick is possible (see source code for compiler 
  define) but the program will be much slower
- copy tigerclient.ini.template to tigerclient.ini and edit settings to match 
  your environment


Building Tesseract 3
====================
If tesseract 3 is not available for your platform, you will need to build it.  
Preliminary notes for building Tesseract 3 on Debian aqueeze
sources:
http://ubuntuforums.org/showthread.php?t=1647350
aptitude install build-essential leptonica libleptonica-dev libpng-dev 
libjpeg-dev libtiff-dev zlib1g-dev

# as root:
cd ~
wget https://tesseract-ocr.googlecode.com/files/tesseract-3.01.tar.gz
tar -zxvf tesseract-3.01.tar.gz
cd tesseract-3.01
./runautoconf
./configure
make
checkinstall #follow the prompts and type "y" to create documentation directory.
# Enter a brief description then press enter twice
ldconfig

#language/training data, e.g. for Dutch and English:
#todo: check dir
cd /usr/local/share/tessdata
wget https://tesseract-ocr.googlecode.com/files/nld.traineddata.gz
gunzip nld.traineddata.gz
wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.01.eng.tar.gz
gunzip tesseract-ocr-3.01.eng.tar.gz

Building scantailor from source
===============================
Scantailor is being developed; we use the scantailor enhanced fork.
http://sourceforge.net/projects/scantailor/files/scantailor-devel/enhanced/
Build instructions:
https://sourceforge.net/apps/mediawiki/scantailor/index.php?title=Building_from_source_code_on_Linux_and_Mac_OS_X

Notes for Debian below.

# get compilers and dependencies
aptitude install build-essential cmake libqt4-dev libjpeg-dev zlib1g-dev \
libpng-dev libtiff-dev libtiff5-alt-dev  libboost-dev libxrender-dev \
#libtiff5-alt-dev for good measure; hope it improves tiff support

Get source from git repository:
cd ~
git clone git://git.code.sf.net/p/scantailor/code scantailor
cd scantailor
git checkout enhanced #check out branch called "enhanced"
cmake .
make
su - #switch to root
cd /home/pascaldev/scantailor #or wherever the files are located
make install
exit #out of root

Miscellaneous notes
===================
Getting PDF viewers to open a certain page:
Adobe Acrobat Reader
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_open_parameters.pdf
acrobat.exe /A "page=<pageNo>"
could also use "nameddest=<named destination>"

SumatraPDF
https://code.google.com/p/sumatrapdf/wiki/CommandLineArguments
sumatrapdf -reuse-instance -page <pageNo>
Scrolls the first indicated file to the indicated page.
Tells an already open SumatraPDF to load the indicated files. If there are 
several running instances, behaviour is undefined.

ImageMagick DLLs on Windows
===========================
The following dlls seem sufficient for converting TIFF images for the client - 
I just copied all dlls:
CORE_RL_bzlib_.dll
CORE_RL_jbig_.dll
CORE_RL_jp2_.dll
CORE_RL_jpeg_.dll
CORE_RL_lcms_.dll
CORE_RL_libxml_.dll
CORE_RL_Magick++_.dll
CORE_RL_magick_.dll
CORE_RL_png_.dll
CORE_RL_tiff_.dll
CORE_RL_ttf_.dll
CORE_RL_wand_.dll
CORE_RL_xlib_.dll
CORE_RL_zlib_.dll
X11.dll
Xext.dll

In modules\coders (just copied all dlls)
IM_MOD_RL_aai_.dll
IM_MOD_RL_art_.dll
IM_MOD_RL_avs_.dll
IM_MOD_RL_bgr_.dll
IM_MOD_RL_bmp_.dll
IM_MOD_RL_braille_.dll
IM_MOD_RL_cals_.dll
IM_MOD_RL_caption_.dll
IM_MOD_RL_cin_.dll
IM_MOD_RL_cip_.dll
IM_MOD_RL_clipboard_.dll
IM_MOD_RL_clip_.dll
IM_MOD_RL_cmyk_.dll
IM_MOD_RL_cut_.dll
IM_MOD_RL_dcm_.dll
IM_MOD_RL_dds_.dll
IM_MOD_RL_debug_.dll
IM_MOD_RL_dib_.dll
IM_MOD_RL_djvu_.dll
IM_MOD_RL_dng_.dll
IM_MOD_RL_dot_.dll
IM_MOD_RL_dps_.dll
IM_MOD_RL_dpx_.dll
IM_MOD_RL_emf_.dll
IM_MOD_RL_ept_.dll
IM_MOD_RL_exr_.dll
IM_MOD_RL_fax_.dll
IM_MOD_RL_fd_.dll
IM_MOD_RL_fits_.dll
IM_MOD_RL_fpx_.dll
IM_MOD_RL_gif_.dll
IM_MOD_RL_gradient_.dll
IM_MOD_RL_gray_.dll
IM_MOD_RL_hald_.dll
IM_MOD_RL_hdr_.dll
IM_MOD_RL_histogram_.dll
IM_MOD_RL_hrz_.dll
IM_MOD_RL_html_.dll
IM_MOD_RL_icon_.dll
IM_MOD_RL_info_.dll
IM_MOD_RL_inline_.dll
IM_MOD_RL_ipl_.dll
IM_MOD_RL_jbig_.dll
IM_MOD_RL_jnx_.dll
IM_MOD_RL_jp2_.dll
IM_MOD_RL_jpeg_.dll
IM_MOD_RL_label_.dll
IM_MOD_RL_mac_.dll
IM_MOD_RL_magick_.dll
IM_MOD_RL_map_.dll
IM_MOD_RL_matte_.dll
IM_MOD_RL_mat_.dll
IM_MOD_RL_meta_.dll
IM_MOD_RL_miff_.dll
IM_MOD_RL_mono_.dll
IM_MOD_RL_mpc_.dll
IM_MOD_RL_mpeg_.dll
IM_MOD_RL_mpr_.dll
IM_MOD_RL_msl_.dll
IM_MOD_RL_mtv_.dll
IM_MOD_RL_mvg_.dll
IM_MOD_RL_null_.dll
IM_MOD_RL_otb_.dll
IM_MOD_RL_palm_.dll
IM_MOD_RL_pango_.dll
IM_MOD_RL_pattern_.dll
IM_MOD_RL_pcd_.dll
IM_MOD_RL_pcl_.dll
IM_MOD_RL_pcx_.dll
IM_MOD_RL_pdb_.dll
IM_MOD_RL_pdf_.dll
IM_MOD_RL_pes_.dll
IM_MOD_RL_pict_.dll
IM_MOD_RL_pix_.dll
IM_MOD_RL_plasma_.dll
IM_MOD_RL_png_.dll
IM_MOD_RL_pnm_.dll
IM_MOD_RL_preview_.dll
IM_MOD_RL_ps2_.dll
IM_MOD_RL_ps3_.dll
IM_MOD_RL_psd_.dll
IM_MOD_RL_ps_.dll
IM_MOD_RL_pwp_.dll
IM_MOD_RL_raw_.dll
IM_MOD_RL_rgb_.dll
IM_MOD_RL_rla_.dll
IM_MOD_RL_rle_.dll
IM_MOD_RL_scr_.dll
IM_MOD_RL_sct_.dll
IM_MOD_RL_sfw_.dll
IM_MOD_RL_sgi_.dll
IM_MOD_RL_stegano_.dll
IM_MOD_RL_sun_.dll
IM_MOD_RL_svg_.dll
IM_MOD_RL_tga_.dll
IM_MOD_RL_thumbnail_.dll
IM_MOD_RL_tiff_.dll
IM_MOD_RL_tile_.dll
IM_MOD_RL_tim_.dll
IM_MOD_RL_ttf_.dll
IM_MOD_RL_txt_.dll
IM_MOD_RL_uil_.dll
IM_MOD_RL_url_.dll
IM_MOD_RL_uyvy_.dll
IM_MOD_RL_vicar_.dll
IM_MOD_RL_vid_.dll
IM_MOD_RL_viff_.dll
IM_MOD_RL_wbmp_.dll
IM_MOD_RL_webp_.dll
IM_MOD_RL_wmf_.dll
IM_MOD_RL_wpg_.dll
IM_MOD_RL_xbm_.dll
IM_MOD_RL_xcf_.dll
IM_MOD_RL_xc_.dll
IM_MOD_RL_xpm_.dll
IM_MOD_RL_xps_.dll
IM_MOD_RL_xtrn_.dll
IM_MOD_RL_xwd_.dll
IM_MOD_RL_x_.dll
IM_MOD_RL_ycbcr_.dll
IM_MOD_RL_yuv_.dll

in modules\filters
analyze.dll