This is a small, CLI-only Python program, mainly for scanning documents with one or many pages, optimizing them (black/white), combining them into a PDF and running optical character recognition (OCR). The result is a PDF file that is as small as possible, full-text searchable and ready to be put in a digital archive.
You will need the following additional software:
- python3
- sane (scanning)
- tesseract (ocr)
- stapler (pdf manipulation tool)
- python-pyparallel (parallel processing)
- imagemagick (optimization black/white)
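A quick way to check whether the external tools are installed is to look them up on PATH. This is only a sketch: the executable names are assumptions (scanimage for sane, convert for imagemagick), the helper missing_tools is hypothetical and not part of the script, and python-pyparallel is a Python module, so it is not covered here.

```python
import shutil

# Assumed executable names for the packages listed above.
REQUIRED_TOOLS = ["scanimage", "tesseract", "stapler", "convert"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All external tools found.")
```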
Execute scanimage -L (part of the sane package) in a shell. Sample output:
device `v4l:/dev/video2' is a Noname Integrated Camera: Integrated I virtual device
device `v4l:/dev/video0' is a Noname Integrated Camera: Integrated C virtual device
device `dsseries:usb:0x04F9:0x60E0' is a BROTHER DS-620 sheetfed scanner
The third one is my main scanner (dsseries ... BROTHER ...). In my case I would use dsseries as device.
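If you prefer to pick the device programmatically, the output above can be parsed with a small regular expression. The following is a minimal sketch that works on the sample output shown here; parse_devices is a hypothetical helper, and other backends may print lines in a slightly different shape.

```python
import re

# Matches lines such as: device `name' is a Some Description
DEVICE_LINE = re.compile(r"^device `(?P<name>[^']+)' is an? (?P<description>.+)$")


def parse_devices(output):
    """Return (device name, description) pairs from `scanimage -L` output."""
    devices = []
    for line in output.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match:
            devices.append((match.group("name"), match.group("description")))
    return devices


SAMPLE = """\
device `v4l:/dev/video2' is a Noname Integrated Camera: Integrated I virtual device
device `v4l:/dev/video0' is a Noname Integrated Camera: Integrated C virtual device
device `dsseries:usb:0x04F9:0x60E0' is a BROTHER DS-620 sheetfed scanner
"""

devices = parse_devices(SAMPLE)
# The part before the first colon is the backend name, e.g. dsseries.
backends = [name.split(":", 1)[0] for name, _ in devices]
print(backends)  # ['v4l', 'v4l', 'dsseries']
```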
You can use everything sane (scanimage) supports. Default is:
--mode Gray #Gray (a!) | Lineart | Color
Change this to Color if you want to scan something that is not purely black/white. Lineart is another option intended for black/white documents, but Gray works better for me.
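To illustrate how the device and the mode end up on a scanimage command line, here is a rough sketch. build_scan_command is a hypothetical helper, not the script's actual code; --device-name and --format are standard scanimage options, while --mode and its accepted values depend on the scanner backend.

```python
ALLOWED_MODES = ("Gray", "Lineart", "Color")


def build_scan_command(device, mode="Gray", image_format="tiff"):
    """Build an argument list for scanimage (the command is not executed here)."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    return [
        "scanimage",
        f"--device-name={device}",
        f"--mode={mode}",  # backend-specific option, note the spelling Gray
        f"--format={image_format}",
    ]


command = build_scan_command("dsseries", mode="Color")
print(" ".join(command))
```

The resulting list could be passed to subprocess.run with stdout redirected to a file, since scanimage writes the image data to standard output by default.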
pdf should be fine for most cases.
pnm should be fine for most cases.
Just download, make executable, set the basic configuration options (device) and execute in a shell.
to be continued
./scan2file -o Output
Will scan a single page, optimize it to black/white, run OCR and save the result as Output.pdf in the current directory.
./scan2file -o Output -mu
Will first ask for the number of pages, then scan everything, and after that optimize and OCR the whole multi-page PDF.
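The multi-page run can be pictured as planning one raw file and one optimized file per page before any scanning starts. The sketch below only computes such a plan; plan_pages and the file name pattern are assumptions for illustration, and the script itself may name its temporary files differently.

```python
def plan_pages(output_name, page_count):
    """Return per-page file names: a raw tiff scan and an optimized pnm image."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return [
        {
            "page": number,
            "raw": f"{output_name}-{number:03d}.tiff",
            "optimized": f"{output_name}-{number:03d}.pnm",
        }
        for number in range(1, page_count + 1)
    ]


for entry in plan_pages("Output", 3):
    print(entry["page"], entry["raw"], entry["optimized"])
```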
- tiff: scanned raw data
- pnm: temporary files and image type converted into the final pdf document
- djvu: disabled for the moment
- automate Gray/Color scan option depending on the --color argument
- multiple pages using ADF (feeder) scanner
- fix remaining problems (left-over temp files, etc.)
- make sure PDF/A format is used everywhere