Primary language: Shell. License: GNU General Public License v3.0 (GPL-3.0).

documentscanner

documentscanner allows you to transform (almost) any ADF scanner into a document scanner that produces OCRed PDFs. All you need is:

  • a SANE-compatible ADF scanner
  • a Raspberry Pi
  • (optional) a more powerful host to run the OCR tasks

Setup instructions

  1. Clone documentscanner onto a Raspberry Pi (the scanbd script expects it in /home/pi/documentscanner): $ git clone https://github.com/BastianPoe/documentscanner.git ; cd documentscanner
  2. Install SANE and the other dependencies: $ apt-get install sane sane-utils bash unpaper tesseract-ocr tesseract-ocr-deu imagemagick bc poppler-utils findutils scanbd
  3. Install scanbd script: $ mkdir -p /etc/scanbd/scripts ; cp scanbd/test.script /etc/scanbd/scripts/
  4. Enable scanbd: $ systemctl enable scanbd
  5. Restart scanbd: $ systemctl restart scanbd
  6. Create inbox and outbox: $ mkdir -p /inbox /outbox
  7. Start document processor: $ cd scripts ; ./process.sh /inbox /outbox
  8. Done

What if it does not work

  1. Check whether SANE recognizes your scanner: $ scanimage -L
  2. Check the logs of scanbd: $ journalctl -f. You should see log output whenever you press a button on the scanner
  3. Adjust the events that scanbd reacts to in /etc/scanbd/scripts/test.script (currently: scan and email)
  4. Check if scanned raw documents end up in /inbox
  5. Check the log output of the processor
  6. Check if PDFs end up in /outbox
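
The checks above can be bundled into a small helper. This is only a sketch: the function names `report`, `count_entries` and `run_checks` are hypothetical, and it assumes that scanbd runs as a systemd service (as in the setup steps) and that `scanimage -L` prints one line starting with `device` per detected scanner.

```shell
#!/usr/bin/env bash
# Hypothetical troubleshooting helper, not part of documentscanner.

report() {  # report <label> <command...>: run a check and print OK or FAIL
    local label="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
    fi
}

count_entries() {  # count_entries <dir>: number of entries directly inside <dir>
    find "$1" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l | tr -d ' '
}

run_checks() {  # run_checks [inbox] [outbox]: defaults match this README
    local inbox="${1:-/inbox}" outbox="${2:-/outbox}"
    report "scanimage is installed"         command -v scanimage
    report "SANE lists at least one device" sh -c 'scanimage -L | grep -q "^device"'
    report "scanbd service is active"       systemctl is-active --quiet scanbd
    report "inbox exists"                   test -d "$inbox"
    report "outbox exists"                  test -d "$outbox"
    echo "entries in $inbox: $(count_entries "$inbox")"
    echo "entries in $outbox: $(count_entries "$outbox")"
}
```

Source the file and call `run_checks` (or `run_checks /misc/documentarchive/scans_raw /outbox` if you moved the inbox). A FAIL line points to the step in the list above that deserves a closer look.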

How it works

Scanning

documentscanner uses scanbd to wait for a button press on the scanner. A button press triggers /etc/scanbd/scripts/test.script, which determines which button was pressed and calls /home/pi/documentscanner/scripts/scan.sh. That script scans all available pages into a new folder in /inbox. Once the scan has finished, it places a file named complete in the scan folder.
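
The handshake between scanning and processing can be sketched as follows. This is not the actual scan.sh: the function name `scan_document`, the folder naming scheme and the default scanner options are assumptions. In particular, `--source` and `--resolution` are backend-specific scanimage options whose accepted values differ between devices, and treating a non-zero exit status as a failed scan should be verified with your scanner, since scanimage may also report a non-zero status when the feeder simply runs empty.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the scan step; the real logic lives in scripts/scan.sh.

scan_document() {  # scan_document <inbox> [scanner command...]
    local inbox="$1"; shift
    local -a scanner=("$@")
    local folder

    # Assumed default; adapt the backend-specific options to your device.
    if [ "${#scanner[@]}" -eq 0 ]; then
        scanner=(scanimage --source ADF --resolution 300 --format=pnm)
    fi

    # One folder per document (naming scheme is an assumption).
    folder="$inbox/scan_$(date +%Y%m%d_%H%M%S)"
    mkdir -p "$folder" || return 1

    # --batch writes one file per page; %04d is replaced by the page number.
    if "${scanner[@]}" --batch="$folder/page_%04d.pnm"; then
        # Only a finished scan receives the flag the processor waits for.
        touch "$folder/complete"
    else
        echo "scan failed, leaving $folder without the complete flag" >&2
        return 1
    fi
    echo "$folder"
}
```

Calling `scan_document /inbox` scans with the assumed defaults; any other scanner command line can be passed after the inbox path. Writing the flag last means the processor never sees a folder that is still being filled.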

PDF conversion

The processor checks /inbox every 10 s. When it finds a new document that carries the complete flag, it processes it in four steps. First, identify (from ImageMagick) is used with a heuristic to detect and remove empty pages. Next, each remaining page is cleaned up with unpaper, which removes the background, among other things. The pages are then OCRed with tesseract and converted to PDFs. Finally, pdfunite joins the individual PDFs into one and the scan directory is deleted.
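
The pipeline described above could look roughly like this. It is a sketch under several assumptions and not the code of process.sh: the function names, the `.pnm` page format, the `work` subfolder and the blank-page heuristic (a page whose normalized mean brightness exceeds 0.99 counts as empty) are all made up for illustration, and `-l deu` is only inferred from the tesseract-ocr-deu package in the dependency list.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the processor; see scripts/process.sh for the real one.

find_complete_scans() {  # print every scan folder in <inbox> that has the flag
    local inbox="$1" flag
    for flag in "$inbox"/*/complete; do
        [ -e "$flag" ] && dirname "$flag"
    done
    return 0
}

mean_is_blank() {  # succeed if a normalized mean (0..1) exceeds the threshold
    awk -v mean="$1" -v limit="${2:-0.99}" 'BEGIN { exit !(mean > limit) }'
}

is_blank_page() {  # assumed heuristic: an almost entirely white page is empty
    mean_is_blank "$(identify -format '%[fx:mean]' "$1")"
}

process_scan() {  # process_scan <scan_dir> <outbox>
    local scan_dir="$1" outbox="$2" work="$1/work" page name target
    local -a pdfs=()
    rm -rf "$work"
    mkdir -p "$work" || return 1
    for page in "$scan_dir"/*.pnm; do
        [ -e "$page" ] || continue
        is_blank_page "$page" && continue                 # drop empty pages
        name="$(basename "$page" .pnm)"
        unpaper "$page" "$work/$name.pnm" || return 1     # clean the background
        tesseract "$work/$name.pnm" "$work/$name" -l deu pdf || return 1  # OCR
        pdfs+=("$work/$name.pdf")
    done
    if [ "${#pdfs[@]}" -eq 0 ]; then
        echo "no non-empty pages in $scan_dir" >&2
        return 0
    fi
    target="$outbox/$(basename "$scan_dir").pdf"
    if [ "${#pdfs[@]}" -eq 1 ]; then
        cp "${pdfs[0]}" "$target"                         # nothing to join
    else
        pdfunite "${pdfs[@]}" "$target"                   # join pages in order
    fi
}

watch_inbox() {  # poll <inbox> every 10 seconds and convert finished scans
    local inbox="$1" outbox="$2" scan_dir
    while true; do
        find_complete_scans "$inbox" | while IFS= read -r scan_dir; do
            process_scan "$scan_dir" "$outbox" && rm -rf "$scan_dir"
        done
        sleep 10
    done
}
```

`watch_inbox /inbox /outbox` would start the loop. A scan folder is deleted only after `process_scan` succeeds, so a failed conversion leaves the raw pages in place for another attempt.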

Maintenance required

Incomplete scans (e.g. those where the ADF pulled in multiple pages at once) are aborted and never receive the complete flag, so the processor ignores them. Check /inbox from time to time for such leftover documents and delete them.
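
Finding the leftovers can be scripted. The sketch below assumes that every scan lives in its own folder directly under the inbox and that a folder untouched for a while (60 minutes by default, an arbitrary choice) is no longer being written to. The function name `list_incomplete_scans` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical maintenance helper: lists scan folders without the complete flag.

list_incomplete_scans() {  # list_incomplete_scans <inbox> [min_age_minutes]
    local inbox="$1" min_age="${2:-60}" dir
    # Only consider folders last modified more than <min_age> minutes ago,
    # so a scan that is still in progress is not reported.
    find "$inbox" -mindepth 1 -maxdepth 1 -type d -mmin "+$min_age" |
        while IFS= read -r dir; do
            [ -e "$dir/complete" ] || echo "$dir"
        done
}
```

`list_incomplete_scans /inbox` only prints the candidates. Review the list and remove a folder manually with `rm -r` once you are sure it is not needed anymore.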

(Optional) Speed up PDF generation

I run the processor in a Docker container on my Synology NAS. This is much faster than running it on the Raspberry Pi and does not slow down subsequent scans. The setup is straightforward:

  1. Create a new shared directory on your NAS and expose it via NFS to your Raspberry Pi
  2. Install autofs: $ apt-get install autofs
  3. Add the NFS mount to /etc/auto.misc: documentarchive -rw,soft,intr,rsize=8192,wsize=8192 192.168.1.26:/volume1/documentarchive
  4. Enable auto.misc by adding the following line to /etc/auto.master: /misc /etc/auto.misc (then restart autofs: $ systemctl restart autofs)
  5. Edit your /etc/scanbd/scripts/test.script to place scans on the NFS share, e.g. FOLDER="/misc/documentarchive/scans_raw"
  6. Pull bastianpoe/document_archive into the Docker Station on your NAS
  7. Map the container's /inbox onto the folder on the NFS share that receives the raw scans and /outbox onto the folder where the PDFs should be stored
  8. Start the docker container
  9. Done
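
For reference, the autofs and Docker steps above could be scripted roughly as follows. This is a sketch: `ensure_line`, `setup_autofs` and `start_processor` are hypothetical helpers, the IP address and export path are the example values from step 3, and the host folders passed to `docker run` (including the `pdfs` folder) are assumptions that depend on your NAS layout.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the optional NAS setup; adapt all paths and addresses.

ensure_line() {  # ensure_line <file> <line>: append <line> unless already present
    local file="$1" line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

setup_autofs() {  # run as root on the Raspberry Pi
    ensure_line /etc/auto.misc \
        'documentarchive -rw,soft,intr,rsize=8192,wsize=8192 192.168.1.26:/volume1/documentarchive'
    ensure_line /etc/auto.master '/misc /etc/auto.misc'
    systemctl restart autofs
    # Accessing the path triggers the automount; a listing confirms it works.
    ls /misc/documentarchive
}

start_processor() {  # run on the NAS; the host folders are assumptions
    docker run -d --name documentscanner \
        -v /volume1/documentarchive/scans_raw:/inbox \
        -v /volume1/documentarchive/pdfs:/outbox \
        bastianpoe/document_archive
}
```

`ensure_line` makes the setup safe to repeat, because a line that already exists is not added a second time. On a Synology NAS the volume mappings in `start_processor` correspond to the folder mappings you would otherwise enter in the Docker user interface.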