captchaSolveR is an open-source package for tackling the problem of captcha codes. There are, in principle, two ways to crack a captcha. The first is to use machine learning and image recognition. The second is to use a mechanical turk (i.e., real humans). This package contains helper functions for both methods.
Installing directly from GitHub requires some helper packages and Tesseract. The easiest way to set up captchaSolveR is to source the following «init.R» script and then install and load the package with the packagesGithub function. To read and solve captcha images, captchaSolveR relies heavily on the «EBImage» package and also uses some functions from «systemR» and «ocR». Running the following lines of code installs and loads captchaSolveR:
source("https://rawgit.com/greenore/initR/master/init.R")
packagesBioconductor("EBImage")
packagesGithub(c("systemR", "ocR", "captchaSolveR"), repo_name="greenore")
# Alternative: fetch init.R with wget (requires wget on the PATH), then install as above
source(pipe(paste("wget -O -", "https://rawgit.com/greenore/initR/master/init.R")))
packagesBioconductor("EBImage")
packagesGithub(c("systemR", "ocR", "captchaSolveR"), repo_name="greenore")
Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to find it. Packages are also generally available for language training data (search the repositories). If not, you will need to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
# On Ubuntu, the following commands install Tesseract
sudo apt-get update
sudo apt-get install tesseract-ocr
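If no language package exists for your distribution and you download the training data by hand, the copy step described above could look roughly like the sketch below. The function names find_tessdata_dir and install_traineddata are hypothetical helpers written for this illustration, and the two candidate directories are only the likely locations named above; your system may use a different path.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of Tesseract or captchaSolveR.

# Print the first candidate directory that exists; fail if none does.
find_tessdata_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Copy one .traineddata file into the given tessdata directory.
install_traineddata() {
    src=$1
    dest=$2
    case "$src" in
        *.traineddata) ;;
        *) echo "not a .traineddata file: $src" >&2; return 1 ;;
    esac
    [ -f "$src" ] || { echo "file not found: $src" >&2; return 1; }
    [ -d "$dest" ] || { echo "directory not found: $dest" >&2; return 1; }
    cp "$src" "$dest/"
}

# Example call (deu.traineddata is a placeholder file name):
#   dest=$(find_tessdata_dir /usr/share/tesseract-ocr/tessdata /usr/share/tessdata) \
#       && install_traineddata deu.traineddata "$dest"
```

Because both candidate directories are normally owned by root, the example call would usually have to be run from a root shell.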
On macOS, the easiest way to install Tesseract is with MacPorts. Once MacPorts is installed, you can install Tesseract by running the command sudo port install tesseract, and any language with sudo port install tesseract-<langcode>. A list of available langcodes can be found on the MacPorts tesseract page.
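To make the naming pattern for language ports concrete, here is a small sketch that only prints the MacPorts commands for a set of language codes rather than running them. The helper name macports_install_cmds is made up for this example, and eng and deu are used purely as sample codes; check the MacPorts tesseract page for the codes that actually exist.

```shell
#!/bin/sh
# Sketch only: prints the commands, does not execute them.
# macports_install_cmds is a hypothetical helper for illustration.
macports_install_cmds() {
    echo "sudo port install tesseract"
    for lang in "$@"; do
        # Language data ports follow the pattern tesseract-<langcode>.
        echo "sudo port install tesseract-$lang"
    done
}

# Example call:
#   macports_install_cmds eng deu
```

Printing the commands first lets you review them before pasting them into a terminal.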
On Windows, an installer is available on the Tesseract-OCR project page. It includes the English training data.
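Whichever route you used, it can be worth checking that the tesseract executable is reachable before installing captchaSolveR. The sketch below assumes a POSIX shell; check_tesseract is a hypothetical helper name, while --version and --list-langs are standard Tesseract command-line options (the latter is missing from very old releases).

```shell
#!/bin/sh
# Sketch only: check_tesseract is a hypothetical helper for illustration.
check_tesseract() {
    # Fail early if the executable is not on the PATH.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # Some releases print version info to stderr, so merge both streams.
    tesseract --version 2>&1 | head -n 1
    # List the language data Tesseract can currently see.
    tesseract --list-langs 2>&1
}

# Example call:
#   check_tesseract || echo "please install Tesseract first"
```

If the language list does not contain the language you need, the training data is most likely not in the tessdata directory that Tesseract is using.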