aocr (azure OCR)

Swiftly add ocr layers to scanned pdf files.

Unfortunately, existing open source ocr solutions such as tesseract pale in comparison with the commercially available ones. The azure read api provides particularly good results and is easy to set up. However, while it can annotate text in images, it offers no easy way to upload and ocr a full pdf document.

That is, until now. aocr provides an easy way to ocr full pdf documents.

usage

aocr can be used in two main ways: it can be called from a shell as an ELF binary on linux, or it can serve as a java library for reuse in other projects.

API

maven

Add the following snippet to your dependencies:

<dependency>
  <groupId>de.niklasfi.aocr</groupId>
  <artifactId>aocr</artifactId>
  <version>1.3</version>
</dependency>

API

To call aocr, you first have to construct an instance of de.niklasfi.aocr.AzurePdfOcr. It can be constructed like so:

final var apiHandler = new AzureApiHandler(azureEndpoint, azureSubscriptionKey);
final var pdfImageRetriever = new PdfImageExtractor(); // or alternatively: new PdfImageRenderer();
final var pdfIoUtil = new PdfIoUtil();
final var fileUtil = new FileUtil();

final var azurePdfOcr = new AzurePdfOcr(apiHandler, pdfImageRetriever, pdfIoUtil, fileUtil);

Now call ocr:

azurePdfOcr.ocr(inputPath, outputPath); // or one of the other variants

command line

build

cd $your_git_repo
mvn package

usage

usage: aocr
 -c,--render-color <arg>      color scheme to use when rendering page from
                              input pdf into an image. Possible values:
                              - binary: convert to black / white image
                              - gray: convert to grayscale image
                              - rgb (default): convert to full color image
 -d,--render-dpi <arg>        dpi to use when rendering page from input
                              pdf into an image. Defaults to 300 dpi.
 -e,--endpoint <arg>          azure cognitive services endpoint url
 -i,--input <arg>             path to input pdf file
 -k,--key <arg>               subscription key to access azure cognitive
                              services
 -o,--output <arg>            path to save output to
 -r,--retrieve-method <arg>   method to use to retrieve images from input
                              pdf. Possible values:
                              - extract (default): use the largest image
                              on the page (useful for scans)
                              - render: render the page into an image. dpi
                              and color modes may be configured using
                              --render-dpi and --render-color

for example:

./target/aocr \
    -e $your_azure_cognitive_services_endpoint_url \
    -k $your_azure_subscription_key \
    -i $your_input_file \
    -o $your_output_file
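
If you would rather drive the binary from a java program than from a shell script, the invocation above can be assembled with ProcessBuilder. The following is only a minimal sketch: the class name AocrCommand, the helper buildCommand and the environment variable names AZURE_ENDPOINT and AZURE_KEY are made up for illustration and are not part of aocr. Only the flags and the ./target/aocr path are taken from the usage text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of aocr: assembles the command line shown above.
class AocrCommand {

    /**
     * Builds the argument list for the aocr binary.
     * If renderDpi is non-null, the page is rendered (-r render) at that dpi (-d);
     * otherwise the default retrieve method (extract) is left in place.
     */
    static List<String> buildCommand(String binary, String endpoint, String key,
                                     String input, String output, Integer renderDpi) {
        final var cmd = new ArrayList<String>();
        cmd.add(binary);
        cmd.add("-e");
        cmd.add(endpoint);
        cmd.add("-k");
        cmd.add(key);
        cmd.add("-i");
        cmd.add(input);
        cmd.add("-o");
        cmd.add(output);
        if (renderDpi != null) {
            cmd.add("-r");
            cmd.add("render");
            cmd.add("-d");
            cmd.add(String.valueOf(renderDpi));
        }
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Assumed environment variable names; use whatever fits your setup.
        final var endpoint = System.getenv("AZURE_ENDPOINT");
        final var key = System.getenv("AZURE_KEY");
        if (endpoint == null || key == null || args.length < 2) {
            System.err.println("set AZURE_ENDPOINT and AZURE_KEY, then pass <input.pdf> <output.pdf>");
            return;
        }
        final var cmd = buildCommand("./target/aocr", endpoint, key, args[0], args[1], null);
        // inheritIO forwards the output of aocr to this process' console.
        final int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.exit(exitCode);
    }
}
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Reading the subscription key from the environment keeps it out of source files, though it still shows up in the process arguments of the aocr call.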
