Swiftly add OCR layers to scanned PDF files.
Unfortunately, existing open-source OCR solutions (such as Tesseract) pale in comparison with the commercially available ones. The Azure Read API provides particularly good results. It is also easy to set up, but while it can annotate text in images, there is no easy way to upload and OCR a full PDF document.
That is, until now. aocr provides an easy way to OCR full PDF documents.
aocr can be used in two main ways: it can be called from a shell as an ELF binary on Linux, or it can function as a Java library for reuse in other projects.
To use aocr as a library, add the following snippet to the dependencies in your Maven pom.xml:
<dependency>
    <groupId>de.niklasfi.aocr</groupId>
    <artifactId>aocr</artifactId>
    <version>1.3</version>
</dependency>
To call aocr, you first have to construct an instance of de.niklasfi.aocr.AzurePdfOcr. It can be constructed like so:
final var apiHandler = new AzureApiHandler(azureEndpoint, azureSubscriptionKey);
final var pdfImageRetriever = new PdfImageExtractor(); // or alternatively: new PdfImageRenderer();
final var pdfIoUtil = new PdfIoUtil();
final var fileUtil = new FileUtil();
final var azurePdfOcr = new AzurePdfOcr(apiHandler, pdfImageRetriever, pdfIoUtil, fileUtil);
Now call ocr:
azurePdfOcr.ocr(inputPath, outputPath); // or one of the other variants
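Since ocr handles one document per call, processing a whole folder of scans needs a small loop around it. The following is a minimal sketch of such a loop, not part of aocr itself: BatchOcr, OcrStep, outputFor and the ".ocr.pdf" naming scheme are hypothetical names chosen for illustration, and the sketch assumes that the ocr variant you use accepts two java.nio.file.Path arguments.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of aocr: runs an OCR step over every PDF in a directory. */
final class BatchOcr {

    /** Stand-in for a call such as azurePdfOcr.ocr(inputPath, outputPath). */
    @FunctionalInterface
    interface OcrStep {
        void ocr(Path input, Path output) throws Exception;
    }

    /** Maps "scan.pdf" to "scan.ocr.pdf" inside outputDir (the naming scheme is an assumption). */
    static Path outputFor(Path input, Path outputDir) {
        final var name = input.getFileName().toString();
        final var stem = name.substring(0, name.length() - ".pdf".length());
        return outputDir.resolve(stem + ".ocr.pdf");
    }

    /** Processes all *.pdf files directly inside inputDir and returns the output paths. */
    static List<Path> ocrAll(Path inputDir, Path outputDir, OcrStep step) throws Exception {
        Files.createDirectories(outputDir);
        final List<Path> inputs;
        try (Stream<Path> files = Files.list(inputDir)) {
            inputs = files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf"))
                .sorted()
                .collect(Collectors.toList());
        }
        final List<Path> outputs = new ArrayList<>();
        for (final Path input : inputs) {
            final Path output = outputFor(input, outputDir);
            step.ocr(input, output); // one OCR call per document
            outputs.add(output);
        }
        return outputs;
    }
}
```

If the signatures match, the instance constructed above could be passed as a method reference, for example BatchOcr.ocrAll(Path.of("scans"), Path.of("scans-ocr"), azurePdfOcr::ocr). Keeping the OCR step behind a small interface also lets the loop be tested without contacting Azure.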
To use aocr from the shell, first build the binary from the repository root:
cd $your_git_repo
mvn package
usage: aocr
 -c,--render-color <arg>      color scheme to use when rendering page from
                              input pdf into an image. Possible values:
                              - binary: convert to black / white image
                              - gray: convert to grayscale image
                              - rgb (default): convert to full color image
 -d,--render-dpi <arg>        dpi to use when rendering page from input
                              pdf into an image. Defaults to 300 dpi.
 -e,--endpoint <arg>          azure cognitive services endpoint url
 -i,--input <arg>             path to input pdf file
 -k,--key <arg>               subscription key to access azure cognitive
                              services
 -o,--output <arg>            path to save output to
 -r,--retrieve-method <arg>   method to use to retrieve images from input
                              pdf. Possible values:
                              - extract (default): use the largest image
                              on the page (useful for scans)
                              - render: render the page into an image. dpi
                              and color modes may be configured using
                              --render-dpi and --render-color
For example:
./target/aocr \
-e $your_azure_cognitive_services_endpoint_url \
-k $your_azure_subscription_key \
-i $your_input_file \
-o $your_output_file
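The same invocation can also be assembled from a Java program that prefers to launch the binary rather than link the library. The sketch below is an illustration only: AocrCommand, build and run are hypothetical names that do not exist in aocr, and the sketch uses just the flags listed in the usage text above. It assumes that passing a DPI value means the render method is wanted, since the usage text describes --render-dpi as an option of that method.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of aocr: assembles the documented CLI flags into a command. */
final class AocrCommand {

    /**
     * Builds the argument list for one aocr run.
     * renderDpi may be null, in which case the default retrieve method (extract) is used.
     */
    static List<String> build(String binary, String endpoint, String key,
                              String input, String output, Integer renderDpi) {
        final List<String> cmd = new ArrayList<>(List.of(
            binary, "-e", endpoint, "-k", key, "-i", input, "-o", output));
        if (renderDpi != null) {
            // Assumption: a DPI value implies the render method should be selected.
            cmd.addAll(List.of("-r", "render", "-d", String.valueOf(renderDpi)));
        }
        return cmd;
    }

    /** Starts the command, forwards its output to this process, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Building the command as a list of separate arguments, instead of one concatenated string, avoids shell quoting problems when file paths contain spaces.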