/zerox

Zero shot pdf OCR with gpt-4o-mini

Primary LanguageTypeScriptMIT LicenseMIT

Hero Image

Zerox OCR

A dead simple way of OCR-ing a document for AI ingestion. Documents are meant to be a visual representation after all. With weird layouts, tables, charts, etc. The vision models just make sense!

The general logic:

  • Pass in a PDF (URL or file buffer)
  • Turn the PDF into a series of images
  • Pass each image to GPT and ask nicely for Markdown
  • Aggregate the responses and return Markdown

Sounds pretty basic! But with the gpt-4o-mini this method is price competitive with existing products, with meaningfully better results.

Pricing Comparison

This is how the pricing stacks up to other document processers. Running 1,000 pages with Zerox uses about 25M input tokens and 0.4M output tokens.

Service Cost Accuracy Table Quality
AWS Textract [1] $1.50 / 1,000 pages Low Low
Google Document AI [2] $1.50 / 1,000 pages Low Low
Azure Document AI [3] $1.50 / 1,000 pages High Mid
Unstructured (PDF) [4] $10.00 / 1,000 pages Mid Mid
------------------------ -------------------- -------- -------------
Zerox (gpt-mini) $ 4.00 / 1,000 pages High High

Installation

npm install zerox

Zerox uses graphicsmagick and ghostscript for the pdf => image processing step. These should be pulled automatically, but you may need to manually install.

Usage

With file URL

import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

From local path

import path from "path";
import { zerox } from "zerox";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  openaiAPIKey: process.env.OPENAI_API_KEY,
});

Options

const result = await zerox({
  // Required
  filePath: "path/to/file",
  openaiAPIKey: process.env.OPENAI_API_KEY,

  // Optional
  concurrency: 10, // Number of pages to run at a time.
  maintainFormat: false, // Slower but helps maintain consistent formatting.
  cleanup: true, // Clear images from tmp after run.
  outputDir: undefined, // Save combined result.md to a file
  tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory)
});

The maintainFormat option trys to return the markdown in a consistent format by passing the output of a prior page in as additional context for the next page. This requires the requests to run synchronously, so it's a lot slower. But valueable if your documents have a lot of tabular data, or frequently have tables that cross pages.

Request #1 => page_1_image
Request #2 => page_1_markdown + page_2_image
Request #3 => page_2_markdown + page_3_image

Example Output

{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012  \n' +
        '**Ship Mode:** First Class  \n' +
        '**Balance Due:** $50.10  \n' +
        '## Bill To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '## Ship To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '\n' +
        '| Item                                       | Quantity | Rate   | Amount  |\n' +
        '|--------------------------------------------|----------|--------|---------|\n' +
        "| Global Push Button Manager's Chair, Indigo | 1        | $48.71 | $48.71  |\n" +
        '| Chairs, Furniture, FUR-CH-4421             |          |        |         |\n' +
        '\n' +
        '**Subtotal:** $48.71  \n' +
        '**Discount (20%):** $9.74  \n' +
        '**Shipping:** $11.13  \n' +
        '**Total:** $50.10  \n' +
        '---\n' +
        '**Notes:**  \n' +
        'Thanks for your business!  \n' +
        '**Terms:**  \n' +
        'Order ID : CA-2012-AB10015140-40974  ',
      page: 1,
      contentLength: 747
    }
  ]
}

License

This project is licensed under the MIT License.