Document AI LABS

This repository contains sample code for Document AI on Google Cloud (GCP). These are mainly Python scripts meant to be copied and reused, rather than full .ipynb notebooks.

Setup and authentication instructions for the Vertex SDK are available here. Please complete them before trying any of the labs below.

Lab 1: Form parser

This lab contains a script to make predictions with the Form parser. It uses a public PDF sample located at gs://cloud-samples-data/documentai/form.pdf.
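As a rough sketch of what such a prediction call can look like with the google-cloud-documentai client library, the snippet below sends a local copy of the sample PDF to a processor. The function names, project ID, processor ID, and file path are placeholders rather than values from this repository, and the lab's own script may be organised differently.

```python
def endpoint_for(location: str) -> str:
    """Regional API endpoint; it has to match the processor's location."""
    return f"{location}-documentai.googleapis.com"


def processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Full resource name of a processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def process_local_pdf(project_id, location, processor_id, pdf_path):
    """Send a local PDF to a processor and return the parsed Document."""
    # Imported lazily so the helpers above work without the library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=endpoint_for(location))
    )
    with open(pdf_path, "rb") as f:
        raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")
    request = documentai.ProcessRequest(
        name=processor_name(project_id, location, processor_id),
        raw_document=raw,
    )
    return client.process_document(request=request).document
```

Assuming the sample has first been copied locally (for example with gsutil cp), a call such as process_local_pdf("my-project", "us", "my-form-parser-id", "form.pdf") would return a Document whose text and pages attributes hold the parsed content.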

One of the scripts returns a pandas dataframe with the fields detected, as well as bounding boxes, generating a result like the following:

Bounding boxes result
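One possible way to flatten the detected fields into rows is sketched below. It assumes the Document is available in its JSON form (camelCase keys, as returned by the REST API); the function names and the chosen columns are illustrative and not taken from the lab's script.

```python
def anchor_text(full_text, layout):
    """Resolve a layout's text anchor into the substring it points to."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    # startIndex is omitted in JSON when it is 0, and indices may be strings.
    return "".join(
        full_text[int(s.get("startIndex", 0)):int(s.get("endIndex", 0))]
        for s in segments
    ).strip()


def form_field_rows(document):
    """Flatten the form fields of a Document dict into one row per field."""
    rows = []
    text = document.get("text", "")
    for page in document.get("pages", []):
        for field in page.get("formFields", []):
            name = field.get("fieldName", {})
            value = field.get("fieldValue", {})
            vertices = value.get("boundingPoly", {}).get("normalizedVertices", [])
            rows.append({
                "page": page.get("pageNumber"),
                "field_name": anchor_text(text, name),
                "field_value": anchor_text(text, value),
                "confidence": value.get("confidence"),
                # Normalized (0..1) corner coordinates of the value's box.
                "bounding_box": [(v.get("x", 0.0), v.get("y", 0.0)) for v in vertices],
            })
    return rows
```

Passing the resulting list to pandas.DataFrame(rows) gives one dataframe row per detected field.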

Lab 2: Invoice parser and Human-in-the-loop

This lab contains some scripts to make predictions with the invoice parser. It uses a public PDF sample located at gs://cloud-samples-data/documentai/invoice.pdf.

The invoice parser, as well as other specialized processors, supports Human-in-the-Loop (HITL) review. There are two ways to trigger a HITL operation: the REST API or the Python SDK.

  1. With the REST API, invoke the projects.locations.processors.process method. Note that the document file must be sent inline, encoded in base64, and that the regional endpoint (us-documentai here) must match the processor location in the path.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    https://us-documentai.googleapis.com/v1/projects/655797269815/locations/us/processors/bad52526b46aa2b6:process
  2. With the Python SDK, create a DocumentProcessorServiceClient and send requests through it:
from google.cloud import documentai
client = documentai.DocumentProcessorServiceClient()
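The request.json file passed to curl carries the base64-encoded document. A minimal sketch of how it could be produced with the Python standard library follows; the helper names and default output path are assumptions, while rawDocument, mimeType, and skipHumanReview are fields of the v1 process request body.

```python
import base64
import json


def build_process_request(pdf_bytes: bytes, skip_human_review: bool = False) -> dict:
    """Build the JSON body for processors.process with an inline document."""
    return {
        "rawDocument": {
            # The REST API expects the file content as base64 text.
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        },
        # False leaves the processor free to route the document to human
        # review when its HITL configuration calls for it.
        "skipHumanReview": skip_human_review,
    }


def write_request_file(pdf_path: str, out_path: str = "request.json") -> None:
    """Read a local PDF and write the matching request body to out_path."""
    with open(pdf_path, "rb") as f:
        body = build_process_request(f.read())
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(body, out)
```

Running write_request_file("invoice.pdf") on a local copy of the sample invoice would produce a file suitable for the -d @request.json argument.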

HITL labeler console

Additionally, the invoice parser supports Enterprise Knowledge Graph (EKG) enrichment. Normalized or enriched fields include:

  • Supplier Name (supplier_name)
  • Supplier Address (supplier_address)
  • Date
  • Number
  • Price
  • Phone Number (supplier_phone)

EKG
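To see which fields came back normalized or enriched, the raw mention can be compared with the entity's normalized value. The sketch below assumes the Document is in its JSON form, where each entity carries type, mentionText, and an optional normalizedValue.text; the function name and the "changed" column are illustrative.

```python
def normalized_entities(document):
    """List each entity with its raw mention and normalized text, if any."""
    rows = []
    for entity in document.get("entities", []):
        mention = entity.get("mentionText", "")
        normalized = entity.get("normalizedValue", {}).get("text")
        rows.append({
            "type": entity.get("type"),
            "mention_text": mention,
            "normalized_text": normalized,
            # True when normalization or enrichment altered the raw text.
            "changed": normalized is not None and normalized != mention,
        })
    return rows
```

Filtering the result on the changed flag shows only the entities whose text was altered by normalization or enrichment.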

Lab 3: W-8 (FATCA) and W-9 parser

This lab contains some scripts to make predictions with the W-9 parser. It can be used for both W-8 (FATCA) and W-9 documents. The two forms differ in who files them: the W-9 is required only from US companies or companies operating in the US, while W-8 forms are filed by foreign individuals and entities.

Pretty-table result from the Python script:

W9 specialized parser result
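A table like this can be rendered with nothing more than string padding. The following is a standard-library sketch of the idea; whether the lab's script uses a dedicated library for this is not stated here, and the function name is an assumption.

```python
def pretty_table(rows, headers):
    """Render rows (lists of values) as a text table with aligned columns."""
    cells = [[str(h) for h in headers]] + [[str(v) for v in row] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    lines = [rule]
    for index, row in enumerate(cells):
        padded = "|".join(f" {c.ljust(w)} " for c, w in zip(row, widths))
        lines.append("|" + padded + "|")
        if index == 0:
            lines.append(rule)  # separator between header and body
    lines.append(rule)
    return "\n".join(lines)
```

Each row would typically hold an entity type, its text, and its confidence, taken from the parser's response.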

Lab 4: Tables

This lab extracts tables using the form parser; documentation is available here. It focuses only on the JSON output containing the table information. There are two scripts:

  • tables.py: extract tables from a pdf file
  • 2csv.py: extract tables from a pdf file, and convert the JSON output to CSV
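The JSON-to-CSV step can be sketched as follows, assuming the Document JSON exposes tables under pages[].tables[] with headerRows and bodyRows whose cells point into the document text. The function names are illustrative and need not match what 2csv.py does.

```python
import csv
import io


def cell_text(full_text, cell):
    """Return the text a table cell points to, with whitespace collapsed."""
    segments = cell.get("layout", {}).get("textAnchor", {}).get("textSegments", [])
    # startIndex is omitted in JSON when it is 0, and indices may be strings.
    text = "".join(
        full_text[int(s.get("startIndex", 0)):int(s.get("endIndex", 0))]
        for s in segments
    )
    return " ".join(text.split())


def table_to_csv(document, page_index=0, table_index=0):
    """Convert one table of a Document dict to CSV text, header rows first."""
    table = document["pages"][page_index]["tables"][table_index]
    full_text = document.get("text", "")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table.get("headerRows", []) + table.get("bodyRows", []):
        writer.writerow([cell_text(full_text, c) for c in row.get("cells", [])])
    return buffer.getvalue()
```

Writing the returned string to a .csv file gives one file per table; looping over page_index and table_index covers documents with several tables.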

More sample code for extracting tables can be found here.

An example that uses pandas to convert the table to CSV is available here.

References

[1] Codelab: Use Procurement Document AI to Parse your Invoices using AI Platform Notebooks
[2] Codelab: Intro to Document AI and OCR
[3] Codelab: Specialized processors with Document AI
[4] Codelab: Human in the Loop
[5] Codelab: Form parsing
[6] Repository: Google Cloud Document AI github repository