/docai

Structured information extraction from documents

Primary LanguagePythonMIT LicenseMIT

DocAI

Extract structured data from unstructured documents using Answer.AI's Byaldi, OpenAI gpt-4o, and Langchain's structured output.

Installation

pyenv virtualenv 3.10.6 docai
pyenv activate docai
poetry install

Environment vars

Ensure you have an OPENAI_API_KEY and HF_TOKEN set in your environment variables.

export OPENAI_API_KEY=<your key>
export HF_TOKEN=<your token>

Sample usage

Build the index from the pdfs/ folder:

python scripts/build_index.py --folder "pdfs/" --index_name "application"

Extraction structured information from the index (open extract.py to see queries and pydantic models):

python scripts/extract.py

Sample output

What losses have occurred in the past 5 years?
LossHistory(
    losses=[
        Loss(loss_date='2/20/21', loss_amount=7003.0, loss_description='Claimant was in his sleeper when his truck got hit by insured driver on the left', date_of_claim='4/19/21'),
        Loss(loss_date='2/4/21', loss_amount=92584.0, loss_description='The IV was attempting to merge on the highway when the IV lost control and struck', date_of_claim='4/30/21'),
        Loss(loss_date='9/14/21', loss_amount=5583.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21'),
        Loss(loss_date='9/14/21', loss_amount=6299.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21')
    ]
)

What is the basic application information?
Application(
  insured_name='Greentown Burgers LLC', 
  insured_address='Not provided', 
  insured_phone='Not provided',
  insured_email='Not provided', 
  effective_date='07/22/2024'
)