Extract structured data from unstructured documents using Answer.AI's Byaldi, OpenAI's GPT-4o, and LangChain's structured output.
pyenv virtualenv 3.10.6 docai
pyenv activate docai
poetry install
Make sure OPENAI_API_KEY and HF_TOKEN are set in your environment:
export OPENAI_API_KEY=<your key>
export HF_TOKEN=<your token>
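Before building the index, it can help to confirm that both variables are visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of this repo.

```python
import os

# The two variables this README asks you to export.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```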
Build the index from the pdfs/ folder:
python scripts/build_index.py --folder "pdfs/" --index_name "application"
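For orientation, here is a rough sketch of what an indexing script with these two flags might look like. It is not the contents of scripts/build_index.py: the function names, the `vidore/colpali` checkpoint, and the `store_collection_with_index` and `overwrite` settings are all assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the same two flags used in the command above."""
    parser = argparse.ArgumentParser(description="Index a folder of PDFs with Byaldi.")
    parser.add_argument("--folder", required=True, help="Folder containing the PDFs")
    parser.add_argument("--index_name", required=True, help="Name to store the index under")
    return parser.parse_args(argv)


def build_index(folder, index_name, model_name="vidore/colpali"):
    """Index every document in `folder` under `index_name` (assumed settings)."""
    # Imported lazily so argument parsing works even without Byaldi installed.
    from byaldi import RAGMultiModalModel

    model = RAGMultiModalModel.from_pretrained(model_name)
    model.index(
        input_path=folder,
        index_name=index_name,
        store_collection_with_index=False,  # assumption: keep the index small
        overwrite=True,  # assumption: rebuild if the index already exists
    )
    return model
```

A script built this way would call `args = parse_args()` followed by `build_index(args.folder, args.index_name)`.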
Extract structured information from the index (open scripts/extract.py to see the queries and Pydantic models):
python scripts/extract.py
What losses have occurred in the past 5 years?
LossHistory(
losses=[
Loss(loss_date='2/20/21', loss_amount=7003.0, loss_description='Claimant was in his sleeper when his truck got hit by insured driver on the left', date_of_claim='4/19/21'),
Loss(loss_date='2/4/21', loss_amount=92584.0, loss_description='The IV was attempting to merge on the highway when the IV lost control and struck', date_of_claim='4/30/21'),
Loss(loss_date='9/14/21', loss_amount=5583.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21'),
Loss(loss_date='9/14/21', loss_amount=6299.0, loss_description='IV was in the fast lane, when IV tire flew off and struck OV1, OV2, OV3, OV4', date_of_claim='9/15/21')
]
)
What is the basic application information?
Application(
insured_name='Greentown Burgers LLC',
insured_address='Not provided',
insured_phone='Not provided',
insured_email='Not provided',
effective_date='07/22/2024'
)
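The two results above suggest Pydantic models along the following lines. This is a sketch inferred from the printed fields only; the actual definitions in scripts/extract.py may use different types, defaults, or field descriptions. The `make_extractor` helper is hypothetical and shows one way to bind a schema to GPT-4o with LangChain's `with_structured_output`.

```python
from typing import List

from pydantic import BaseModel


class Loss(BaseModel):
    # Field names and types are inferred from the sample output above.
    loss_date: str
    loss_amount: float
    loss_description: str
    date_of_claim: str


class LossHistory(BaseModel):
    losses: List[Loss]


class Application(BaseModel):
    insured_name: str
    insured_address: str
    insured_phone: str
    insured_email: str
    effective_date: str


def make_extractor(schema, model_name="gpt-4o"):
    """Return a chat model whose output is parsed into `schema` (hypothetical helper)."""
    # Imported lazily so the models above can be used without LangChain installed.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model=model_name).with_structured_output(schema)
```

With this setup, `make_extractor(LossHistory).invoke(prompt)` would return a `LossHistory` instance. How the pages retrieved from the Byaldi index are passed to the model is left to scripts/extract.py and is not shown here.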