
Automatically parse PDFs and text into dataclasses


piah



Piah automatically parses data from PDFs or text based only on the dataclass you provide, and returns an instance of that dataclass filled with the extracted values. Piah is based on OxyParser.


Installation

pip install piah

Usage

First, set your API key in an environment variable:

import os

os.environ["OPENAI_API_KEY"] = "your-api-key"

Alternatively, set it in a .env file. Then use piah, e.g.:

from piah import Piah
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

parser = Piah("gpt-3.5-turbo")
result = parser.parse("Hello Iam python and I have 33 years old", Person)
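Because the dataclass is the only schema piah is given, it may help to see what information a dataclass exposes. The following is a minimal sketch using only the standard library; `describe_schema` is a hypothetical helper for illustration and is not part of piah's API, and nothing here claims to show how piah works internally.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


def describe_schema(cls):
    """Return a {field name: type name} mapping for a dataclass.

    Hypothetical helper, not part of piah.
    """
    return {f.name: getattr(f.type, "__name__", str(f.type)) for f in fields(cls)}


print(describe_schema(Person))  # prints {'name': 'str', 'age': 'int'}
```

Field names and type annotations like these are everything a dataclass-driven parser has to work with, so descriptive field names are likely to matter.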

To parse PDFs:

from pathlib import Path

# Pass the PDF path as a string...
result = parser.parse("example.pdf", Person)
# ...or as a Path object
result = parser.parse(Path("example.pdf"), Person)

Supported Models and Providers

piah uses LiteLLM, so consult the LiteLLM docs to check whether the desired model is supported.

TODO

  • Write docstrings
  • Improve allowed types
  • Improve system prompt

Known Issues

piah does not appear to pass its tests on every run, because the LLM does not always parse large PDFs correctly.
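One possible workaround on the caller's side is to validate the returned dataclass and retry when it looks wrong. This is a sketch under stated assumptions, not a feature of piah: `is_complete` and `parse_with_retry` are hypothetical helpers, `parse` is assumed to be any callable that takes `(source, cls)` like `parser.parse` in the Usage section, and the type check only handles plain classes such as `str` and `int`.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


def is_complete(obj):
    """True when every field holds a value of its annotated type.

    Assumes annotations are plain classes (str, int, ...), not generics.
    """
    return all(isinstance(getattr(obj, f.name), f.type) for f in fields(obj))


def parse_with_retry(parse, source, cls, attempts=3):
    """Call parse(source, cls) until the result passes is_complete.

    Hypothetical helper: `parse` is any callable with that signature.
    """
    last_error = None
    for _ in range(attempts):
        try:
            result = parse(source, cls)
        except Exception as exc:  # remember the failure and try again
            last_error = exc
            continue
        if isinstance(result, cls) and is_complete(result):
            return result
    raise ValueError(
        f"no valid {cls.__name__} after {attempts} attempts"
    ) from last_error
```

With a parser created as in the Usage section, this could be called as `parse_with_retry(parser.parse, "example.pdf", Person)`. Retrying repeats the LLM call, so each extra attempt adds cost and latency.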

License

piah is distributed under the terms of the MIT license.