Document processing with Ollama

Introduction

This repository demonstrates how to use LLMs for document processing. It is part of a course; therefore, the XML extraction part is left as a task for the students. The repository is built with Poetry and uses the following libraries:

  • Langchain
  • Loguru
  • Jinja2
  • PyPdfium2

Installation

To install this project, use Poetry. The project is built with Python 3.11.

git clone https://github.com/mfmezger/document-processing-ollama
cd document-processing-ollama

poetry install
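
Before running `poetry install`, it can help to confirm that the active interpreter matches the Python version the project is built with. The snippet below is a minimal sketch, not part of the repository: the helper name `matches_project_python` is hypothetical, and the exact-match check on 3.11 is an assumption, since other versions may also work depending on the constraint in `pyproject.toml`.

```python
import sys

# Version the README states the project is built with.
PROJECT_PYTHON = (3, 11)


def matches_project_python(version_info=sys.version_info) -> bool:
    """Return True if the major.minor version equals PROJECT_PYTHON.

    Hypothetical helper for illustration; it only compares major and minor.
    """
    return tuple(version_info[:2]) == PROJECT_PYTHON


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if matches_project_python():
        print(f"Python {current} matches the project version.")
    else:
        print(f"Python {current} differs from 3.11; the install may still work.")
```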

Data

The dataset used is the Samples of electronic Invoices Dataset from Mendeley Data. It is available at https://data.mendeley.com/datasets/tnj49gpmtz/2 and is licensed under CC BY 4.0.
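
Once the dataset is downloaded and unpacked, the invoice files need to be collected before they can be processed. The following is a minimal sketch under stated assumptions: the local folder name `data` and the helper `find_invoice_pdfs` are hypothetical and not taken from the repository, and it assumes the invoices are stored as PDF files, which the use of PyPdfium2 suggests.

```python
from pathlib import Path


def find_invoice_pdfs(data_dir: str) -> list[Path]:
    """Return all PDF files below data_dir, sorted by path.

    Hypothetical helper; the directory layout of the dataset is an assumption.
    """
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    # rglob also walks subfolders; the suffix check is case-insensitive,
    # so files ending in ".PDF" are included as well.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "data" is an assumed location for the unpacked dataset.
    for pdf_path in find_invoice_pdfs("data"):
        print(pdf_path)
```

The returned paths can then be passed one by one to whatever PDF loading step the course exercise uses.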