VikParuchuri/pdf_to_md

Python

Convert PDFs to markdown

Extract text from the PDF with pymupdf
Remove headers/footers by clustering with the DBSCAN algorithm
Convert the text to markdown with a fine-tuned LLM
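
To give a feel for the header/footer step, here is a minimal sketch and not the code in this repo. It assumes each extracted line is a hypothetical `(page, y, text)` tuple, where `y` is the vertical position as a fraction of page height. It also assumes a distance that adds the vertical offset to the text dissimilarity, and it uses a small pure-Python DBSCAN so the example runs without extra dependencies. The names `line_distance`, `dbscan`, and `strip_headers_footers`, and the `eps` and `min_pages` values, are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def _normalize(text):
    # Collapse digit runs so "Page 1" and "Page 2" compare as equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def line_distance(a, b):
    """Assumed distance between two (page, y, text) lines:
    vertical offset plus text dissimilarity (0.0 means identical)."""
    text_gap = 1.0 - SequenceMatcher(None, _normalize(a[2]), _normalize(b[2])).ratio()
    return abs(a[1] - b[1]) + text_gap


def dbscan(points, eps, min_samples, dist):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


def strip_headers_footers(lines, eps=0.1, min_pages=3):
    """Drop lines whose cluster shows up on at least `min_pages` distinct pages."""
    labels = dbscan(lines, eps, min_pages, line_distance)
    pages_per_cluster = {}
    for (page, _, _), label in zip(lines, labels):
        if label != -1:
            pages_per_cluster.setdefault(label, set()).add(page)
    repeated = {c for c, pages in pages_per_cluster.items() if len(pages) >= min_pages}
    return [line for line, label in zip(lines, labels) if label not in repeated]


# Made-up sample: three pages, each with the same header and a numbered footer.
sample_lines = [
    (0, 0.03, "Annual Report 2023"),
    (0, 0.40, "Revenue grew in the first quarter."),
    (0, 0.97, "Page 1"),
    (1, 0.03, "Annual Report 2023"),
    (1, 0.40, "Costs fell after the reorganization."),
    (1, 0.97, "Page 2"),
    (2, 0.03, "Annual Report 2023"),
    (2, 0.40, "The outlook remains uncertain."),
    (2, 0.97, "Page 3"),
]
body_lines = strip_headers_footers(sample_lines)
```

Lines that recur at the same height with near-identical text form dense clusters, while body text ends up as noise and is kept. With real PDFs, `eps` and `min_pages` would likely need tuning per document.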

Known issues: the model will repeat text if the generation goes off the rails. I need to retrain it using some lessons from the Nougat paper.

Installation

poetry install

Usage