pdf-to-text

There are 81 repositories under the pdf-to-text topic.

infiniflow/ragflow
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
Language:TypeScript64.3k 293 5.2k6.7k
docling-project/docling
Get your documents ready for gen AI
Language:Python38.7k 165 1.2k2.7k
Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Language:HTML12.7k 68 1.2k1k
run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
Language:TypeScript4.1k 26 455449
enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Language:Python1.4k 20 147137
Academic-Hammer/SciTSR
Table structure recognition dataset of the paper: Complicated Table Structure Recognition
Language:Python360 12 4258
pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Language:HTML327 7 2239
shoryasethia/markdrop
A Python package for converting PDFs to Markdown while extracting images and tables, and for generating text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.
Language:Python151 1 63
NanoNets/ocr-python
OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.
Language:Jupyter Notebook113 3 614
nainiayoub/pdf-text-data-extractor
PDF text data extraction web app with OCR for scanned documents
Language:Python88 4 449
datalogics/adobe-pdf-library-samples
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
82 26 1162
BitMiracle/Docotic.Pdf.Samples
C# and VB.NET samples for Docotic.Pdf library
Language:Visual Basic .NET78 10 1439
galkahana/pdf-text-extraction
CLI for extracting text (and possibly tables) from PDF files
Language:C++78 2 1120
papercast-dev/papercast
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.
Language:Python52 1 91
mbzuai-oryx/KITAB-Bench
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Language:Python491
iditectweb/converter
Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies; converts PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in .NET Framework
Language:C#40 2 312
seinecle/nocodefunctions-web-app
The code base of the front-end of nocodefunctions.com
Language:CSS38 3 37
shine-jayakumar/Extract-Data-From-PDF-In-Python
Batch-convert PDFs to text and extract data from PDFs in Python
Language:Python30 1 012
asika32764/php-pdf-2-text
Simple PHP PDF to Text class
Language:PHP24 3 117
graphlit/graphlit
Graphlit Platform
22 1 01
asepmaulanaismail/pdf-to-txt-python
Simple pdf to text with python using PDFtk and PyPDF2
Language:Python21 2 115
LuisAraujo/API-Tabua-Mare
API for obtaining daily Tide Table data, using web scraping with PHP.
Language:JavaScript17 4 58
Clearedge-AI/clearedge
Build a RAG preprocessing pipeline
Language:Jupyter Notebook12 2 00
madnight/pdf-layout-text-stripper
Converts a pdf file into a text file while keeping the layout of the original pdf.
Language:Java11 3 03
andrealenzi11/py-poppleract
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents
Language:Python10 1 02
aspose-pdf/Aspose.PDF-for-JavaScript-via-CPP
Aspose.PDF for Javascript via C++
Language:HTML10 14 00
AshkanAbd/pdf2word-GUI
Convert PDF to Word
Language:Java9 3 06
datalogics/apdfl-cplusplus-samples
Sample code for the Datalogics C++ interface of the Adobe PDF Library
Language:C++9 11 07
asiff00/bangla-pdf-ocr
Bangla PDF to text converter that works on Windows, macOS, and Linux without any extra downloads or configurations.
Language:Python8 1 02
bytescout/pdf-extractor-sdk-samples
ByteScout PDF Extractor SDK source code samples
Language:C#8 1 05
datalogics/apdfl-csharp-dotnet-samples
Sample code for the Datalogics .NET interface of the Adobe PDF Library
Language:C#8 10 19
mic-kul/pdf-textstream
JRuby gem that converts PDF to text while keeping the layout of the original PDF file
Language:Java8 1 00
ExceptedPrism3/PDFToAudio
"PDF To Audio" is a Python tool that transforms PDF documents into audio files using OCR and Text-to-Speech technology. Ideal for accessibility and auditory learning, it supports multiple languages, parallel processing, and smart rate limit handling.
Language:Python7 1 02
monambike/pdfconverter-pdftables-to-csv
Python project that converts tables inside PDFs to CSV for convenient data manipulation. It has log and exception handling.
Language:Python7 2 2411
renan-siqueira/python-pdf-tool
This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing a choice among different text extraction libraries and supporting both a single PDF file and a directory containing multiple PDF files.
Language:Python6 1 01
adaptaware/ragit
A RAG back and front end application
Language:Python4 2 00
