document-processing
There are 91 repositories under the document-processing topic.
enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
dhlab-epfl/dhSegment
Generic framework for historical document processing
awslabs/project-lakechain
⚡ Cloud-native, AI-powered document processing pipelines on AWS.
formkiq/formkiq-core
A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!
awslabs/rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
parsee-ai/parsee-core
Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.
steindani/pandoc-include
An include filter for Pandoc
aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
cburschka/lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily; not affiliated with lyx.org).
kili-technology/awesome-datasets
A comprehensive list of annotated training datasets classified by use case.
afrozas/proceedings
Semantic extraction from conference proceedings.
jmanhype/DSPy-Multi-Document-Agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
MBAigner/PDFSegmenter
This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.
greed2411/tokyo
tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and extracts the text 💪.
aws-samples/sample-document-processing-with-amazon-bedrock-data-automation
This repository contains examples to help customers get started with Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases.
eklem/stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
abdullahshafiq-20/ResumeConvertorLatex
ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.
thammuio/doc-genius-ai
DocGenius AI - Generative AI Chatbot for your Documents
Swiftgum/swiftgum
ETL for RAG. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.
drgsn/filefusion
FileFusion is a powerful file concatenation tool designed specifically for Large Language Model (LLM)
abdur75648/urdu-text-detection
Text line detection for Urdu OCR (UTRNet)
aws-samples/idp-invoice-automation-using-bedrock-data-automation-cdk
Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.
jayllfpt/table2html
A Python package that converts table images into HTML format using an object detection model and OCR.
jeanbaptisteb/doccleaner
A Python command-line utility for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub) through XSLT stylesheets, and it can be extended fairly easily with your own custom XSL stylesheets.
AmadeusITGroup/docs2vecs
CLI that helps with splitting docs, embedding them, and exposing them in a seamless manner
CentralFloridaAttorney/zmongo_retriever
zmongo_toolbag contains an easy-to-use MongoDB wrapper with a LangChain Vector Search Retriever implementation
aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai
This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.
Gaurav4604/buddy
Buddy is a Python toolkit based on Retrieval Augmented Generation that helps students studying STEM subjects
Shahrom-S/BarsAI
AI assistant
SvenEichelsheimer/filegazer
FileGazer - deep file analysis and categorisation
Huang-lab/figure-extractor
Flask-based service using PDFFigures 2.0 to extract figures and tables from scholarly PDFs. Features REST API, CLI, Docker support, and JSON metadata output (~1.5s/page processing). Designed for document processing and RAG pipelines.
Jayanth-MKV/advanced-rag-cookbooks
Advanced RAG Techniques and Projects
kallebysantos/ocrlot
A distributed OCR engine 🐆
mansi104-ai/BrevityAI
BrevityAI 🚀 simplifies understanding lengthy documents by generating concise, accurate summaries. Built with T5, Streamlit, and Python, it supports PDF and JSON inputs with customizable options for summary length and brevity.