document-processing

There are 91 repositories under the document-processing topic.

  • enoch3712/ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

    Language:Python1.4k21172137
  • dhlab-epfl/dhSegment

    Generic framework for historical document processing

    Language:Python3792750114
  • awslabs/project-lakechain

    ⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

    Language:TypeScript185113325
  • formkiq/formkiq-core

    A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!

    Language:Java141626422
  • awslabs/rhubarb

    A Python framework for multi-modal document understanding with Amazon Bedrock

    Language:Python9461312
  • iamarunbrahma/pdf-to-markdown

    Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

    Language:Python94327
  • parsee-ai/parsee-core

    Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

    Language:Python67401
  • steindani/pandoc-include

    An include filter for Pandoc

    Language:Haskell6241120
  • aws-solutions/enhanced-document-understanding-on-aws

    Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

    Language:JavaScript40162018
  • cburschka/lyx

    Unofficial mirror of git://git.lyx.org/lyx.git (updated daily; not affiliated with lyx.org).

    Language:C++36607
  • kili-technology/awesome-datasets

    A comprehensive list of annotated training datasets classified by use case.

  • afrozas/proceedings

    Semantic extraction from conference proceedings.

    Language:Python31001
  • jmanhype/DSPy-Multi-Document-Agents

    An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

    Language:Python27102
  • MBAigner/PDFSegmenter

    This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

    Language:Python23103
  • greed2411/tokyo

    tokyo, a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and extracts the text 💪.

    Language:Clojure18100
  • aws-samples/sample-document-processing-with-amazon-bedrock-data-automation

    This repository contains examples for customers to get started using Amazon Bedrock Data Automation. The samples focus mainly on document processing use cases.

    Language:Jupyter Notebook161012
  • eklem/stopword-trainer

    A module for creating stopword lists for any language, based on a set of documents.

    Language:JavaScript141620
  • abdullahshafiq-20/ResumeConvertorLatex

    ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

    Language:JavaScript130
  • thammuio/doc-genius-ai

    DocGenius AI - Generative AI Chatbot for your Documents

    Language:Python13206
  • Swiftgum/swiftgum

    ETL for RAG. Transform any source into LLM-ready markdown. Focus on your AI, not integrations.

    Language:TypeScript11200
  • drgsn/filefusion

    FileFusion is a powerful file concatenation tool designed specifically for Large Language Model (LLM)

    Language:Go90
  • abdur75648/urdu-text-detection

    Text line detection for Urdu OCR (UTRNet)

    Language:Python6301
  • aws-samples/idp-invoice-automation-using-bedrock-data-automation-cdk

    Serverless Intelligent Document Processing (IDP) solution for invoice automation using Amazon Bedrock Data Automation. Features automated data extraction, annotation, and processing pipeline built with AWS services and CDK.

    Language:Python620
  • jayllfpt/table2html

    A Python package that converts table images into HTML format using Object Detection model and OCR.

    Language:Python6100
  • jeanbaptisteb/doccleaner

    A Python command-line utility for automating some copyediting tasks in documents. It allows editing zipped, XML-based files (e.g. docx, odt, or epub) through XSLT stylesheets, and can be extended fairly easily with your own custom XSL stylesheets.

    Language:XSLT5252
  • AmadeusITGroup/docs2vecs

    CLI that helps with splitting and embedding docs and exposing them seamlessly

    Language:Python4377
  • CentralFloridaAttorney/zmongo_retriever

    zmongo_toolbag contains an easy-to-use MongoDB wrapper with a LangChain Vector Search Retriever implementation

    Language:Python4101
  • aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai

    This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.

    Language:Jupyter Notebook310
  • Gaurav4604/buddy

    Buddy is a Retrieval Augmented Generation based Python toolkit to help students studying STEM subjects

    Language:Python3101
  • Shahrom-S/BarsAI

    AI assistant

    Language:Python3200
  • SvenEichelsheimer/filegazer

    FileGazer - deep file analysing and categorisation

  • Huang-lab/figure-extractor

    Flask-based service using PDFFigures 2.0 to extract figures and tables from scholarly PDFs. Features REST API, CLI, Docker support, and JSON metadata output (~1.5s/page processing). Designed for document processing and RAG pipelines.

    Language:Python20
  • Jayanth-MKV/advanced-rag-cookbooks

    Advanced RAG Techniques and Projects

    Language:Jupyter Notebook2
  • kallebysantos/ocrlot

    A distributed ocr engine 🐆

    Language:Elixir20
  • mansi104-ai/BrevityAI

    BrevityAI 🚀 simplifies understanding lengthy documents by generating concise, accurate summaries. Built with T5, Streamlit, and Python, it supports PDF and JSON inputs with customizable options for summary length and brevity.

    Language:Python220