document-processing

There are 57 repositories under the document-processing topic.

  • dhlab-epfl/dhSegment

    Generic framework for historical document processing

    Language: Python
  • enoch3712/ExtractThinker

    ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

    Language: Python
  • awslabs/project-lakechain

    Cloud-native, AI-powered document processing pipelines on AWS.

    Language: TypeScript
  • formkiq/formkiq-core

    A full-featured Document Layer for your application, providing the functionality of a flexible document management system, including storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. 🌟 Star to support our work!

    Language: Java
  • steindani/pandoc-include

    An include filter for Pandoc

    Language: Haskell
  • parsee-ai/parsee-core

    Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

    Language: Python
  • awslabs/rhubarb

    A Python framework for multi-modal document understanding with Amazon Bedrock

    Language: Python
  • cburschka/lyx

    Unofficial mirror of git://git.lyx.org/lyx.git (updated daily; not affiliated with lyx.org).

    Language: C++
  • aws-solutions/enhanced-document-understanding-on-aws

    Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

    Language: JavaScript
  • afrozas/proceedings

    Semantic extraction from conference proceedings.

    Language: Python
  • kili-technology/awesome-datasets

    A comprehensive list of annotated training datasets classified by use case.

  • MBAigner/PDFSegmenter

    This library builds a graph representation of the content of PDFs. The graph is then clustered, and the resulting page segments are classified and returned. Tables are returned formatted as CSV.

    Language: Python
  • jmanhype/DSPy-Multi-Document-Agents

    An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

    Language: Python
  • greed2411/tokyo

    tokyo is a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.

    Language: Clojure
  • eklem/stopword-trainer

    A module for creating stopword lists for any language, based on a set of documents.

    Language: JavaScript
  • iamarunbrahma/pdf-to-markdown

    Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

    Language: Python
  • jeanbaptisteb/doccleaner

    A Python command-line utility for automating some copyediting tasks in documents. It edits zipped, XML-based files (e.g. docx, odt, or epub) through XSLT stylesheets, and can be extended fairly easily with your own custom XSL stylesheets.

    Language: XSLT
  • abdur75648/urdu-text-detection

    Text line detection for Urdu OCR (UTRNet)

    Language: Python
  • CentralFloridaAttorney/zmongo_retriever

    Use data from MongoDB in LangChain, Llama and OpenAI

    Language: Python
  • Shahrom-S/BarsAI

    AI assistant

    Language: Python
  • SvenEichelsheimer/filegazer

    FileGazer: deep file analysis and categorisation

  • caltechlibrary/popstar

    Phone-Oriented Processing SofTware for ARchives

    Language: Makefile
  • Jayanth-MKV/RAG-fastapi-chroma-langchain-docker

    This app uses FastAPI, Chroma, and LangChain to deliver real-time chat services with streaming responses. It employs RAG for enhanced interaction and is containerized with Docker for easy deployment.

    Language: Python
  • jayllfpt/table2html

    A Python package that converts table images into HTML format using an object detection model and OCR.

    Language: Python
  • johnsirmon/clearcouncil

    ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.

    Language: Python
  • Oneirocom/generative-intent-detection

    Generative intent detection with Magick

    Language: TypeScript
  • acsenrafilho/cucaracha

    A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis

    Language: Python
  • anne27/Information-Retrieval

    An implementation of basic IR techniques from scratch.

    Language: Python
  • cemonal/Pdf2xNet

    Pdf2xNet is a .NET library for seamless integration with Xpdf tools, enabling easy conversion of PDF documents to text, images, and HTML formats within your .NET applications.

    Language: C#
  • dayang4321/MSc-Team-Project-CMPU9010-2023-24-Group-3

    TU Dublin Computer Science MSc. Final Project Group 3 - Accessibilator

    Language: Jupyter Notebook
  • eiceblue/Spire.Doc-for-C-

    Spire.Doc for C++ is a professional Word C++ library designed for developers to create, read, write, convert, merge, split, and compare Word documents on any C++ platform with fast, high-quality performance.

    Language: C++
  • joseferrerh/invoices-leanautomation

    This set of robots automatically obtains information from invoices using the docDigitizer API and keeps track of the processed invoices in an Airtable repository.

    Language: RobotFramework
  • m4nd0mb3/document-templater

    Document Templater is a powerful tool for automated document generation. Streamline the process of creating standard documents, such as contracts, reports, and forms, using predefined templates. This repository contains the source code for Document Templater, allowing you to easily integrate this functionality into your projects and automate docs.

    Language: JavaScript
  • thoth2357/Watermark-removal

    A program that helps remove watermarks from a PDF document

    Language: Python
  • x1ao4/doc-merger

    Merge two relatively incomplete documents into one complete document via a Python script

    Language: Python