document-extraction

There are 33 repositories under the document-extraction topic.

  • DocumindHQ/documind

    Open-source platform for extracting structured data from documents using AI.

    Language: JavaScript
  • harishdeivanayagam/rowfill

    Open-source spreadsheet platform for deep research and document processing

    Language: TypeScript
  • Xyntopia/pydoxtools

    Extract information from unstructured data with this library, which uses AI techniques. Compose AI components into customizable pipelines over diverse sources for your projects.

    Language: Python
  • FantDing/Image-document-extract-and-correction

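    Final project for a digital image processing course: extracting and rectifying a document in a photo. The overall approach detects straight lines with the Hough transform, derives the corner points from them, and then applies a projective transformation to rectify the document. The project uses OpenCV only for I/O operations; the convolution, Hough transform, projective transformation, and so on are all implemented by hand.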

    Language: Python
  • alephdata/ingest-file

    Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

    Language: Python
  • ryanmcdonough/lexplore

    Tool to allow extraction of data from legal documents

    Language: Python
  • Tammilore/ai-contract-analyzer

    AI-powered contract analysis tool

    Language: TypeScript
  • dev-luckymhz/AIVisionText-invoice-OCR-typescript

    AIVisionText is an advanced document analysis platform that harnesses the power of artificial intelligence (AI) to revolutionize the way you manage and extract insights from documents.

    Language: TypeScript
  • jamesmcroft/ai-document-data-extraction-evaluation

    This project demonstrates how to evaluate the use of LLMs and SLMs for extracting structured data from documents using .NET

    Language: C#
  • jamesmcroft/document-data-extraction-prompt-flow-evaluation

    This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow

    Language: Bicep
  • jamesmcroft/azure-ai-document-pipeline-python-sample

    Python-based Durable Functions accelerator for building intelligent document processing pipelines with Azure AI Services on Azure Container Apps

    Language: Bicep
  • openaleph/ingest-file

    Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

    Language: Python
  • dashroshan/data-extractor

    Extract and download key-value pairs, tables, and paragraphs from your scanned PDF, JPG, and PNG documents as CSV files.

    Language: JavaScript
  • docuglean-ai/docuglean-ocr

    Intelligent document processing. Extract structured data like JSON, Markdown and HTML from documents using AI.

    Language: Python
  • jamesmcroft/azure-ai-document-pipeline-sample

    .NET sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.

    Language: C#
  • sensible-hq/tutorial-pdf-to-excel

    Converts a PDF file to Excel.

    Language: Python
  • EloiRamos/dolphin-doc-extractor

    AI-powered document intelligence platform that extracts structured data from PDFs, Word docs, and images using Large Language Models and Tesseract OCR.

    Language: Python
  • iLejuxepWaduzd/structured-data-extractor

    🛠️ Extract structured data from messy texts using Chain-of-Thought prompting to improve processing of customer support and technical issues.

    Language: C#
  • jojolebarjos/pdf2htmlEX-webservice

    pdf2htmlEX as a webservice

    Language: Dockerfile
  • mithgx/UnstructData

    UnstructData is a Python toolkit for extracting, transforming, and analyzing unstructured data from diverse sources like text files, logs, and documents. Key features include flexible preprocessing, data cleaning, feature extraction, and extensible utilities—ideal for streamlining messy data workflows.

    Language: Python
  • PMTheTechGuy/document-entity-extractor

    AI-powered document extractor for names, emails, and organizations. License: MIT

    Language: Python
  • subratamondal1/document-extraction

    Document extraction from PDFs and images with OpenCV.

    Language: Python
  • idstack/extractor

    Extractor API for document extraction with the use of DocParser

    Language: Java
  • Ritesh1137/langchain-doc-intelligence-loader

    Customized LangChain Azure Document Intelligence loader for table extraction and summarization

    Language: Python
  • ThinkOrFaust/QuickZonalOCR

    Welcome to QuickZonalOCR! Right now, it's a work in progress, but the goal is to make creating your own key-value document extraction models fairly easy. Think of it as your friendly tool-in-the-making for smart, hassle-free ML model creation. Stay tuned for updates!

    Language: HTML
  • AbdulmalikAlayande/docuguardai

    ClarityDocs is an AI-powered platform that ingests documents (PDF, DOCX), extracts key requirements, and matches them with regulatory frameworks using NLP and vector search (LangChain + Qdrant). Designed to streamline compliance reviews and ensure documents meet industry standards effortlessly.

    Language: Java
  • AI-Enginner/Intelligent-Document-Processing

    AI-powered data extraction tool that converts PDFs, images, and scanned documents into structured data in seconds.

  • AI-Enginner/Invoice-Data-Extraction

    Extract data from invoices with AI. Stop wasting hours on manual invoice data entry. Simply say what data you need and let AI do the work.

  • dataiku/dss-plugin-nlp-extraction

    WORK IN PROGRESS - Dataiku DSS plugin to extract text data from documents

    Language: Makefile
  • hreikin/pdf-toolbox

    Extract content from PDFs and convert it, or create new documents from it, in multiple output formats.

    Language: Python
  • JunoLeong/RAG-DocExtractRAG

    DocExtractRAG is a Retrieval-Augmented Generation (RAG) system that combines large language models (LLMs) with document retrieval to answer questions based on academic or other types of documents. The system uses the Zephyr-7B-beta model for text generation and BAAI/bge-large-en for document embeddings.

    Language: Python
  • NDarayut/Agentic-Document-Intelligent-System

    Agentic document system that lets users chat with their image or PDF documents, with references and bounding boxes pointing to the original text.

    Language: HTML
  • rajsinghparihar/data-detective

    An app that leverages LLMs to process documents, extract relevant information and provide a summary specific to financial data

    Language: Python