unstructured-data

There are 203 repositories under the unstructured-data topic.

  • iterative/dvc

    🦉 Data Versioning and ML Experiments

    Language:Python15.1k1304.8k1.3k
  • voxel51/fiftyone

    Refine high-quality datasets and visual AI models

    Language:Python10k651.7k680
  • Zipstack/unstract

    No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

    Language:Python5.9k4363564
  • neo4j-labs/llm-graph-builder

    Neo4j graph construction from unstructured data using LLMs

    Language:Jupyter Notebook4.1k31639732
  • towhee-io/towhee

    Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

    Language:Python3.4k28670263
  • ucbepic/docetl

    A system for agentic LLM-powered data processing and ETL

    Language:Python3k26134318
  • instill-ai/instill-core

    🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

    Language:Python2.3k25533122
  • milvus-io/bootcamp

    Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

    Language:Jupyter Notebook2.3k35269657
  • nomic-ai/nomic

    Interact, analyze and structure massive text, image, embedding, audio and video datasets

    Language:Python1.9k2767203
  • NanoNets/docext

    An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

    Language:Python1.8k1936134
  • shcherbak-ai/contextgem

    ContextGem: Effortless LLM extraction from documents

    Language:Python1.7k1317135
  • dingodb/dingo

    A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.

    Language:Java1.7k15775264
  • yobix-ai/extractous

    Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

    Language:Rust1.6k174777
  • tstanislawek/awesome-document-understanding

    A curated list of resources on the topic of Document Understanding (DU)

  • emcf/thepipe

    Get clean data from tricky documents, powered by vision-language models ⚡

    Language:Python1.5k122694
  • lotus-data/lotus

    AI-Powered Data Processing: Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing, that's as simple as writing Pandas code

    Language:Python1.3k1563115
  • amphi-ai/amphi-etl

    Visual Data Preparation and Transformation. Low-Code Python-based ETL.

    Language:TypeScript1.3k1425495
  • Renumics/spotlight

    Interactively explore unstructured datasets from your dataframe.

    Language:TypeScript1.2k159485
  • databricks/lilac

    Curate better data for LLMs

    Language:Python1.1k11295102
  • Open-Source-Legal/OpenContracts

    Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

    Language:TypeScript9525145101
  • nuclia/nucliadb

    NucliaDB, The AI Search database for RAG

    Language:Python709181159
  • EulerSearch/embedding_studio

    Embedding Studio is a framework which allows you to transform your Vector Database into a feature-rich Search Engine.

    Language:Python381665
  • graphlit/graphlit-mcp-server

    Model Context Protocol (MCP) Server for Graphlit Platform

    Language:TypeScript3693149
  • harishdeivanayagam/rowfill

    Open-source spreadsheets platform for deep research and document processing

    Language:TypeScript3648421
  • garyelephant/pygrok

    Python implementation of jordansissel's grok regular expression library

    Language:Python281153275
  • fzliu/radient

    Radient turns many data types (not just text) into vectors for similarity search, RAG, regression analysis, and more.

    Language:Python2803111
  • RelevanceAI/relevanceai

    Home of the AI workforce - Multi-agent system, AI agents & tools

    Language:Python258121044
  • automorphic-ai/trex

    Enforce structured output from LLMs 100% of the time

    Language:Python248308
  • HKUSTDial/awesome-data-agents

    Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"

    Language:Python24511
  • wangxb96/RAG-QA-Generator

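    RAG-QA-Generator is an automated knowledge base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of a RAG system's knowledge base.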

    Language:Python24321128
  • DerwenAI/strwythura

    Construct knowledge graphs from unstructured data sources, use graph algorithms for enhanced GraphRAG with a DSPy-based chat bot locally, and curate semantics for optimizing AI app outcomes within a specific domain.

    Language:Jupyter Notebook1797320
  • velocitybolt/open-extract

    Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.

    Language:Python1793021
  • mitdbg/palimpzest

    A System for Optimized Semantic Computation

    Language:Python16156530
  • CambioML/any-parser

    Accurate, private and configurable document retrieval LLM

    Language:Python1303014
  • SolidLao/GPTuner

    GPTuner is a manual-reading database tuning system that automatically and extensively leverages domain knowledge to enhance the knob tuning process.

    Language:Python1194522
  • jostmey/dkm

    Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features

    Language:HTML94407