unstructured-data

There are 182 repositories under the unstructured-data topic.

  • dvc

    iterative/dvc

    🦉 Data Versioning and ML Experiments

    Language:Python14.9k1344.8k1.3k
  • fiftyone

    voxel51/fiftyone

    Refine high-quality datasets and visual AI models

    Language:Python9.9k671.7k665
  • unstract

    Zipstack/unstract

    No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

    Language:Python5.8k4759548
  • neo4j-labs/llm-graph-builder

    Neo4j graph construction from unstructured data using LLMs

    Language:Jupyter Notebook4k26567684
  • towhee-io/towhee

    Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

    Language:Python3.4k29670262
  • instill-core

    instill-ai/instill-core

    🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

    Language:Python2.3k29520121
  • milvus-io/bootcamp

    Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

    Language:Jupyter Notebook2.2k36268646
  • nomic-ai/nomic

    Interact, analyze and structure massive text, image, embedding, audio and video datasets

    Language:Python1.8k2966200
  • dingodb/dingo

    A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.

    Language:Java1.7k16075265
  • tstanislawek/awesome-document-understanding

    A curated list of resources for the Document Understanding (DU) topic

  • lotus-data/lotus

    Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing, that's as simple as writing Pandas code

    Language:Python1.3k1558111
  • yobix-ai/extractous

    Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

    Language:Rust1.2k184658
  • Renumics/spotlight

    Interactively explore unstructured datasets from your dataframe.

    Language:TypeScript1.2k189387
  • amphi-ai/amphi-etl

    Visual Data Preparation and Transformation. Low-Code Python-based ETL.

    Language:JavaScript1.1k1322774
  • lilac

    databricks/lilac

    Curate better data for LLMs

    Language:Python1k1329598
  • JSv4/OpenContracts

    Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

    Language:Python82864576
  • nucliadb

    nuclia/nucliadb

    NucliaDB, The AI Search database for RAG

    Language:Python704181155
  • embedding_studio

    EulerSearch/embedding_studio

    Embedding Studio is a framework that allows you to transform your Vector Database into a feature-rich Search Engine.

    Language:Python382665
  • harishdeivanayagam/rowfill

    Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers

    Language:TypeScript3638414
  • graphlit/graphlit-mcp-server

    Model Context Protocol (MCP) Server for Graphlit Platform

    Language:TypeScript3591021
  • garyelephant/pygrok

    Python implementation of jordansissel's grok regular expression library

    Language:Python283153275
  • fzliu/radient

    Radient turns many data types (not just text) into vectors for similarity search, RAG, regression analysis, and more.

    Language:Python2803111
  • automorphic-ai/trex

    Enforce structured output from LLMs 100% of the time

    Language:Python249308
  • RelevanceAI/relevanceai

    Home of the AI workforce - Multi-agent system, AI agents & tools

    Language:Python22314932
  • velocitybolt/open-extract

    Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.

    Language:Python1773020
  • DerwenAI/strwythura

    Construct knowledge graphs from unstructured data sources, use graph algorithms for enhanced GraphRAG with a DSPy-based chat bot locally, and curate semantics for optimizing AI app outcomes within a specific domain.

    Language:Jupyter Notebook1727321
  • wangxb96/RAG-QA-Generator

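    RAG-QA-Generator is an automated knowledge-base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of the RAG system's knowledge base.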

    Language:Python1501920
  • palimpzest

    mitdbg/palimpzest

    A System for Optimized Semantic Computation

    Language:Python14456425
  • CambioML/any-parser

    Accurate, private and configurable document retrieval LLM

    Language:Python1303011
  • jostmey/dkm

    Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features

    Language:HTML94406
  • BartJongejan/Bracmat

    Programming language for symbolic computation with unusual combination of pattern matching features: Tree patterns, associative patterns and expressions embedded in patterns.

    Language:C475125
  • IBM/pixiedust-facebook-analysis

    A Jupyter notebook that uses the Watson Visual Recognition and Natural Language Understanding services to enrich Facebook Analytics and uses Cognos Dashboard Embedded to explore and visualize the results in Watson Studio

    Language:Jupyter Notebook44152264
  • ScrapeGraphAI/Scrapontologies

    Python library for Entities, relationships and schemas extraction from documents

    Language:Python422163
  • instill-ai/console

    📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core

    Language:TypeScript3911010
  • adansons/base

    Adansons Base is a data programming tool for error-analysis of training results. It organizes metadata of unstructured data and creates and organizes datasets. It makes dataset creation more effective and helps to find low-quality data by using the training results and improves AI performance.

    Language:Jupyter Notebook282523
  • instill-ai/pipeline-backend

    ⇋ A REST/gRPC server for Instill VDP API service

    Language:Go2712021