unstructured-data

There are 203 repositories under the unstructured-data topic.

iterative/dvc
🦉 Data Versioning and ML Experiments
Language: Python 15.1k 130 4.8k 1.3k
voxel51/fiftyone
Refine high-quality datasets and visual AI models
Language: Python 10k 65 1.7k 680
Zipstack/unstract
No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
Language: Python 5.9k 43 63564
neo4j-labs/llm-graph-builder
Neo4j graph construction from unstructured data using LLMs
Language: Jupyter Notebook 4.1k 31 639732
towhee-io/towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
Language: Python 3.4k 28 670263
ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
Language: Python 3k 26 134318
instill-ai/instill-core
🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications
Language: Python 2.3k 25 533122
milvus-io/bootcamp
Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.
Language: Jupyter Notebook 2.3k 35 269657
nomic-ai/nomic
Interact, analyze and structure massive text, image, embedding, audio and video datasets
Language: Python 1.9k 27 67203
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
Language: Python 1.8k 19 36134
shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
Language: Python 1.7k 13 17135
dingodb/dingo
A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.
Language: Java 1.7k 157 75264
yobix-ai/extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
Language: Rust 1.6k 17 4777
tstanislawek/awesome-document-understanding
A curated list of resources on the topic of Document Understanding (DU)
1.5k 36 2166
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
Language: Python 1.5k 12 2694
lotus-data/lotus
AI-Powered Data Processing: Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing, that's as simple as writing Pandas code
Language: Python 1.3k 15 63115
amphi-ai/amphi-etl
Visual Data Preparation and Transformation. Low-Code Python-based ETL.
Language: TypeScript 1.3k 14 25495
Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
Language: TypeScript 1.2k 15 9485
databricks/lilac
Curate better data for LLMs
Language: Python 1.1k 11 295102
Open-Source-Legal/OpenContracts
Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!
Language: TypeScript 952 5 145101
nuclia/nucliadb
NucliaDB, The AI Search database for RAG
Language: Python 709 18 1159
EulerSearch/embedding_studio
Embedding Studio is a framework that allows you to transform your Vector Database into a feature-rich Search Engine.
Language: Python 381 6 65
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript 369 3 149
harishdeivanayagam/rowfill
Open-source spreadsheets platform for deep research and document processing
Language: TypeScript 364 8 421
garyelephant/pygrok
Python implementation of jordansissel's grok regular expression library
Language: Python 281 15 3275
fzliu/radient
Radient turns many data types (not just text) into vectors for similarity search, RAG, regression analysis, and more.
Language: Python 280 3 111
RelevanceAI/relevanceai
Home of the AI workforce - Multi-agent system, AI agents & tools
Language: Python 258 12 1044
automorphic-ai/trex
Enforce structured output from LLMs 100% of the time
Language: Python 248 3 08
HKUSTDial/awesome-data-agents
Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"
Language: Python 24511
wangxb96/RAG-QA-Generator
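RAG-QA-Generator is an automated knowledge base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of a RAG system's knowledge base.
Language: Python 243 2 1128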
DerwenAI/strwythura
Construct knowledge graphs from unstructured data sources, use graph algorithms for enhanced GraphRAG with a DSPy-based chat bot locally, and curate semantics for optimizing AI app outcomes within a specific domain.
Language: Jupyter Notebook 179 7 320
velocitybolt/open-extract
Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.
Language: Python 179 3 021
mitdbg/palimpzest
A System for Optimized Semantic Computation
Language: Python 161 5 6530
CambioML/any-parser
Accurate, private and configurable document retrieval LLM
Language: Python 130 3 014
SolidLao/GPTuner
GPTuner is a manual-reading database tuning system that leverages domain knowledge automatically and extensively to enhance the knob tuning process.
Language: Python 119 4 522
jostmey/dkm
Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features
Language: HTML 94 4 07
