unstructured-data
There are 203 repositories under the unstructured-data topic.
iterative/dvc
🦉 Data Versioning and ML Experiments
voxel51/fiftyone
Refine high-quality datasets and visual AI models
Zipstack/unstract
No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
neo4j-labs/llm-graph-builder
Neo4j graph construction from unstructured data using LLMs
towhee-io/towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
ucbepic/docetl
A system for agentic LLM-powered data processing and ETL
instill-ai/instill-core
🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications
milvus-io/bootcamp
Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.
nomic-ai/nomic
Interact, analyze and structure massive text, image, embedding, audio and video datasets
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
dingodb/dingo
A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.
yobix-ai/extractous
Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
tstanislawek/awesome-document-understanding
A curated list of resources for the Document Understanding (DU) topic
emcf/thepipe
Get clean data from tricky documents, powered by vision-language models ⚡
lotus-data/lotus
AI-Powered Data Processing: Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing, that's as simple as writing Pandas code
amphi-ai/amphi-etl
Visual Data Preparation and Transformation. Low-Code Python-based ETL.
Renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
databricks/lilac
Curate better data for LLMs
Open-Source-Legal/OpenContracts
Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!
nuclia/nucliadb
NucliaDB, The AI Search database for RAG
EulerSearch/embedding_studio
Embedding Studio is a framework that allows you to transform your Vector Database into a feature-rich Search Engine.
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
harishdeivanayagam/rowfill
Open-source spreadsheets platform for deep research and document processing
garyelephant/pygrok
Python implementation of jordansissel's grok regular expression library
fzliu/radient
Radient turns many data types (not just text) into vectors for similarity search, RAG, regression analysis, and more.
RelevanceAI/relevanceai
Home of the AI workforce - Multi-agent system, AI agents & tools
automorphic-ai/trex
Enforce structured output from LLMs 100% of the time
HKUSTDial/awesome-data-agents
Continuously updated paper list on advancements in Data Agents. Companion repo to our paper "A Survey of Data Agents: Emerging Paradigm or Overstated Hype?"
wangxb96/RAG-QA-Generator
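RAG-QA-Generator is an automated knowledge base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of a RAG system's knowledge base.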
DerwenAI/strwythura
Construct knowledge graphs from unstructured data sources, use graph algorithms for enhanced GraphRAG with a DSPy-based chat bot locally, and curate semantics for optimizing AI app outcomes within a specific domain.
velocitybolt/open-extract
Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.
mitdbg/palimpzest
A System for Optimized Semantic Computation
CambioML/any-parser
Accurate, private and configurable document retrieval LLM
SolidLao/GPTuner
GPTuner is a manual-reading database tuning system that leverages domain knowledge automatically and extensively to enhance the knob tuning process.
jostmey/dkm
Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features