unstructured-data

There are 156 repositories under the unstructured-data topic.

  • iterative/dvc

    🦉 Data Versioning and ML Experiments

    Language: Python
  • voxel51/fiftyone

    Refine high-quality datasets and visual AI models

    Language: Python
  • Zipstack/unstract

    No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

    Language: Python
  • towhee-io/towhee

    Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.

    Language: Python
  • neo4j-labs/llm-graph-builder

    Neo4j graph construction from unstructured data using LLMs

    Language: Jupyter Notebook
  • instill-ai/instill-core

    🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

    Language: Makefile
  • milvus-io/bootcamp

    Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

    Language: HTML
  • dingodb/dingo

    A multi-modal vector database that supports upserts and vector queries using unified SQL (MySQL-Compatible) on structured and unstructured data, while meeting the requirements of high concurrency and ultra-low latency.

    Language: Java
  • nomic-ai/nomic

    Interact, analyze and structure massive text, image, embedding, audio and video datasets

    Language: Python
  • tstanislawek/awesome-document-understanding

    A curated list of resources for the Document Understanding (DU) topic

  • Renumics/spotlight

    Interactively explore unstructured datasets from your dataframe.

    Language: TypeScript
  • databricks/lilac

    Curate better data for LLMs

    Language: Python
  • amphi-ai/amphi-etl

    Visual Data Transformation and Data Preparation. Low-Code Python-based ETL.

    Language: TypeScript
  • nuclia/nucliadb

    NucliaDB, The AI Search database for RAG

    Language: Python
  • yobix-ai/extractous

    Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

    Language: Rust
  • EulerSearch/embedding_studio

    Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.

    Language: Python
  • garyelephant/pygrok

    Python implementation of jordansissel's grok regular expression library

    Language: Python
  • fzliu/radient

    Radient turns many data types (not just text) into vectors for similarity search, RAG, regression analysis, and more.

    Language: Python
  • automorphic-ai/trex

    Enforce structured output from LLMs 100% of the time

    Language: Python
  • RelevanceAI/relevanceai

    Home of the AI workforce - Multi-agent system, AI agents & tools

    Language: Python
  • marly-ai/marly

    Context-aware structured outputs. Search your documents or the web for specific data and get it back in JSON or Markdown.

    Language: Python
  • CambioML/any-parser

    Accurate, private and configurable document retrieval LLM

    Language: Python
  • DerwenAI/strwythura

    How to construct knowledge graphs from unstructured data sources

    Language: Jupyter Notebook
  • jostmey/dkm

    Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features

    Language: HTML
  • wangxb96/RAG-QA-Generator

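    RAG-QA-Generator is an automated knowledge-base construction and management tool for retrieval-augmented generation (RAG) systems. It reads document data, uses large language models to generate high-quality question-answer (QA) pairs, and inserts them into a database, automating the construction and management of a RAG system's knowledge base.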

    Language: Python
  • BartJongejan/Bracmat

    Programming language for symbolic computation with unusual combination of pattern matching features: Tree patterns, associative patterns and expressions embedded in patterns.

    Language: C
  • IBM/pixiedust-facebook-analysis

    A Jupyter notebook that uses the Watson Visual Recognition and Natural Language Understanding services to enrich Facebook Analytics and uses Cognos Dashboard Embedded to explore and visualize the results in Watson Studio

    Language: Jupyter Notebook
  • instill-ai/console

    📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core

    Language: TypeScript
  • ScrapeGraphAI/Scrapontologies

    Python library for Entities, relationships and schemas extraction from documents

    Language: Python
  • adansons/base

    Adansons Base is a data programming tool for error-analysis of training results. It organizes metadata of unstructured data and creates and organizes datasets. It makes dataset creation more effective and helps to find low-quality data by using the training results and improves AI performance.

    Language: Jupyter Notebook
  • chaitjo/knowledge-graphs

    Building Knowledge Graphs from Unstructured Text

    Language: Jupyter Notebook
  • instill-ai/pipeline-backend

    ⇋ A REST/gRPC server for Instill VDP API service

    Language: Go
  • instill-ai/cli

    ⌨️ Instill CLI for 🔮 Instill Core: https://github.com/instill-ai/instill-core

    Language: Go
  • instill-ai/deprecated-model

    ⚗️ Instill Model contains components for AI model orchestration

    Language: Makefile
  • Zipstack/unstract-adapters

    Unstract's interface to LLMs, Embeddings and VectorDBs.

    Language: Python
  • osllmai/inDox

    Indox is an advanced search and retrieval technique that efficiently extracts data from diverse document types, including PDFs and HTML, using online or offline large language models such as OpenAI, Hugging Face, etc.

    Language: Jupyter Notebook