data-pipelines

There are 296 repositories under the data-pipelines topic.

  • pathwaycom/pathway

    Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

    Language: Python
  • apache/airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Language: Python
  • dagster-io/dagster

    An orchestration platform for the development, production, and observation of data assets.

    Language: Python
  • apache/dolphinscheduler

    Apache DolphinScheduler is the modern data orchestration platform. Agile to create high-performance workflows with low code.

    Language: Java
  • Unstructured-IO/unstructured

    Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

    Language: HTML
  • mage-ai/mage-ai

    🧙 Build, run, and manage data pipelines for integrating and transforming data.

    Language: Python
  • infinyon/fluvio

    🦀 event stream processing for developers to collect and transform data in motion to power responsive data intensive applications.

    Language: Rust
  • StructuredLabs/preswald

    Preswald is a WASM packager for Python-based interactive data apps: bundle full, complex data workflows, particularly visualizations, into single files that run completely in-browser using Pyodide, DuckDB, Pandas, Plotly, Matplotlib, etc. Build dashboards, reports, and notebooks that run offline, load fast, and share like a document.

    Language: Python
  • orchest/orchest

    Build data pipelines, the easy way 🛠️

    Language: TypeScript
  • Netflix/maestro

    Maestro: Netflix’s Workflow Orchestrator

    Language: Java
  • ucbepic/docetl

    A system for agentic LLM-powered data processing and ETL

    Language: Python
  • meltano/meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

    Language: Python
  • elementary-data/elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

    Language: HTML
  • data-engineering-community/data-engineering-wiki

    The best place to learn data engineering. Built and maintained by the data engineering community.

    Language: CSS
  • feldera/feldera

    The Feldera Incremental Computation Engine

    Language: Rust
  • combust/mleap

    MLeap: Deploy ML Pipelines to Production

    Language: Scala
  • pyper-dev/pyper

    Concurrent Python made simple

    Language: Python
  • opendatadiscovery/odd-platform

    The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

    Language: Java
  • fmind/mlops-python-package

    Kickstart your MLOps initiative with a flexible, robust, and productive Python package.

    Language: Jupyter Notebook
  • OpenDCAI/DataFlow

    Easy data preparation with the latest LLM-based operators and pipelines.

    Language: Python
  • yobix-ai/extractous

    Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

    Language: Rust
  • amphi-ai/amphi-etl

    Visual Data Preparation and Transformation. Low-Code Python-based ETL.

    Language: JavaScript
  • bruin-data/bruin

    Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.

    Language: Go
  • dataform-co/dataform

    Dataform is a framework for managing SQL-based data operations in BigQuery.

    Language: TypeScript
  • raystack/optimus

    Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

    Language: Go
  • artie-labs/transfer

    Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift, Databricks) in real-time.

    Language: Go
  • elementary-data/dbt-data-reliability

    dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

    Language: Python
  • vmware/versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

    Language: Python
  • gabledata/recap

    Work with your web service, database, and streaming schemas in a single format.

    Language: Python
  • dataflint/spark

    Drop-in replacement for Apache Spark UI

    Language: TypeScript
  • tuva-health/tuva

    Main repo including core data model, data marts, data quality tests, and terminology sets.

    Language: HTML
  • dataplane-app/dataplane

    Dataplane is an Airflow-inspired unified data platform with additional data mesh and RPA capabilities to automate, schedule, and design data pipelines and workflows. Dataplane is written in Golang with a React front end.

    Language: JavaScript
  • terrytangyuan/awesome-kubeflow

    A curated list of awesome projects and resources related to Kubeflow (a CNCF incubating project)

  • kevin-hanselman/dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

    Language: Go
  • datajoint/datajoint-python

    Relational data pipelines for the science lab

    Language: Python
  • koolreport/core

    An open-source PHP reporting framework that helps you write data reports and build dashboards in PHP. Works with all PHP versions from 5.6 to the latest 8.0. Fully compatible with MVC frameworks such as Laravel, CodeIgniter, and Symfony.

    Language: PHP