data-pipelines

There are 201 repositories under the data-pipelines topic.

  • apache/airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Language: Python 34.9k7549.1k13.7k
  • apache/dolphinscheduler

    Apache DolphinScheduler is the modern data orchestration platform. Create high-performance workflows quickly with low-code.

    Language: Java 12.3k3317.3k4.5k
  • dagster-io/dagster

    An orchestration platform for the development, production, and observation of data assets.

    Language: Python 10.5k1167k1.3k
  • infiniflow/ragflow

    RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.

    Language: Python 8.7k52493802
  • mage-ai/mage-ai

    🧙 Build, run, and manage data pipelines for integrating and transforming data.

    Language: Python 7.2k62682660
  • Unstructured-IO/unstructured

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    Language: HTML 7k49954530
  • orchest/orchest

    Build data pipelines, the easy way 🛠️

    Language: TypeScript 4k43480250
  • infinyon/fluvio

    Lean and mean distributed stream processing system written in Rust and WebAssembly.

    Language: Rust 2.7k351.4k197
  • pathwaycom/pathway

    Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

    Language: Python 2.3k205387
  • elementary-data/elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

    Language: HTML 1.8k9491146
  • meltano/meltano

    Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

    Language: Python 1.6k136.7k145
  • combust/mleap

    MLeap: Deploy ML Pipelines to Production

    Language: Scala 1.5k69471312
  • SciPhi-AI/R2R

    The framework for fast development and deployment of RAG systems.

    Language: HTML 1.3k1035112
  • data-engineering-community/data-engineering-wiki

    The best place to learn data engineering. Built and maintained by the data engineering community.

    Language: CSS 1.2k2424112
  • opendatadiscovery/odd-platform

    The first open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.

    Language: Java 1.1k1862693
  • dataform-co/dataform

    Dataform is a framework for managing SQL-based data operations in BigQuery.

    Language: TypeScript 79620474148
  • raystack/optimus

    Optimus is an easy-to-use, reliable, and performant workflow orchestrator for data transformation, data modeling, pipelines, and data quality management.

    Language: Go 73718268153
  • artie-labs/transfer

    Database replication platform that leverages change data capture. Stream production data from databases to your data warehouse (Snowflake, BigQuery, Redshift) in real-time.

    Language: Go 54493524
  • vmware/versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

    Language: Python 4131694754
  • fmind/mlops-python-package

    Kickstart your MLOps initiative with a flexible, robust, and productive Python package.

    Language: Jupyter Notebook 3648445
  • elementary-data/dbt-data-reliability

    dbt package that is part of Elementary, the dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

    Language: Python 34952078
  • recap-build/recap

    Work with your web service, database, and streaming schemas in a single format.

    Language: Python 3061013324
  • dataplane-app/dataplane

    Dataplane is an Airflow-inspired unified data platform with additional data mesh and RPA capabilities to automate, schedule, and design data pipelines and workflows. Dataplane is written in Golang with a React front end.

    Language: JavaScript 18863930
  • terrytangyuan/awesome-kubeflow

    A curated list of awesome projects and resources related to Kubeflow (a CNCF incubating project)

  • kevin-hanselman/dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

    Language: Go 1698646
  • datajoint/datajoint-python

    Relational data pipelines for the science lab

    Language: Python 1631660183
  • tuva-health/tuva

    Main repo including core data model, data marts, reference data, terminology, and the clinical concept library

  • koolreport/core

    An open-source PHP reporting framework that helps you write perfect data reports or construct awesome dashboards in PHP. Works with all PHP versions from 5.6 to the latest 8.0. Fully compatible with all kinds of MVC frameworks like Laravel, CodeIgniter, Symfony.

    Language: PHP 15316534
  • GoogleCloudPlatform/public-datasets-pipelines

    Cloud-native, data onboarding architecture for Google Cloud Datasets

    Language: Python 140154062
  • dataflint/spark

    Performance Observability for Apache Spark

    Language: TypeScript 1291510
  • patterns-app/patterns-devkit

    Data pipelines from re-usable components

    Language: Python 1064415
  • smart-data-lake/smart-data-lake

    Smart Automation Tool for building modern Data Lakes and Data Pipelines

    Language: Scala 991331120
  • shravan-kuchkula/udacity-data-eng-proj-1

    Developed a data pipeline to automate data warehouse ETL by building custom Airflow operators that handle the extraction, transformation, validation, and loading of data from S3 -> Redshift -> S3.

    Language: Python 888058
  • confluentinc/learn-kafka-courses

    Learn the basics of Apache Kafka® from leaders in the Kafka community with these video courses covering the Kafka ecosystem and hands-on exercises.

    Language: Shell 84147669
  • beneath-hq/beneath

    Beneath is a serverless real-time data platform ⚡️

    Language: Go 81959
  • DataCater/datacater

    The developer-friendly ETL platform for transforming data in real-time. Based on Apache Kafka® and Kubernetes®.

    Language: JavaScript 816833