data-pipeline

There are 1,250 repositories under the data-pipeline topic.

  • apache/shardingsphere

    Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.

    Language: Java
  • airbytehq/airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Language: Python
  • debezium/debezium

    Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.

    Language: Java
  • snowplow/snowplow

    The leader in Customer Data Infrastructure

    Language: Scala
  • apache/flink-cdc

    Flink CDC is a streaming data integration tool

    Language: Java
  • datajuicer/data-juicer

    Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

    Language: Python
  • rudderlabs/rudder-server

    Privacy and Security focused Segment-alternative, in Golang and React

    Language: Go
  • adilkhash/Data-Engineering-HowTo

    A list of useful resources to learn Data Engineering from scratch

  • superstreamlabs/memphis

    Memphis.dev is a highly scalable and effortless data streaming platform

    Language: Go
  • bruin-data/ingestr

    ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

    Language: Python
  • dagu-org/dagu

    A powerful, portable, local-first workflow engine for managing complex jobs without pain. Single binary with Web UI. 100% open source. No vendor lock-in. It natively supports running containers and executing commands over SSH. Offline or air-gapped environment ready.

    Language: Go
  • whylabs/whylogs

    An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

    Language: Jupyter Notebook
  • elementary-data/elementary

    The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.

    Language: HTML
  • reugn/go-streams

    A lightweight stream processing library for Go

    Language: Go
  • pydoit/doit

    CLI task management & automation tool

    Language: Python
  • bytedance/bitsail

    BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

    Language: Java
  • Multiwoven/multiwoven

    🔥🔥🔥 Open source Reverse ETL - alternative to hightouch and census.

    Language: Ruby
  • superlinked/superlinked

    Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.

    Language: Jupyter Notebook
  • GoogleCloudPlatform/data-science-on-gcp

    Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

    Language: Jupyter Notebook
  • damklis/DataEngineeringProject

    Example end to end data engineering project.

    Language: Python
  • datazip-inc/olake

    Fastest open-source tool for replicating databases to a data lake in open table formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supports Postgres, MongoDB, MySQL and Oracle.

    Language: Go
  • spotify/klio

    Smarter data pipelines for audio.

    Language: Python
  • AgnostiqHQ/covalent

    Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.

    Language: Python
  • apache/seatunnel-web

    SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

    Language: Java
  • ssp-data/practical-data-engineering

    Practical Data Engineering: A Hands-On Real-Estate Project Guide

    Language: Jupyter Notebook
  • infoslack/awesome-kafka

    A list about Apache Kafka

  • ConduitIO/conduit

    Conduit streams data between data stores. Kafka Connect replacement. No JVM required.

    Language: Go
  • InfuseAI/piperider

    Code review for data in dbt

    Language: Python
  • sparkfish/augraphy

    Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

    Language: Python
  • 1kbgz/tributary

    Streaming reactive and dataflow graphs in Python

    Language: Python
  • pracdata/awesome-open-source-data-engineering

    A curated list of open source tools used in analytics platforms and data engineering ecosystem

  • msamogh/nonechucks

    Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

    Language: Python
  • josephmachado/efficient_data_processing_spark

    Code for "Efficient Data Processing in Spark" Course

    Language: Python
  • dataflint/spark

    Drop-in replacement for Apache Spark UI

    Language: TypeScript
  • cuebook/cuelake

    Use SQL to build ELT pipelines on a data lakehouse.

    Language: JavaScript
  • airscholar/e2e-data-engineering

    An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

    Language: Python