data-pipeline
There are 1250 repositories under the data-pipeline topic.
apache/shardingsphere
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
debezium/debezium
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
snowplow/snowplow
The leader in Customer Data Infrastructure
apache/flink-cdc
Flink CDC is a streaming data integration tool
datajuicer/data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
rudderlabs/rudder-server
Privacy- and security-focused Segment alternative, in Golang and React
adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
bruin-data/ingestr
ingestr is a CLI tool to copy data between any databases with a single command seamlessly.
dagu-org/dagu
A powerful, portable, local-first workflow engine for managing complex jobs without pain. Single binary with Web UI. 100% open source. No vendor lock-in. It natively supports running containers and executing commands over SSH. Offline or air-gapped environment ready.
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
reugn/go-streams
A lightweight stream processing library for Go
pydoit/doit
CLI task management & automation tool
bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Multiwoven/multiwoven
🔥🔥🔥 Open source Reverse ETL - alternative to Hightouch and Census.
superlinked/superlinked
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
GoogleCloudPlatform/data-science-on-gcp
Source code accompanying the book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
damklis/DataEngineeringProject
Example end-to-end data engineering project.
datazip-inc/olake
Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB, MySQL and Oracle.
spotify/klio
Smarter data pipelines for audio.
AgnostiqHQ/covalent
Pythonic tool for orchestrating machine-learning/high-performance/quantum-computing workflows in heterogeneous compute environments.
apache/seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
ssp-data/practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
infoslack/awesome-kafka
A list about Apache Kafka
ConduitIO/conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
InfuseAI/piperider
Code review for data in dbt
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
1kbgz/tributary
Streaming reactive and dataflow graphs in Python
pracdata/awesome-open-source-data-engineering
A curated list of open source tools used in analytics platforms and data engineering ecosystem
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
dataflint/spark
Drop-in replacement for Apache Spark UI
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
airscholar/e2e-data-engineering
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.