data-pipeline

There are 1,250 repositories under the data-pipeline topic.

apache/shardingsphere
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
Language: Java 20.5k 974 11.7k 6.9k
airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Language: Python 20k 186 15.5k 4.9k
debezium/debezium
Change data capture for a variety of databases. Please log issues at https://issues.redhat.com/browse/DBZ.
Language: Java 12.1k 212 22.8k
snowplow/snowplow
The leader in Customer Data Infrastructure
Language: Scala 7k 261 4k 1.2k
apache/flink-cdc
Flink CDC is a streaming data integration tool
Language: Java 6.3k 136 1.7k 2.1k
datajuicer/data-juicer
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Language: Python 5.5k 18 301288
rudderlabs/rudder-server
Privacy and Security focused Segment-alternative, in Golang and React
Language: Go 4.3k 60 1469
adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
3.9k 102 2561
superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
Language: Go 3.4k 32 264230
bruin-data/ingestr
ingestr is a CLI tool to copy data between any databases with a single command seamlessly.
Language: Python 3.3k 19 38113
dagu-org/dagu
A powerful, portable, local-first workflow engine for managing complex jobs without pain. Single binary with Web UI. 100% open source. No vendor lock-in. It natively supports running containers and executing commands over SSH. Offline or air-gapped environment ready.
Language: Go 2.8k 21 535210
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
Language: Jupyter Notebook 2.8k 31 433134
elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
Language: HTML 2.2k 11 657202
reugn/go-streams
A lightweight stream processing library for Go
Language: Go 2.1k 25 42172
pydoit/doit
CLI task management & automation tool
Language: Python 2k 47 306185
bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Language: Java 1.7k 58 212333
Multiwoven/multiwoven
🔥🔥🔥 Open source Reverse ETL, an alternative to Hightouch and Census.
Language: Ruby 1.6k 14 4582
superlinked/superlinked
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
Language: Jupyter Notebook 1.4k 28 57110
GoogleCloudPlatform/data-science-on-gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Language: Jupyter Notebook 1.4k 98 104733
damklis/DataEngineeringProject
Example end to end data engineering project.
Language: Python 1.3k 16 6269
datazip-inc/olake
Fastest open-source tool for replicating databases to a data lake in open table formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supports Postgres, MongoDB, MySQL and Oracle.
Language: Go 1.2k 8 192139
spotify/klio
Smarter data pipelines for audio.
Language: Python 863 17 654
AgnostiqHQ/covalent
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
Language: Python 845 23 815106
apache/seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
Language: Java 748 24 0327
ssp-data/practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
Language: Jupyter Notebook 712 10 4120
infoslack/awesome-kafka
A list about Apache Kafka
583 29 1165
ConduitIO/conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Language: Go 565 12 59855
InfuseAI/piperider
Code review for data in dbt
Language: Python 492 11 7524
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Language: Python 474 10 14159
1kbgz/tributary
Streaming reactive and dataflow graphs in Python
Language: Python 458 14 8338
pracdata/awesome-open-source-data-engineering
A curated list of open source tools used in analytics platforms and data engineering ecosystem
392 23 141
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Language: Python 378 2 3227
josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
Language: Python 346 1 372
dataflint/spark
Drop-in replacement for Apache Spark UI
Language: TypeScript 341 5 1641
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
Language: JavaScript 288 11 2928
airscholar/e2e-data-engineering
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
Language: Python 287 5 8132
