data-pipeline
There are 801 repositories under the data-pipeline topic.
apache/shardingsphere
Empowering Data Intelligence with Distributed SQL for Sharding, Scalability, and Security Across All Databases.
airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
snowplow/snowplow
The leader in Customer Data Infrastructure
apache/flink-cdc
Flink CDC is a streaming data integration tool
rudderlabs/rudder-server
Privacy- and security-focused Segment alternative, in Golang and React
adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
bruin-data/ingestr
ingestr is a CLI tool to seamlessly copy data between any databases with a single command.
cocoindex-io/cocoindex
Data transformation framework for AI. Ultra performant, with incremental processing.
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
reugn/go-streams
A lightweight stream processing library for Go
pydoit/doit
CLI task management & automation tool
bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Multiwoven/multiwoven
🔥🔥🔥 Open source Reverse ETL - an alternative to Hightouch and Census.
GoogleCloudPlatform/data-science-on-gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
superlinked/superlinked
Superlinked is a Python framework for AI Engineers building high-performance search & recommendation applications that combine structured and unstructured data.
damklis/DataEngineeringProject
Example end to end data engineering project.
datazip-inc/olake
Fastest open-source tool for replicating databases to a data lake in open table formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supports Postgres, MongoDB, MySQL and Oracle.
spotify/klio
Smarter data pipelines for audio.
AgnostiqHQ/covalent
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
apache/seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
ssp-data/practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
infoslack/awesome-kafka
A list about Apache Kafka
ConduitIO/conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
InfuseAI/piperider
Code review for data in dbt
remyxai/VQASynth
Compose multimodal datasets 🎹
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
streamlet-dev/tributary
Streaming reactive and dataflow graphs in Python
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
pracdata/awesome-open-source-data-engineering
A curated list of open source tools used in analytics platforms and data engineering ecosystem
josephmachado/efficient_data_processing_spark
Code for "Efficient Data Processing in Spark" Course
dataflint/spark
Drop-in replacement for Apache Spark UI
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
pipeline-tools/gusty
Making DAG construction easier
ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines