data-pipeline
There are 630 repositories under the data-pipeline topic.
airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
kestra-io/kestra
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
snowplow/snowplow
The leader in Next-Generation Customer Data Infrastructure
apache/flink-cdc
Flink CDC is a streaming data integration tool
rudderlabs/rudder-server
Privacy- and security-focused Segment alternative, in Golang and React
adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
whylabs/whylogs
An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈
bruin-data/ingestr
ingestr is a CLI tool to copy data between any databases with a single command seamlessly.
pydoit/doit
Task management & automation tool
elementary-data/elementary
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
reugn/go-streams
A lightweight stream processing library for Go
bytedance/bitsail
BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data records every day.
GoogleCloudPlatform/data-science-on-gcp
Source code accompanying the book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
damklis/DataEngineeringProject
Example end-to-end data engineering project.
spotify/klio
Smarter data pipelines for audio.
AgnostiqHQ/covalent
Pythonic tool for orchestrating machine-learning/high performance/quantum-computing workflows in heterogeneous compute environments.
Multiwoven/multiwoven
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Customer Data Platform (CDP)
infoslack/awesome-kafka
A list about Apache Kafka
InfuseAI/piperider
Code review for data in dbt
sspaeti-com/practical-data-engineering
Practical Data Engineering: A Hands-On Real-Estate Project Guide
apache/seatunnel-web
SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).
streamlet-dev/tributary
Streaming reactive and dataflow graphs in Python
msamogh/nonechucks
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
ConduitIO/conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
feldera/feldera
Feldera Continuous Analytics Platform
ubisoft/mobydq
🐳 Tool to automate data quality checks on data pipelines
pipeline-tools/gusty
Making DAG construction easier
scicloj/scicloj.ml
A Clojure machine learning library
olirice/flupy
Fluent data pipelines for Python and your shell
digitalghost-dev/premier-league
A Data Engineering project. Repository for backend infrastructure and Streamlit app files for a Premier League Dashboard.
unnati-xyz/scalable-data-science-platform
Content for architecting a data science platform for products using Luigi, Spark & Flask.
aeksco/aws-pdf-textract-pipeline
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript
tvdboom/ATOM
Automated Tool for Optimized Modelling