data-lake
There are 259 repositories under the data-lake topic.
treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
dlt-hub/dlt
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
san089/Udacity-Data-Engineering-Projects
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
san089/goodreads_etl_pipeline
An end-to-end GoodReads data pipeline for building a Data Lake, Data Warehouse and Analytics Platform.
Teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
Canner/vulcan-sql
Data API Framework for AI Agents and Data Apps
uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
awslabs/aws-serverless-data-lake-framework
Enterprise-grade, production-hardened, serverless data lake on AWS
kaiwaehner/hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
lakekeeper/lakekeeper
Lakekeeper: A Rust native Iceberg REST Catalog
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Azure/usql
U-SQL Examples and Issue Tracking
maxi-k/btrblocks
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
garystafford/tickit-data-lake-demo
Resources for video demonstrations and blog posts related to DataOps on AWS
Canner/wren-engine
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Azure/AzureDataLake
Samples and Docs for Azure Data Lake Store and Analytics
pixelsdb/pixels
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
LearningJournal/Spark-Streaming-In-Python
Apache Spark 3 - Structured Streaming Course Material
smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
datopian/r2-bucket-uploader
Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size.
GitDataAI/jzfs
A Git-like Version Control File System for AI & Data Product Management.
LearningJournal/SparkProgrammingInScala
Apache Spark Course Material
aws-samples/aws-dbs-refarch-datalake
Reference Architectures for Datalakes on AWS
Jayvardhan-Reddy/Azure-Certification-DP-200
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
camunda-community-hub/zeeqs
GraphQL API for Zeebe data
datamindedbe/lighthouse
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
dominikhei/Local-Data-LakeHouse
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, MinIO, Trino and a Hive Metastore. Can be used for local testing.
OElesin/querypal
Web UI for Amazon Athena
MatsMoll/aligned
The dbt of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
KentHsu/Udacity-Data-Engineering-Nanodgree
Udacity Data Engineering Nanodegree Program
realtimedatalake/rtdl
rtdl makes it easy to build and maintain a real-time data lake
aws-samples/analyzing-reddit-sentiment-with-aws
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing Reddit comments in real time. 100-200 level tutorial.