datalake
There are 225 repositories under the datalake topic.
Sinaptik-AI/pandas-ai
Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
StarRocks/starrocks
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
activeloopai/deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
paradedb/paradedb
Postgres for Search and Analytics
treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
DataLinkDC/dinky
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
lakesoul-io/LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
leo-project/leofs
The LeoFS Storage System
zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
leesf/hudi-resources
A collection of Apache Hudi related resources
datastrato/gravitino
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
Datavault-UK/automate-dv
A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
linkedin/openhouse
Open Control Plane for Tables in Data Lakehouse
japila-books/delta-lake-internals
The Internals of Delta Lake
awslabs/aws-orbit-workbench
A Data Platform built for AWS, powered by Kubernetes.
UncoderIO/Uncoder_IO
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
UncoderIO/RootA
Roota is a public-domain language of threat detection and response that combines native queries from a SIEM, EDR, XDR, or Data Lake with standardized metadata and threat intelligence to enable automated translation into other languages
izhangzhihao/Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
WeBankFinTech/Streamis
Streaming application development and management system, based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.
martandsingh/ApacheSpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
LearningJournal/SparkProgrammingInScala
Apache Spark Course Material
pracdata/awesome-open-source-data-engineering
A curated list of open source tools used in analytical stacks and the data engineering ecosystem
apache/doris-website
Apache Doris Website
vim89/datapipelines-essentials-python
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
GitDataAI/jiaozifs
A Git-like version control file system for data lineage & data collaboration.
fuslab/anyscale
anyscale roadmap
hifxit/dataligo
A library to accelerate ML and ETL pipelines by connecting all data sources
PaloAltoNetworks/pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
LearningJournal/Spark-Streaming-In-Scala
Apache Spark 3 - Structured Streaming Course Material
rlevchenko/terraform-azure-data
Terraform script to deploy almost all Azure Data Services
ExpediaGroup/apiary
Apiary provides modules which can be combined to create a federated cloud data lake
DataTech-Solutions/Threat-Detection-and-Visualization
Threat Detection and Visualization