datalake
There are 248 repositories under the datalake topic.
Sinaptik-AI/pandas-ai
Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
activeloopai/deeplake
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
paradedb/paradedb
Postgres for Search and Analytics
apache/hudi
Upserts, Deletes And Incremental Processing on Big Data.
treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
DataLinkDC/dinky
Dinky is a real-time data development platform based on Apache Flink, enabling agile data development, deployment and operation.
lakesoul-io/LakeSoul
LakeSoul is an end-to-end, real-time and cloud-native Lakehouse framework with fast data ingestion, concurrent updates and incremental data analytics on cloud storage for both BI and AI applications.
leo-project/leofs
The LeoFS Storage System
apache/gravitino
World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
zinggAI/zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
apache/amoro
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
leesf/hudi-resources
A collection of Apache Hudi resources
Datavault-UK/automate-dv
A free-to-use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool and a registered trademark of dbt Labs)
paradedb/pg_analytics
DuckDB-powered analytics for Postgres
linkedin/openhouse
Open Control Plane for Tables in Data Lakehouse
cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
japila-books/delta-lake-internals
The Internals of Delta Lake
pracdata/awesome-open-source-data-engineering
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
UncoderIO/Uncoder_IO
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
awslabs/aws-orbit-workbench
A Data Platform built for AWS, powered by Kubernetes.
UncoderIO/Roota
Roota is a public-domain language of threat detection and response that combines native queries from a SIEM, EDR, XDR, or Data Lake with standardized metadata and threat intelligence to enable automated translation into other languages
izhangzhihao/Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
WeBankFinTech/Streamis
A streaming application development and management system based on Linkis and DSS, with plans to provide workflow-like graphical drag-and-drop development.
martandsingh/ApacheSpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. The course ends with a few case studies.
LearningJournal/SparkProgrammingInScala
Apache Spark Course Material
apache/doris-website
Apache Doris Website
vim89/datapipelines-essentials-python
Simplified ETL process in Hadoop using Apache Spark. Has a complete ETL pipeline for a data lake. Provides SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
samber/awesome-olap
A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.
fuslab/anyscale
anyscale roadmap
hifxit/dataligo
A library to accelerate ML and ETL pipelines by connecting all data sources
PaloAltoNetworks/pan-cortex-data-lake-python
Python idiomatic SDK for Cortex™ Data Lake.
LearningJournal/Spark-Streaming-In-Scala
Apache Spark 3 - Structured Streaming Course Material
rlevchenko/terraform-azure-data
Terraform script to deploy almost all Azure Data Services
ExpediaGroup/apiary
Apiary provides modules which can be combined to create a federated cloud data lake