big-data
There are 4837 repositories under the big-data topic.
binhnguyennus/awesome-scalability
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
ClickHouse/ClickHouse
ClickHouse® is a real-time analytics database management system
apache/spark
Apache Spark - A unified analytics engine for large-scale data processing
donnemartin/data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
apache/flink
Apache Flink
amark/gun
An open source cybersecurity protocol for syncing decentralized graph data.
heibaiying/BigData-Notes
A beginner's guide to big data :star:
prestodb/presto
The official home of the Presto distributed SQL query engine for big data
andkret/Cookbook
The Data Engineering Cookbook
apache/predictionio
PredictionIO, a machine learning server for developers and ML engineers.
yahoo/CMAK
CMAK is a tool for managing Apache Kafka clusters
trinodb/trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
vesoft-inc/nebula
A distributed, fast open-source graph database featuring horizontal scalability and high availability
provectus/kafka-ui
Open-Source Web UI for Apache Kafka Management
StarRocks/starrocks
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
quickwit-oss/quickwit
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
cython/cython
The most widely used Python to C compiler
catboost/catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
apache/beam
Apache Beam is a unified programming model for Batch and Streaming data processing.
delta-io/delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
apache/datafusion
Apache DataFusion SQL Query Engine
h2oai/h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
arkime/arkime
Arkime is an open source, large scale, full packet capturing, indexing, and database system.
apache/couchdb
Seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability
apache/zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
hazelcast/hazelcast
Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
vespa-engine/vespa
AI + Data, online. https://vespa.ai
feast-dev/feast
The Open Source Feature Store for AI/ML
pachyderm/pachyderm
Data-Centric Pipelines and Data Versioning
apache/iotdb
Apache IoTDB
apache/hive
Apache Hive
microsoft/SynapseML
Simple and Distributed Machine Learning
apache/ignite
Apache Ignite
apache/calcite
Apache Calcite
tschellenbach/Stream-Framework
Stream Framework is a Python library that allows you to build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
tangbc/vue-virtual-scroll-list
⚡️ A Vue component that supports large data lists with high rendering performance and efficiency.