big-data

There are 3,976 repositories under the big-data topic.

  • binhnguyennus/awesome-scalability

    The Patterns of Scalable, Reliable, and Performant Large-Scale Systems

  • apache/spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Language: Scala · 38.5k · 2k · 0 · 28k
  • ClickHouse/ClickHouse

    ClickHouse® is a free analytics DBMS for big data

    Language: C++ · 34.6k · 685 · 19.7k · 6.5k
  • donnemartin/data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Language: Python · 26.5k · 1.6k · 39 · 7.7k
  • apache/flink

    Apache Flink

    Language: Java · 23.2k · 945 · 0 · 13k
  • amark/gun

    An open source cybersecurity protocol for syncing decentralized graph data.

    Language: JavaScript · 17.8k · 319 · 790 · 1.1k
  • prestodb/presto

    The official home of the Presto distributed SQL query engine for big data

    Language: Java · 15.6k · 861 · 6.3k · 5.3k
  • heibaiying/BigData-Notes

    A beginner's guide to big data ⭐

    Language: Java · 15.3k · 442 · 43 · 4.2k
  • questdb/questdb

    An open source time-series database for fast ingest and SQL queries

    Language: Java · 13.5k · 130 · 1.6k · 983
  • andkret/Cookbook

    The Data Engineering Cookbook

  • apache/predictionio

    PredictionIO, a machine learning server for developers and ML engineers.

    Language: Scala · 12.5k · 755 · 0 · 1.9k
  • yahoo/CMAK

    CMAK is a tool for managing Apache Kafka clusters

    Language: Scala · 11.7k · 533 · 685 · 2.5k
  • vesoft-inc/nebula

    A distributed, fast open-source graph database featuring horizontal scalability and high availability

    Language: C++ · 10.2k · 185 · 2.5k · 1.2k
  • trinodb/trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

    Language: Java · 9.6k · 169 · 6.2k · 2.8k
  • cython/cython

    The most widely used Python to C compiler

    Language: Python · 9k · 242 · 3.6k · 1.5k
  • provectus/kafka-ui

    Open-Source Web UI for Apache Kafka Management

    Language: Java · 8.6k · 68 · 1.7k · 1.1k
  • StarRocks/starrocks

    StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.

    Language: Java · 8k · 208 · 7.1k · 1.6k
  • catboost/catboost

    A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

    Language: Python · 7.8k · 192 · 2.3k · 1.2k
  • apache/beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

    Language: Java · 7.6k · 263 · 6.7k · 4.1k
  • delta-io/delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

    Language: Scala · 6.9k · 215 · 1.4k · 1.6k
  • h2oai/h2o-3

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

    Language: Jupyter Notebook · 6.7k · 384 · 9.3k · 2k
  • risingwavelabs/risingwave

    SQL stream processing, analytics, and management. We decouple storage and compute to offer speedy bootstrapping, dynamic scaling, time-travel queries, and efficient joins.

    Language: Rust · 6.4k · 78 · 5.8k · 521
  • apache/zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Language: Java · 6.3k · 317 · 0 · 2.8k
  • quickwit-oss/quickwit

    Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.

    Language: Rust · 6.2k · 54 · 2.1k · 274
  • arkime/arkime

    Arkime is an open source, large scale, full packet capturing, indexing, and database system.

    Language: JavaScript · 6.1k · 349 · 1.4k · 1k
  • pachyderm/pachyderm

    Data-Centric Pipelines and Data Versioning

    Language: Go · 6.1k · 163 · 3.1k · 564
  • apache/couchdb

    Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability

    Language: Erlang · 6k · 238 · 1.5k · 1k
  • hazelcast/hazelcast

    Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.

    Language: Java · 5.9k · 295 · 8.5k · 1.8k
  • vespa-engine/vespa

    AI + Data, online. https://vespa.ai

    Language: Java · 5.4k · 158 · 932 · 574
  • apache/hive

    Apache Hive

    Language: Java · 5.3k · 336 · 0 · 4.6k
  • feast-dev/feast

    The Open Source Feature Store for Machine Learning

    Language: Python · 5.3k · 72 · 1.3k · 938
  • apache/datafusion

    Apache DataFusion SQL Query Engine

    Language: Rust · 5.2k · 105 · 4.3k · 956
  • microsoft/SynapseML

    Simple and Distributed Machine Learning

    Language: Scala · 5k · 146 · 708 · 817
  • tschellenbach/Stream-Framework

    Stream Framework is a Python library for building news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.

    Language: Python · 4.7k · 210 · 186 · 542
  • apache/ignite

    Apache Ignite

    Language: Java · 4.7k · 280 · 104 · 1.9k
  • apache/calcite

    Apache Calcite

    Language: Java · 4.4k · 169 · 0 · 2.3k