spark

There are 8212 repositories under the spark topic.

  • apache/spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Language: Scala 38.3k2k027.9k
  • donnemartin/data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Language: Python 26.4k1.6k397.7k
  • getredash/redash

    Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

    Language: Python 24.9k5772.4k4.2k
  • yeasy/docker_practice

    Learn and understand Docker & Container technologies, with real DevOps practice!

    Language: Go 24.2k8452115.7k
  • DataTalksClub/data-engineering-zoomcamp

    Free Data Engineering course!

    Language: Jupyter Notebook 22.4k4031244.8k
  • heibaiying/BigData-Notes

    A beginner's guide to big data ⭐

    Language: Java 15.3k442434.1k
  • GaiZhenbiao/ChuanhuChatGPT

    GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

    Language: Python 14.7k847542.2k
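  • zhisheng17/flink-learning

    Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, and Table API & SQL, plus case studies of large-scale production Flink projects (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). Support for my column "Big Data Real-Time Computing Engine: Flink in Practice and Performance Optimization" (《大数据实时计算引擎 Flink 实战与性能优化》) is welcome.

    Language: Java 14.2k51603.9k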
  • horovod/horovod

    Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

    Language: Python 13.9k3342.2k2.2k
  • aalansehaiyang/technology-talk

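    [Big Tech Interview Column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!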

  • FavioVazquez/ds-cheatsheets

    List of Data Science Cheatsheets to rule the world

  • deeplearning4j/deeplearning4j

    Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.

    Language: Java 13.4k7695.7k3.8k
  • apache/doris

    Apache Doris is an easy-to-use, high performance and unified analytics database.

    Language: Java 11.3k2776.8k3k
  • wangzhiwubigdata/God-Of-BigData

    Focused on big data learning and interview preparation; start your path to big data mastery. Flink/Spark/Hadoop/Hbase/Hive...

  • mage-ai/mage-ai

    🧙 Build, run, and manage data pipelines for integrating and transforming data.

    Language: Python 7k60581622
  • delta-io/delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

    Language: Scala 6.9k2141.4k1.6k
  • h2oai/h2o-3

    H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

    Language: Jupyter Notebook 6.7k3839.3k2k
  • Angel-ML/angel

    A Flexible and Powerful Parameter Server for large-scale machine learning

    Language: Java 6.7k4506061.6k
  • Alluxio/alluxio

    Alluxio, data orchestration for analytics and machine learning in the cloud

    Language: Java 6.6k4382.2k2.9k
  • risingwavelabs/risingwave

    Cloud-native SQL stream processing, analytics, and management. KsqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.

    Language: Rust 6.3k785.6k508
  • apache/zeppelin

    Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

    Language: Java 6.3k31602.8k
  • donnemartin/dev-setup

    macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

    Language: Python 6.1k188461.1k
  • tobymao/sqlglot

    Python SQL Parser and Transpiler

    Language: Python 5.4k391.4k538
  • microsoft/SynapseML

    Simple and Distributed Machine Learning

    Language: Scala 5k147707812
  • PipelineAI/pipeline

    PipelineAI

    Language: Jsonnet 4.2k347254972
  • yahoo/TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

    Language: Python 3.9k283367947
  • Cyb3rWard0g/HELK

    The Hunting ELK

    Language: Jupyter Notebook 3.7k216452672
  • JohnSnowLabs/spark-nlp

    State of the Art Natural Language Processing

    Language: Scala 3.7k100865698
  • lw-lin/CoolplaySpark

    Coolplay Spark: Spark source code analysis, Spark libraries, and more

    Language: Scala 3.4k443371.4k
  • RoaringBitmap/RoaringBitmap

    A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

    Language: Java 3.4k132316525
  • liyupi/sql-generator

    🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (logic-heavy, light on UI) that is good for practice.

    Language: Vue 3.4k2021695
  • databricks/koalas

    Koalas: pandas API on Apache Spark

    Language: Python 3.3k315588353
  • apache/linkis

    Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.

    Language: Java 3.2k2622.5k1.1k
  • spark-notebook/spark-notebook

    Interactive and Reactive Data Science using Scala and Spark.

    Language: JavaScript 3.1k190515654
  • awslabs/deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

    Language: Scala 3.1k81331513
  • WeBankFinTech/DataSphereStudio

    DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

    Language: Java 2.9k181741984