spark

There are 8,334 repositories under the spark topic.

  • spark-jobserver

    REST job server for Apache Spark

    Language: Scala · 2.8k
  • cube-studio

    cube-studio

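    cube studio is an open-source, cloud-native, one-stop AI platform for machine learning, deep learning, and large models. It supports SSO login, multi-tenancy, big data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference serving, edge computing, serverless, an annotation platform, automated annotation, dataset management, large-model fine-tuning, vllm large-model inference, llmops, private knowledge bases, and an AI model app store. It also supports one-click model development/inference/fine-tuning, Chinese domestic CPU/GPU/NPU chips, RDMA, and distributed pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano.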

    Language: Jupyter Notebook · 2.7k
  • dpark

    Python clone of Spark, a MapReduce-like framework in Python

    Language: Python · 2.7k
  • spark-operator

    Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.

    Language: Go · 2.7k
  • BigDataGuide

    Big data learning: learn big data from scratch, with study videos and interview materials for each stage of big data learning

    Language: Java · 2.5k
  • spring-boot-quick

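    🌿 Quick-learning examples based on springboot, integrating open-source frameworks the author has encountered, such as rabbitmq (delayed queues), Kafka, jpa, redis, oauth2, swagger, jsp, docker, k3s, k3d, k8s, a mybatis encryption/decryption plugin, exception handling, log output, multi-module development, multi-environment packaging, cache, web crawlers, jwt, GraphQL, dubbo, zookeeper, Async, and more 📌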

    Language: Java · 2.4k
  • LakeSoul

    LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

    Language: Java · 2.3k
  • TransmogrifAI

    TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning

    Language: Scala · 2.2k
  • SZT-bigdata

    SZT-bigdata

    Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟

    Language: Scala · 2.2k
  • zio-quill

    Compile-time Language Integrated Queries for Scala

    Language: Scala · 2.1k
  • Quicksql

    A Flexible, Fast, Federated (3F) SQL Analysis Middleware for Multiple Data Sources

    Language: Java · 2.1k
  • paimon

    Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

    Language: Java · 2k
  • spark

    .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

    Language: C# · 2k
  • kyuubi

    Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

    Language: Scala · 2k
  • spark-ml-source-analysis

    An analysis of the principles behind the spark ml algorithms and of their concrete source code implementations

  • spark-cassandra-connector

    DataStax Connector for Apache Spark to Apache Cassandra

    Language: Scala · 1.9k
  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

    Language: Python · 1.9k
  • benchm-ml

    A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

    Language: R · 1.9k
  • ytsaurus

    YTsaurus is a scalable and fault-tolerant open-source big data platform.

    Language: C++ · 1.8k
  • Gaffer

    A large-scale entity and relation database supporting aggregation of properties

    Language: Java · 1.7k
  • .github

    ApacheCN open source organization: announcements, introduction, members, activities, and contact channels

    Language: CSS · 1.7k
  • elassandra

    Elassandra = Elasticsearch + Apache Cassandra

    Language: Java · 1.7k
  • gatk

    Official code repository for GATK versions 4 and up

    Language: Java · 1.6k
  • spark-py-notebooks

    Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

    Language: Jupyter Notebook · 1.6k
  • Spark

    ✨ Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to monitor and control all your devices from anywhere.

    Language: Go · 1.6k
  • almond

    A Scala kernel for Jupyter

    Language: Scala · 1.6k
  • Tutorial

    A summary of the full-stack knowledge architecture for backend development (Java, Golang)

    Language: Shell · 1.6k
  • elephas

    Distributed Deep learning with Keras & Spark

    Language: Python · 1.6k
  • BigData-Interview

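    🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from around the web, together with the author's own answer summaries. Currently covers interview knowledge summaries for the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks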

  • pyspark-example-project

    Implementing best practices for PySpark ETL jobs and applications.

    Language: Python · 1.5k
  • mleap

    MLeap: Deploy ML Pipelines to Production

    Language: Scala · 1.5k
  • seldon-server

    Machine Learning Platform and Recommendation Engine built on Kubernetes

    Language: Java · 1.5k
  • optimus

    🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

    Language: Python · 1.4k
  • apache-spark-internals

    The Internals of Apache Spark

  • carbondata

    High performance data store solution

    Language: Scala · 1.4k
  • dji-firmware-tools

    Tools for handling firmware of DJI products, with a focus on quadcopters.

    Language: C · 1.4k