apache-spark
There are 2034 repositories under the apache-spark topic.
mlflow/mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
microsoft/SynapseML
Simple and Distributed Machine Learning
treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
lw-lin/CoolplaySpark
Coolplay Spark: Spark source code analysis, Spark libraries, and more
spark-notebook/spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
kubeflow/spark-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
intel/BigDL
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
dotnet/spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
big-data-europe/docker-spark
Apache Spark docker image
feathr-ai/feathr
Feathr – A scalable, unified data and AI engineering platform for enterprise
awesome-spark/awesome-spark
A curated list of awesome Apache Spark packages and resources.
OryxProject/oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
ptyadana/SQL-Data-Analysis-and-Visualization-Projects
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
japila-books/apache-spark-internals
The Internals of Apache Spark
san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
databricks/LearningSparkV2
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
lensacom/sparkit-learn
PySpark + Scikit-learn = Sparkit-learn
mahmoudparsian/data-algorithms-book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
databricks/spark-sklearn
(Deprecated) Scikit-learn integration package for Apache Spark
graphframes/graphframes
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
sparklyr/sparklyr
R interface for Apache Spark
microsoft/Mobius
C# and F# language binding and extensions to Apache Spark
LucaCanali/sparkMeasure
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of Spark metrics, making it a practical choice for both developers and data engineers.
aloneguid/parquet-dotnet
Fully managed Apache Parquet implementation
lw-lin/streaming-readings
Papers and readings related to streaming systems
miguno/kafka-storm-starter
[PROJECT IS NO LONGER MAINTAINED] Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
mrpowers-io/quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
nchammas/flintrock
A command-line tool for launching Apache Spark clusters.
cerndb/dist-keras
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
apache-spark-on-k8s/spark
Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
openscoring/openscoring
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
infoslack/awesome-kafka
A list about Apache Kafka
cartershanklin/pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
japila-books/spark-sql-internals
The Internals of Spark SQL
rjurney/Agile_Data_Code_2
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
LucaCanali/Miscellaneous
Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing and measuring CPU performance, and Jupyter notebook examples for using various DB systems.