Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

Users of Apache Spark can work with the Spark APIs from Python, R, Scala, or Java.

Packages

Language Bindings

Notebooks and IDEs

  • almond - A Scala kernel for Jupyter.
  • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.
  • Polynote - An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originating from Netflix.
  • sparkmagic - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.

General Purpose Libraries

  • itachi - A library that brings useful functions from modern database management systems to Apache Spark.
  • spark-daria - A Scala library with essential Spark functions and extensions to make you more productive.
  • quinn - A native PySpark implementation of spark-daria.
  • Apache DataFu - A library of general purpose functions and UDFs.
  • Joblib Apache Spark Backend - joblib backend for running tasks on Spark clusters.

SQL Data Sources

Spark SQL has several built-in data sources for files. These include csv, json, parquet, orc, and avro. It also supports JDBC databases as well as Apache Hive. Additional data sources can be added by including the packages listed below, or by writing your own.
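
As a rough illustration of how the built-in file sources are selected by name, here is a minimal PySpark sketch. The helper `read_builtin`, the function `demo`, the application name, and the `/tmp/...` paths are hypothetical and exist only for this example. It assumes `pyspark` is installed, and that the separate spark-avro module is on the classpath if the avro format is used.

```python
"""Sketch: reading from Spark SQL's built-in file data sources (PySpark)."""

# File formats named in the text above. Avro support ships as the separate
# spark-avro module, so it usually has to be added to the classpath
# (for example via --packages) before "avro" resolves.
BUILTIN_FILE_FORMATS = ("csv", "json", "parquet", "orc", "avro")


def read_builtin(spark, fmt, path, **options):
    """Load `path` with one of the built-in file formats.

    `spark` is expected to be a pyspark.sql.SparkSession. `read_builtin` is a
    hypothetical helper for this example, not part of Spark.
    """
    if fmt not in BUILTIN_FILE_FORMATS:
        raise ValueError(f"not a built-in file format: {fmt!r}")
    # DataFrameReader: pick the source by name, pass options, then load.
    return spark.read.format(fmt).options(**options).load(path)


def demo():
    """Assumes pyspark is installed; the paths below are placeholders."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-sources-sketch").getOrCreate()
    df = read_builtin(
        spark, "csv", "/tmp/example.csv", header="true", inferSchema="true"
    )
    df.printSchema()
    # Writing goes through the same mechanism in the other direction.
    df.write.format("parquet").mode("overwrite").save("/tmp/example.parquet")
    spark.stop()
```

Calling `demo()` on a machine with Spark available reads a CSV file and rewrites it as Parquet. Packages from the lists below typically plug into the same `format(...)` call under their own source name once they are on the classpath.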

Storage

  • Delta Lake - Storage layer with ACID transactions.
  • Apache Hudi - Upserts, deletes, and incremental processing on big data.
  • Apache Iceberg - Open table format for huge analytic datasets.
  • lakeFS - Integration with the lakeFS atomic versioned storage layer.

Bioinformatics

  • ADAM - Set of tools designed to analyse genomics data.
  • Hail - Genetic analysis framework.

GIS

  • Apache Sedona - Cluster computing system for processing large-scale spatial data.

Graph Processing

Machine Learning Extension

  • Apache SystemML - Declarative machine learning framework on top of Spark.
  • Mahout Spark Bindings [status unknown] - Linear algebra DSL and optimizer with R-like syntax.
  • KeystoneML - Type safe machine learning pipelines with RDDs.
  • JPMML-Spark - PMML transformer library for Spark ML.
  • ModelDB - A system to manage machine learning models for spark.ml and scikit-learn.
  • Sparkling Water - H2O interoperability layer.
  • BigDL - Distributed Deep Learning library.
  • MLeap - Execution engine and serialization format which supports deployment of org.apache.spark.ml models without a dependency on SparkSession.
  • Microsoft ML for Apache Spark - A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.
  • MLflow - Machine learning orchestration platform.

Middleware

  • Livy - REST server with extensive language support (Python, R, Scala), ability to maintain interactive sessions and object sharing.
  • spark-jobserver - Simple Spark as a Service which supports object sharing using so-called named objects. JVM only.
  • Apache Toree - IPython protocol based middleware for interactive applications.
  • Apache Kyuubi - A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.

Monitoring

Utilities

  • sparkly - Helpers & syntactic sugar for PySpark.
  • Flintrock - A command-line tool for launching Spark clusters on EC2.
  • Optimus - Data Cleansing and Exploration utilities with the goal of simplifying data cleaning.

Natural Language Processing

  • spark-nlp - Natural language processing library built on top of Apache Spark ML.

Streaming

  • Apache Bahir - Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Interfaces

  • Apache Beam - Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.
  • Koalas - Pandas DataFrame API on top of Apache Spark.

Data quality

  • deequ - A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
  • python-deequ - Python API for Deequ.

Testing

Web Archives

Workflow Management

Resources

Books

Papers

MOOCS

Workshops

Projects Using Spark

  • Oryx 2 - Lambda architecture platform built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
  • Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Docker Images

Miscellaneous

References

Wikipedia. 2017. “Apache Spark — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Apache_Spark&oldid=781182753.

License

Public Domain Mark
This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.

Inspired by sindresorhus/awesome.