Awesome Spark

A curated list of awesome Apache Spark packages and resources.

Table of Contents

Packages

Language Bindings

Notebooks and IDEs

  • Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out-of-the-box.
  • Spark Notebook - Scalable and stable Scala- and Spark-focused notebook bridging the gap between the JVM and data scientists (incl. extendable, typesafe and reactive charts).
  • sparkmagic - Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.

General Purpose Libraries

  • Succinct - Support for efficient queries on compressed data.

SQL Data Sources

Bioinformatics

  • ADAM - A set of tools designed to analyse genomics data.
  • Hail - A genetic analysis framework.

GIS

  • Magellan - Geospatial analytics using Spark.
  • GeoSpark - A cluster computing system for processing large-scale spatial data.

Time Series Analytics

  • Spark-Timeseries - A Scala / Java / Python library for interacting with time series data on Apache Spark.

Graph Processing

  • Mazerunner - Graph analytics platform on top of Neo4j and GraphX.
  • GraphFrames - DataFrame-based graph API.
  • neo4j-spark-connector - Bolt-protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support.

Machine Learning Extension

Middleware

  • Livy - REST server with extensive language support (Python, R, Scala) and the ability to maintain interactive sessions and share objects.
  • spark-jobserver - A simple Spark-as-a-Service implementation that supports object sharing through so-called named objects. JVM only.
  • Mist - HTTP and MQTT API intended to expose Spark to external services.
  • Apache Toree - IPython protocol based middleware for interactive applications.

Utilities

  • silex - A collection of tools ranging from ML extensions to additional RDD methods.

Natural Language Processing

Streaming

  • Apache Bahir - A collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Resources

Books

MOOCs

Workshops

Projects Using Spark

  • Oryx 2 - A lambda architecture built on Apache Spark and Apache Kafka with specialization for real-time large scale machine learning.
  • Photon ML - A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.
  • PredictionIO - Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.
  • Crossdata - Data integration platform with extended DataSource API and multi-user environment.

Blogs

  • Spark Technology Center - A great source of highly diverse posts related to the Spark ecosystem, from practical advice to Spark committer profiles.

Docker Images

Miscellaneous

License

Public Domain Mark
This work (Awesome Spark, by https://github.com/awesome-spark/awesome-spark), identified by Maciej Szymkiewicz, is free of known copyright restrictions.

Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation. This compilation is not endorsed by The Apache Software Foundation.