GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark / SparkSQL with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) / SpatialSQL that efficiently load, process, and analyze large-scale spatial data across machines.

GeoSpark contains several modules:

| Name | API | Spark compatibility | Introduction |
|---|---|---|---|
| Core | RDD | Spark 2.X/1.X | SpatialRDDs and Query Operators. |
| SQL | SQL/DataFrame | SparkSQL 2.1+ | SQL interfaces for GeoSpark core. |
| Viz | RDD, SQL/DataFrame | RDD - Spark 2.X/1.X, SQL - Spark 2.1+ | Visualization for Spatial RDD and DataFrame. |
| Zeppelin | Apache Zeppelin | Spark 2.1+, Zeppelin 0.8.1+ | GeoSpark plugin for Apache Zeppelin. |

GeoSpark supports several programming languages: Scala, Java, SQL, Python and R.
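
To give a rough feel for what a spatial query operator over partitioned data computes, here is a minimal plain-Python sketch of a range query. It does not use GeoSpark or Spark at all: the names `Point`, `Envelope`, and `range_query` are hypothetical and chosen only for illustration, and the "partitions" are ordinary lists standing in for data spread across machines.

```python
from typing import Iterable, List, NamedTuple


class Point(NamedTuple):
    """A 2D point (hypothetical type, not a GeoSpark class)."""
    x: float
    y: float


class Envelope(NamedTuple):
    """An axis-aligned rectangular query window (hypothetical type)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, p: Point) -> bool:
        # Boundary points count as inside the window.
        return (self.min_x <= p.x <= self.max_x
                and self.min_y <= p.y <= self.max_y)


def range_query(partitions: Iterable[List[Point]], window: Envelope) -> List[Point]:
    """Return every point that falls inside the query window.

    Each partition is filtered independently; on a cluster this step
    could run in parallel, here it is a plain sequential loop.
    """
    hits: List[Point] = []
    for part in partitions:
        hits.extend(p for p in part if window.contains(p))
    return hits


# Two small lists standing in for two partitions of a point dataset.
partitions = [
    [Point(1.0, 1.0), Point(5.0, 5.0)],
    [Point(2.5, 3.0), Point(-1.0, 0.0)],
]
window = Envelope(0.0, 0.0, 3.0, 3.0)
result = range_query(partitions, window)
print(result)
```

A real deployment would additionally rely on spatial partitioning and indexing so that most partitions can be skipped instead of scanned; this sketch omits both and shows only the filtering semantics.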

  • GeoSpark main developer Jia Yu will be a Tenure-Track Assistant Professor of Computer Science at Washington State University. He is looking for PhD students to join his lab! (read this)
  • GeoSpark 1.3.1 is released. This version provides a complete Python wrapper for the GeoSpark RDD and SQL APIs. It also contains a number of bug fixes and new functions from 12 contributors. See Python tutorial: RDD, Python tutorial: SQL, and the Release notes.

