ACM Sigspatial Cup 2016

Distributed computation on Geo-spatial data

This repository contains an implementation for the problem statement of ACM Signspatial 2016 to calculate the top 50 hotspots for New York city cabs with some constraints.

The location considered for this problem was the pickup location
The cell attribute value was the number of journeys instead of the number of passengers
The step size was of 1 day
The input data was restricted to January 2015
Neighbor weight was equal and fixed to 1

The metric for calculation of the hotspots was the Getis-Ord score and the implementation was done on Apache Spark.

Algorithm

The algorithm we developed allowed complete parallelization of the computation of the Getis Ord score using 2 Mapreduce phases:

Mapreduce Phase 1: Creation of cells
Map step: Parse each row from the input file and output a cell object having ID as the X (Latitude), Y (Longitude) and Z (Day) coordinates of the journey
Filter those cells which were outside a fixed geo-envelope
Reduce step: Add up all the journeys in a cell to calculate the attribute value for the cell
Intermediate Calculations
Calculate the mean and standard deviation from the cell information
Calculate the total number of cells in the cube
Mapreduce Phase 2: Calculating the Getis-Ord score
The key step in the calculation of the Getis-Ord score was finding the neighboring cells and obtaining their attribute value.
The magic sauce to getting the neighbor attribute value was as follows: 1. Map step: For each cell, calculate the IDs of its neighbors and send out its attribute value to all of its neighbors 2. Reduce step: For each cell during the reduce step, all of its neighbor values would have sent their values to this cell making all required data for calculating the Getis-Ord score available. The computation of the score was done locally for each cell
Retrieving top 50 hotspot cells
The top 50 hotspots were retrieved using the top function of Apache Spark since this was an efficient way to lookup and retrieve top-k values from an RDD

HawxChen/acm-sigspatial-cup-2016

ACM Sigspatial Cup 2016

Distributed computation on Geo-spatial data

Algorithm