Spatio-Temporal Data Processing on Apache Spark/Apache Sedona

The purpose of this repository is to perform spatial and temporal data processing steps using Apache Sedona on the raw dataset of NYC Taxi Trip records (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). The final goal is to convert the dataset into two tensors: i) spatio-temporal tensor where total spatial NYC taxi zone is divided into a number of polygons/zones, and number of taxi pickups happened at each of the spatial zones are recorded in a temporal requence of 1 hour interval, ii) spatial grid tensor where spatial NYC taxi zone is converted into a grid of m by n cells, and each cell in the grid contains a feature vector of taxi trip records,

Downloading Dataset

In order to experiment with NYC taxi trip dataset, download the 'Yellow Taxi Trip Records' for January 2009 from the site: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Put the downloaded CSV file inside this folder 'data/taxi_trip'. The file name should be 'yellow_tripdata_2009-01.csv'. Other datasets are available under 'data' folder.

Utility Methods

At the beginning, some utility methods are defined some of which are based on Apache Sedona. This methods are required for spatial data processing such as getting adjacency of polygons and/or points based on various weighting scheme, partitioning a spatial RDD into a number of polygons, getting a list of geometry objects from a spatial RDD, getting Moran's I of an attribute in a spatial dataset, etc.

Spatio-Temporal Data Processing Steps

At first, it loads the shape file of NYC taxi zones as a spatial RDD and adds an index starting from 0 as an identifier of the taxi zones. After loading the CSV file containing the trip records of January 2009, it selects necessary columns required for our feature vector along with adding an identifier. The Trip_Pickup_DateTime column is converted to timestamp, and a range of timestamps is created with an interval of 1 hour. Latitudes and longitudes in the dataset are converted to spatial Point geometries in apache sedona and timestamp sequences are joined with trip records to assign each pickup the corresponding timestamp. The whole trip record dataset is converted to another spatial RDD where geometry objects are Points. Later, a spatial join is performed between two spatial RDDs using Apache Sedona to identify which trip pickups happened at which trip zones/polygons. After grouping the join result based on timestamps and taxi zones, we get the final spatio-temporal array of taxi pickups (Task i).

Spatio Data Processing Steps

Differences from previous step are that we create a spatial grid of m by n cells instead of working with default taxi zones in the shape file. The grid is represented as a spatial RDD. Unlike the task i, no temporal sequences are created. Similar to task 1, a spatial join using Apache sedona is performed between grid polygon RDDs and trip pickup point RDDs to find the taxi pickups associated with each grid cells. After grouping by grid cell/polygon ids, we get the number of pickups at each grid cell which is our goal for task ii.