Spark-Tableau-Application

A big data application that analyses several large datasets in Apache Spark and visualises the results in Tableau.
It analyses cab trends in New York City in relation to airports, events and businesses.

Tools and Framework -

The application code is written in Scala with Apache Spark
Data is stored in HDFS
Spatial joins on the smaller (non big data) datasets are done using ArcGIS
Visualisation is done using Tableau

Dataset Sources -

NYC Yellow Taxi Data - NYC Open Data (Abhinav)
Uber Pickups - Kaggle (Abhinav)
Lyft Pickups - Kaggle (Abhinav)
Legally Operating Businesses - NYC Open Data (Siddhant)
Permitted Events - NYC Open Data (Siddhant)
Airports Flight Data - https://www.transtats.bts.gov/ (Abhinav and Siddhant)

Dataset structure

datasets/
        lyft/
        uber/
        yellow_taxi/
        business/
        events/
        airports/

Run Data cleaning Job -

Submit the job with Cleaning.scala as the main class.

The cleaned data is stored in CSV format -
- The first line contains the column headers
- The remaining lines are comma-separated values

Dataset structure for storing cleaned data

datasets/
        cleanedTaxiDataset/
                           lyft/
                           uber/
                           yellow_taxi/
        cleanedEBDataset/
                         events/
                         business/
                         events_address/
        cleanedFlightDataset/
                             nyc_airports/

cleanedTaxiDataset was generated by Abhinav. cleanedEBDataset was generated by Siddhant. cleanedFlightDataset was generated by Abhinav and Siddhant.

Run Profiling Job

Submit the job with Profiling.scala as the main class.
Dataset structure for storing profiling data

datasets/
        profiling/

Abhinav - Profiling of the taxi, Uber, airports and Lyft datasets
Siddhant - Profiling of the events and businesses datasets

Mapping Dataset Structure

Datasets are mapped to one another by zip code. The mapping is done in 3 different ways -

ArcGIS to spatially map the taxi zones to zip codes.
Bing Maps developer API to generate zip codes from addresses in the events dataset. This runs as part of the cleaning job. A Bing Maps API key is required for this step. It is turned off by default.
Bing Maps developer API to generate zip codes from lat/long. This runs as part of the cleaning job. A Bing Maps API key is required for this step. It is turned off by default.
Use the switch -geocode to run the Bing Maps API
Use the switch -bingKey to provide the key for the Bing Maps API
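The two switches above could be read along the following lines. This is only a minimal sketch in plain Scala: the names CleaningOptions and parse are hypothetical, and the rule that -geocode fails without a key is an assumption based on the key requirement described above, not the project's actual parser.

```scala
import scala.annotation.tailrec

// Hypothetical holder for the two switches; not taken from the project source.
final case class CleaningOptions(geocode: Boolean = false, bingKey: Option[String] = None)

object CleaningOptions {
  /** Parses -geocode and -bingKey <key>; any other argument is ignored. */
  def parse(args: Seq[String]): CleaningOptions = {
    @tailrec
    def loop(rest: List[String], acc: CleaningOptions): CleaningOptions = rest match {
      case "-geocode" :: tail        => loop(tail, acc.copy(geocode = true))
      case "-bingKey" :: key :: tail => loop(tail, acc.copy(bingKey = Some(key)))
      case _ :: tail                 => loop(tail, acc)
      case Nil                       => acc
    }
    val opts = loop(args.toList, CleaningOptions())
    // Assumption: geocoding cannot run without a Bing Maps API key.
    require(!opts.geocode || opts.bingKey.nonEmpty, "-geocode requires -bingKey <key>")
    opts
  }
}
```

With no arguments the sketch leaves geocoding off, which matches the default described above.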

The airports dataset contained airport IDs and unique identifiers such as "JFK". These were mapped to the 3 taxi zones of the area, and this mapping was used to join the airports dataset to the cab datasets.

The mapping files are generated as independent datasets, which the analysis later uses for joins and calculations.

datasets/
        uber_data_mapping/
        yellow_taxi_join/
        events_address_mapping/
        events_address/

Abhinav - uber and yellow taxi mapping. Siddhant and Abhinav - events address mapping using Bing API.
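To illustrate how a zone-to-zip mapping file can be used in a later join, here is a minimal sketch in plain Scala rather than Spark. The object name ZipMapping, the function countsByZip, and the assumption that each taxi zone maps to exactly one zip code are all hypothetical simplifications, not the project's implementation.

```scala
object ZipMapping {
  /**
   * Re-keys per-zone counts by zip code and sums the zones that share a zip.
   * Zones without an entry in the mapping are dropped.
   */
  def countsByZip(zoneToZip: Map[Int, String],
                  countsByZone: Seq[(Int, Long)]): Map[String, Long] =
    countsByZone
      .flatMap { case (zone, count) => zoneToZip.get(zone).map(zip => zip -> count) }
      .groupBy { case (zip, _) => zip }
      .map { case (zip, rows) => zip -> rows.map(_._2).sum }
}
```

For example, with a made-up mapping Map(1 -> "10001", 2 -> "10001", 3 -> "11430") and counts Seq((1, 10L), (2, 5L), (3, 7L)), zones 1 and 2 are combined under zip 10001.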

Running Analytics Job

Submit the job with RunAnalytics.scala as the main class.
The job picks up the cleaned and mapped files and joins the datasets to create tables for analysis and plots.
Dataset structure for analytics data

datasets/
        Analytics/
                  comparision/
                              do_flight_count/
                              do_flight_passenger/
                              grouped_datasets/
                                               joinedAirportTaxi_DO/
                                               joinedAirportTaxi_PU/
                              pu_flight_count/
                              pu_flight_passenger/
                              zip_joined_2018/
                  event_address_join/
                  exploration/
                              business_year/
                              business_zip/
                              events_year/
                              events_zip/
                              lyft_year/
                              lyft_zip/
                              taxi_year/
                              taxi_zip/
                              uber_year/
                              uber_zip/
                  grouped_datasets/
                                   airport_dest_date/
                                   airport_origin_date/
                                   taxi_drop_date_airport/
                                   taxi_pickup_date_airport/
                  yellow_taxi/
                  uber_join/
                  lyft_join/
                  joined_datasets/
                                  join_event_taxi_normalised_ts/
                                  join_event_taxi_ts/
                                  outer_event_taxi_normalised_ts/
                                  outer_event_taxi_ts/

Abhinav and Siddhant

Running Comparison Job

Submit the job with Comparision.scala as the main class.
Run this job after the Analytics Job. It uses the files generated by the analytics job to compare the datasets further using mathematical measures such as the correlation between datasets.
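As an illustration of the kind of correlation mentioned above, the sketch below computes a Pearson correlation coefficient in plain Scala. Pearson is an assumption here, since the source does not say which correlation measure the job uses, and the names Correlation and pearson are hypothetical.

```scala
object Correlation {
  /**
   * Pearson correlation coefficient of two equally long series,
   * e.g. monthly taxi counts and monthly flight counts.
   * Returns NaN if either series is constant (zero variance).
   */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1,
      "need two series of equal length with at least 2 values")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations, and sums of squared deviations.
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

A value close to 1 means the two series rise and fall together, and a value close to -1 means one falls as the other rises.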

        Analytics/
                  comparision/
                              do_flight_count/
                              do_flight_passenger/
                              grouped_datasets/
                                               joinedAirportTaxi_DO/
                                               joinedAirportTaxi_PU/
                              pu_flight_count/
                              pu_flight_passenger/
                              zip_joined_2018/

Abhinav and Siddhant

Analytics and inferences -

Link to the Paper - https://drive.google.com/open?id=18uYjVGQr0e9sbEtLXJsPhEm_JbMKN3nM

Actuation and Remediation

This works like a real-time data update that the user may require. It can easily be extended into an API that provides analytics to computers and mobile devices.

Users can query various analytics to suit their purpose and needs. These will help cab consumers, taxi stakeholders and drivers to plan and drive more intelligently for maximised profits. Below are a few example switches that can be used to query the details. The queries are run with the file ActuationRemediation.scala. Use -acre to get the results.

-event : To get details based on events. Required for event details
-event_zip : Get events at your zip
-event_year : Get events that happened in the query year
-corr_taxi_ap : Get the Correlation of Taxi count with number of flights for every month (in asc order) for the year 2018
-corr_taxi_ap : Get the Correlation of Passenger count with number of flights for every month (in asc order) for the year 2018
-business : To get details based on business. Required for business details
-b_zip : Get businesses at your zip
-b_year : Get businesses for the query year
-y_taxi : To get details based on yellow taxis. Required for yellow taxi details
-y_taxi_zip : Get yellow taxis at your zip
-y_taxi_year : Get yellow taxis for the query year
-uber : To get details based on Uber cabs. Required for Uber details
-uber_zip : Get Uber cabs at your zip
-lyft : To get details based on Lyft cabs. Required for Lyft details
-lyft_zip : Get Lyft cabs at your zip

Visualisation

We use Tableau for visualisation. Below are a few screenshots from Tableau -

Abhinav and Siddhant

Acknowledgements

We would like to thank the Department of Computer Science at the NYU Courant Institute of Mathematical Sciences and the NYU High-Performance Computing group for supporting this project. We would also like to thank Prof. Suzanne McIntosh, who supported us and provided key insights that helped make this project a success.

We also extend our gratitude to the NYC Open Data team for providing free access to the data, and to transtats.bts.gov for providing historical data for flights at various airports.
