Big Data Application to analyse several large datasets in Apache Spark and visualize the results in Tableau.
Analyse Cab Trends with Airports, Events and Businesses in New York City
- Big data tools and frameworks, namely Scala with Spark, are used to write the application code.
- Data storage system is HDFS
- Spatial joins on the smaller (non-big) datasets are done using ArcGIS
- Visualisation is done using Tableau
- NYC Yellow Taxi Data - NYC Open Data (Abhinav)
- Uber Pickups - Kaggle (Abhinav)
- Lyft Pickups - Kaggle (Abhinav)
- Legally Operating Businesses - NYC Open Data (Siddhant)
- Permitted Events - NYC Open Data (Siddhant)
- Airports Flight Data - https://www.transtats.bts.gov/ (Abhinav and Siddhant)
datasets/
    lyft/
    uber/
    yellow_taxi/
    business/
    events/
    airports/
Submit the job running the Main class - Cleaning.scala
- The cleaned data is stored in CSV-style formatting -
    - The first line holds the headers of the file
    - The rest of the data is comma-separated values
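The layout above (one header line, then comma-separated rows) can be read back with a few lines of plain Scala. This is only a minimal sketch: `CleanedCsv` and `parse` are hypothetical names that do not come from the project, and quoted fields containing commas are not handled.

```scala
// Minimal sketch: read "header line + comma-separated rows" into records.
// CleanedCsv and parse are hypothetical names, not taken from the project.
object CleanedCsv {
  // Turns the lines of one cleaned file into column-name -> value maps.
  def parse(lines: Seq[String]): Seq[Map[String, String]] =
    lines.headOption match {
      case None => Seq.empty
      case Some(header) =>
        // limit -1 keeps trailing empty fields so columns stay aligned
        val columns = header.split(",", -1).map(_.trim)
        lines.tail.filter(_.nonEmpty).map { row =>
          columns.zip(row.split(",", -1).map(_.trim)).toMap
        }
    }
}
```

For example, `CleanedCsv.parse(Seq("zip,year,count", "10001,2018,42"))` gives one record keyed by the (made-up) column names `zip`, `year` and `count`.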
Dataset structure for storing cleaned data
datasets/
    cleanedTaxiDataset/
        lyft/
        uber/
        yellow_taxi/
    cleanedEBDataset/
        events/
        business/
        events_address/
    cleanedFlightDataset/
        nyc_airports/
cleanedTaxiDataset was generated by Abhinav. cleanedEBDataset was generated by Siddhant. cleanedFlightDataset was generated by Abhinav and Siddhant.
Submit the job running the Main class - Profiling.scala
Dataset structure for storing profiling data
datasets/ profiling/
Abhinav - Profiling of the taxi, Uber, airports and Lyft datasets
Siddhant - Profiling of the events and businesses datasets
Mapping is done by zip code. We mapped the datasets in 3 different ways -
- ArcGIS to spatially map the taxi zones to zip codes.
- Bing Maps developer API to generate zip codes from addresses in the events dataset.
- Bing Maps developer API to generate zip codes from lat/long.
The Bing Maps geocoding is part of the cleaning job and is turned off by default. You must provide a Bing Maps API key to run event mapping.
Use the switch -geocode to run the Bing Maps API
Use the switch -bingKey to provide the key for the Bing Maps API
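How the two switches could be read is sketched below. It is not the project's actual parser: `CleaningArgs`, `Options` and `canGeocode` are hypothetical names, and the sketch assumes that -geocode is a plain on/off flag and that the key follows -bingKey as the next argument.

```scala
// Hedged sketch of switch handling; names and argument layout are assumptions.
object CleaningArgs {
  // Geocoding stays off unless -geocode is passed.
  final case class Options(geocode: Boolean = false, bingKey: Option[String] = None)

  // Walks the argument list once; unknown arguments are ignored in this sketch.
  def parse(args: List[String], acc: Options = Options()): Options = args match {
    case "-geocode" :: rest        => parse(rest, acc.copy(geocode = true))
    case "-bingKey" :: key :: rest => parse(rest, acc.copy(bingKey = Some(key)))
    case _ :: rest                 => parse(rest, acc)
    case Nil                       => acc
  }

  // Geocoding can only run when it is switched on and a key is available.
  def canGeocode(o: Options): Boolean = o.geocode && o.bingKey.isDefined
}
```

Keeping the default at "off" means a run without any switches never calls the Bing Maps API, which matches the behaviour described above.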
The airports dataset contains an Airport ID and unique identifiers such as "JFK". These were mapped to the 3 taxi zones of the area, and the mapping was used to join the airports dataset to the cab data.
The mapping files are generated as independent datasets, to be used later by the analysis for joins and calculations.
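The airport-to-zone mapping and the join it enables can be pictured with the sketch below. Everything in it is an assumption for illustration: the airport codes other than "JFK", the zone IDs (which should be checked against the TLC taxi zone lookup table), and the `Flight` and `TaxiTrip` record shapes are not taken from the project.

```scala
// Hedged sketch: join flights to taxi trips through an airport -> taxi zone map.
object AirportZones {
  // Assumed airport code -> TLC taxi zone ID mapping; verify against the TLC lookup table.
  val zoneOf: Map[String, Int] = Map("JFK" -> 132, "LGA" -> 138, "EWR" -> 1)

  // Hypothetical record shapes.
  final case class Flight(airport: String, date: String, passengers: Int)
  final case class TaxiTrip(zoneId: Int, date: String)

  // For each (airport, date): total flight passengers and number of taxi trips in that airport's zone.
  def join(flights: Seq[Flight], trips: Seq[TaxiTrip]): Map[(String, String), (Int, Int)] = {
    val tripCounts = trips.groupBy(t => (t.zoneId, t.date)).map { case (k, v) => (k, v.size) }
    flights
      .filter(f => zoneOf.contains(f.airport)) // drop airports without a mapped zone
      .groupBy(f => (f.airport, f.date))
      .map { case ((airport, date), fs) =>
        val taxiCount = tripCounts.getOrElse((zoneOf(airport), date), 0)
        ((airport, date), (fs.map(_.passengers).sum, taxiCount))
      }
  }
}
```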
datasets/ uber_data_mapping/ yellow_taxi_join/ events_address_mapping/ events_address/
Abhinav - Uber and yellow taxi mapping.
Siddhant and Abhinav - events address mapping using the Bing Maps API.
Submit the job running the Main class - RunAnalytics.scala
The job picks up the cleaned and mapped files and joins the datasets to create tables for analysis and plots.
Dataset structure for analytics data
datasets/ Analytics/ comparision/ do_flight_count/ do_flight_passenger/ grouped_datasets/ joinedAirportTaxi_DO/ joinedAirportTaxi_PU/ pu_flight_count/ pu_flight_passenger/ zip_joined_2018/ event_address_join/ exploration/ business_year/ business_zip/ events_year/ events_zip/ lyft_year/ lyft_zip/ taxi_year/ taxi_zip/ uber_year/ uber_zip/ grouped_datasets/ airport_dest_date/ airport_origin_date/ taxi_drop_date_airport/ taxi_pickup_date_airport/ yellow_taxi/ uber_join/ lyft_join/ joined_datasets/ join_event_taxi_normalised_ts/ join_event_taxi_ts/ outer_event_taxi_normalised_ts/ outer_event_taxi_ts/
Abhinav and Siddhant
Submit the job running the Main class - Comparision.scala
Run this job after the Analytics job. It uses the files generated by the Analytics job to further compare the datasets using mathematical comparisons such as the correlation between datasets.
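As an illustration of such a comparison, here is a minimal sketch of the Pearson correlation between two series, for example monthly taxi counts and monthly flight counts. The README does not say which correlation coefficient the job computes, so Pearson is an assumption, and `Correlation.pearson` is a hypothetical helper rather than project code.

```scala
// Hedged sketch: Pearson correlation of two equally long numeric series.
object Correlation {
  // Returns None when the coefficient is undefined (length mismatch, fewer than
  // two points, or a series with zero variance).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Option[Double] = {
    if (xs.length != ys.length || xs.length < 2) None
    else {
      val n = xs.length
      val meanX = xs.sum / n
      val meanY = ys.sum / n
      // Sums of products of deviations; the 1/n factors cancel in the ratio.
      val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
      val varX = xs.map(x => math.pow(x - meanX, 2)).sum
      val varY = ys.map(y => math.pow(y - meanY, 2)).sum
      if (varX == 0.0 || varY == 0.0) None else Some(cov / math.sqrt(varX * varY))
    }
  }
}
```

On Spark DataFrames, `df.stat.corr(col1, col2)` computes the same coefficient without collecting the data to the driver.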
Analytics/ comparision/ do_flight_count/ do_flight_passenger/ grouped_datasets/ joinedAirportTaxi_DO/ joinedAirportTaxi_PU/ pu_flight_count/ pu_flight_passenger/ zip_joined_2018/
Abhinav and Siddhant
Link to the Paper - https://drive.google.com/open?id=18uYjVGQr0e9sbEtLXJsPhEm_JbMKN3nM
This works like a real-time data update that the user may require. It can easily be extended into an API that provides analytics to computers and mobile devices.
One can query various analytics for one's own purposes and needs. These will help cab consumers, taxi stakeholders and drivers to plan and drive more intelligently for maximised profits. Below are a few example switches that can be used to query the details. The queries are run through the file ActuationRemediation.scala; use -acre to get the results.
- -event : To get details based on events. Required for event details
- -event_zip : Get events at your zip
- -event_year : Get events that happened in the query year
- -corr_taxi_ap : Get the correlation of taxi count with number of flights for every month (in ascending order) for the year 2018
- -corr_taxi_ap : Get the correlation of passenger count with number of flights for every month (in ascending order) for the year 2018
- -business : To get details based on business. Required for business details
- -b_zip : Get businesses at your zip
- -b_year : Get businesses for the query year
- -y_taxi : To get details based on yellow taxis. Required for yellow taxi details
- -y_taxi_zip : Get yellow taxis at your zip
- -y_taxi_year : Get yellow taxi trips for the query year
- -uber : To get details based on uber cabs. Required for uber details
- -uber_zip : Get uber cabs at your zip
- -lyft : To get details based on lyft cabs. Required for lyft details
- -lyft_zip : Get lyft cabs at your zip
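The zip and year switches above boil down to optional filters over an aggregated dataset. The sketch below shows that idea only: `QuerySketch`, `Record` and `query` are hypothetical names, the record shape is made up, and the real ActuationRemediation.scala may work differently.

```scala
// Hedged sketch: optional zip/year filters, in the spirit of -event_zip and -event_year.
object QuerySketch {
  // Hypothetical shape of one aggregated row.
  final case class Record(zip: String, year: Int, count: Long)

  // A filter that is None is not applied; both can be combined.
  def query(rows: Seq[Record], zip: Option[String] = None, year: Option[Int] = None): Seq[Record] =
    rows.filter(r => zip.forall(_ == r.zip) && year.forall(_ == r.year))
}
```

For instance, `QuerySketch.query(rows, zip = Some("10001"))` keeps only the rows for that zip code, and adding `year = Some(2018)` narrows the result further.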
We use Tableau for visualisation. Below are a few screenshots from Tableau -
Abhinav and Siddhant
We would like to thank the Department of Computer Science at the NYU Courant Institute of Mathematical Sciences and the NYU High-Performance Computing group for supporting this project. We would also like to thank Prof. Suzanne McIntosh, who supported us and provided key insights that helped make this project a success.
We also extend our gratitude to the NYC Open Data team for providing free access to the data, and we thank transtats.bts.gov for providing historical data on flights at various airports.
- NYC Open Data
- https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- https://www.transtats.bts.gov/
- https://data.cityofnewyork.us/City-Government/NYC-Permitted-Event-Information-Historical/bkfu-528j
- https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city
- https://data.cityofnewyork.us/Business/Legally-Operating-Businesses/w7w3-xahh
- https://data.cityofnewyork.us/City-Government/NYC-Permitted-Event-Information/tvpp-9vvx
- Spark: The Definitive Guide by Bill Chambers and Matei Zaharia