mast30034-project-1-DigitalData created by GitHub Classroom

MAST30034 Project 1 README.md

  • Name: Xavier Travers
  • Student ID: 1178369

Research Goal

To determine the effect of viral case rates (COVID-19 and influenza) on the trip distances of New York City yellow taxis, per borough and per week.

Timeline

The research period runs from January 2020 to December 2021 (see the report for justification).

Pipeline

Run all the scripts from the repository's root directory (do not cd into the scripts folder).

  1. download.py: Downloads the raw data into the data/raw directory. Run with
python3 ./scripts/download.py
  2. generate_mmwr_weeks.py: Generates data/raw/mmwr_weeks.parquet, which is used to aggregate by week (the influenza data is already grouped by CDC/MMWR week). Run with
python3 ./scripts/generate_mmwr_weeks.py
  3. notebooks/preprocessing/preprocessing_part_1_cleaning.ipynb: Cleans the datasets (removes rows containing null or negative values where necessary).
  4. notebooks/preprocessing/preprocessing_part_2_aggregation.ipynb: Groups the datasets by MMWR week and pick-up borough.
  5. The data analysis notebooks: These can be run in any order, since they only generate plots and do not modify the data.
    • notebooks/data_analysis/data_analysis_distance_distribution.ipynb: Examines the distribution of trip distances.
    • notebooks/data_analysis/data_analysis_distance_vs_time.ipynb: Plots the trip distances over time.
    • notebooks/data_analysis/data_analysis_geospatial_distance_mapping.ipynb: Maps the average trip radii per borough.
    • notebooks/data_analysis/data_analysis_viral_cases_vs_time.ipynb: Plots the viral case rates over time.
    • notebooks/data_analysis/data_analysis_distance_modelling.ipynb: Generates the linear models of trip distances.
    • notebooks/data_analysis/data_analysis_trip_rates_vs_time.ipynb: Plots the trip rates over time. This is not used in the report.
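
The weekly aggregation in the pipeline relies on CDC/MMWR weeks, which run from Sunday to Saturday, with week 1 being the first week that has at least four days in the calendar year. Below is a minimal sketch of that rule using only the standard library. The function names mmwr_year_start and mmwr_week are hypothetical, and this is not necessarily how generate_mmwr_weeks.py builds its table.

```python
from datetime import date, timedelta


def mmwr_year_start(year: int) -> date:
    """Return the Sunday that begins MMWR week 1 of the given year.

    Week 1 is the first Sunday-to-Saturday week with at least four days
    in the calendar year, which is the week containing January 4th.
    """
    jan4 = date(year, 1, 4)
    # date.weekday() is Monday=0 ... Sunday=6, so shift to days since Sunday.
    days_since_sunday = (jan4.weekday() + 1) % 7
    return jan4 - timedelta(days=days_since_sunday)


def mmwr_week(d: date) -> tuple[int, int]:
    """Return (MMWR year, MMWR week number) for a calendar date."""
    year = d.year
    start = mmwr_year_start(year)
    if d < start:
        # Early January dates can belong to the last week of the previous year.
        year -= 1
        start = mmwr_year_start(year)
    else:
        next_start = mmwr_year_start(year + 1)
        if d >= next_start:
            # Late December dates can belong to week 1 of the next year.
            year += 1
            start = next_start
    return year, (d - start).days // 7 + 1


if __name__ == "__main__":
    # Boundaries of the research period (January 2020 to December 2021).
    for day in (date(2020, 1, 1), date(2020, 12, 31), date(2021, 12, 31)):
        print(day.isoformat(), mmwr_week(day))
```

Note that the MMWR year can differ from the calendar year near the new year, so a trip on 1 January 2021 would fall into week 53 of MMWR year 2020 under this rule.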

Python Scripts

There are several scripts in the scripts folder. Each is commented thoroughly enough that a per-script breakdown is not needed here.

Main Python Modules

These are used throughout the code and should be installed before running it. For a more detailed snapshot of the modules I had installed when running my code, see requirements.txt.

  • pyspark
  • pandas
  • matplotlib
  • statsmodels
  • geopandas
  • folium
  • numpy
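
As a quick check before running the pipeline, the following sketch reports which of the modules listed above cannot be found in the current environment. The helper name missing_modules is hypothetical and not part of the repository; it only tests whether each top-level module can be located, not whether its version matches requirements.txt.

```python
import importlib.util

# Top-level modules listed in this README.
REQUIRED_MODULES = [
    "pyspark",
    "pandas",
    "matplotlib",
    "statsmodels",
    "geopandas",
    "folium",
    "numpy",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All main modules were found.")
```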