Types of dependencies:
- Number and distance of rides by weather conditions
- Number and distance of rides per day, hour, weekday
- Taxi rides dataset: Yellow Taxi Trip Records, January 2014
- Weather dataset: hourly data from 2012 to 2017
Before processing, make sure the following folders exist at these locations:
- folder ./data exists
- empty folder ./data/processing
- empty folder ./data/reduced
- folder ./data/weather contains the files extracted from the downloaded weather dataset ZIP
- folder ./data/yellow_tripdata_2014-01 contains the extracted file from the taxi rides dataset download
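The folder layout above can be verified before a run. The following is a minimal sketch using only the standard library; the helper name `check_data_layout` and its messages are hypothetical and not part of the project, and only the folder names come from the list above.

```python
from pathlib import Path

# Folder names come from the list above; everything else here is illustrative.
REQUIRED = ["processing", "reduced", "weather", "yellow_tripdata_2014-01"]
MUST_BE_EMPTY = {"processing", "reduced"}


def check_data_layout(folder_data):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(folder_data)
    if not root.is_dir():
        return [f"missing folder: {root}"]
    problems = []
    for name in REQUIRED:
        sub = root / name
        if not sub.is_dir():
            problems.append(f"missing folder: {sub}")
        elif name in MUST_BE_EMPTY and any(sub.iterdir()):
            problems.append(f"folder is not empty: {sub}")
        elif name not in MUST_BE_EMPTY and not any(sub.iterdir()):
            problems.append(f"folder has no extracted files: {sub}")
    return problems
```

Calling `check_data_layout("./data")` and printing the returned list is enough to spot a missing or non-empty folder before starting the long mapping step.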
Processing one month of rides includes the following steps:
- prepare_weather - combines the raw files into a single dataset (~3 s)
- mapping_taxi_rides - maps rides to groups by hour, day and weekday and joins them with weather (~1525 s, ~25 min); data grows from 2.1 GB to 6.8 GB
- reducing_taxi_rides - aggregates summary values and merges them into fewer files (~159 s, ~2.7 min); data shrinks from 6.8 GB to 123 KB
- plot_analysis - draws charts showing the correlation between time, rides, distance and weather (~2 s)
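The mapping step described above can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the project's actual code: the function name `map_rides` and the column names `pickup_datetime`, `datetime` and `temperature` are assumed and may differ from the real datasets.

```python
import pandas as pd


def map_rides(rides, weather):
    """Attach hour/day/weekday keys to each ride and join hourly weather.

    Assumed columns (not confirmed by the project):
      rides:   pickup_datetime
      weather: datetime, plus any measurement columns
    """
    rides = rides.copy()
    ts = pd.to_datetime(rides["pickup_datetime"])
    rides["hour_ts"] = ts.dt.floor("h")   # join key: start of the pickup hour
    rides["day"] = ts.dt.day
    rides["hour"] = ts.dt.hour
    rides["weekday"] = ts.dt.weekday      # Monday is 0

    weather = weather.copy()
    weather["hour_ts"] = pd.to_datetime(weather["datetime"]).dt.floor("h")
    weather = weather.drop(columns="datetime")

    # Left join keeps every ride, even when no weather row matches its hour.
    return rides.merge(weather, on="hour_ts", how="left")
```

Every ride row gains the weather columns after the join, which is one plausible reason the mapped data is larger than the input.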
Under ./src you can find a Python project that performs the same processing using multithreading. Before starting it, make sure the folders are in the state described above and the variable folder_data is set to the absolute path of the data folder.
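How the project distributes work across threads is not described here. As a rough sketch of the general idea, a thread pool can apply one worker function to every file in a folder. The names `process_all` and `count_lines` are hypothetical, and the path assigned to `folder_data` is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Placeholder; set this to the absolute path of your data folder.
folder_data = "/absolute/path/to/data"


def count_lines(path):
    """Example worker: return the file name and its number of lines."""
    with path.open() as handle:
        return path.name, sum(1 for _ in handle)


def process_all(folder, worker, max_workers=4):
    """Apply `worker` to every file in `folder` using a pool of threads."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `files`.
        return list(pool.map(worker, files))
```

Threads help most when the work is dominated by file I/O; CPU-heavy pandas work may benefit more from processes, so treat the choice as something to measure.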
- slicing a big file into smaller ones with a memory-optimized approach (without loading the whole file into memory)
- reducing files to aggregated ones for analysis
- grouping a pandas DataFrame by multiple columns
- applying different aggregation functions during the same grouping
- filling gaps in pandas with interpolation
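The techniques listed above can be illustrated with a short pandas sketch. It is a simplified example under assumptions, not the project's implementation: the function names and the columns `weekday`, `hour` and `trip_distance` are chosen for illustration only.

```python
import pandas as pd


def summarize_in_chunks(csv_source, chunksize=100_000):
    """Aggregate a large CSV chunk by chunk instead of loading it whole."""
    partials = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Group by several columns and apply different aggregations at once.
        partials.append(
            chunk.groupby(["weekday", "hour"]).agg(
                rides=("trip_distance", "count"),
                distance=("trip_distance", "sum"),
            )
        )
    # Counts and sums are additive, so per-chunk results combine with a sum.
    return pd.concat(partials).groupby(level=["weekday", "hour"]).sum()


def fill_gaps(frame):
    """Fill missing numeric values by linear interpolation between neighbours."""
    return frame.interpolate(method="linear")
```

Only additive statistics such as counts and sums can be merged across chunks this way; a mean would have to be derived afterwards from the combined sum and count.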