This repo analyzes the Chicago Taxi data and predicts future tax revenue per taxi.
This code repository is maintained as a Python package. You can install it with:
pip install .
However, because the package is still under development, installing it in editable mode is recommended:
pip install -e .
In editable mode, Python does not copy the package folder into site-packages. Instead, a link back to the source folder is registered, so any ad-hoc changes in the code are directly reflected in your work environment.
In Jupyter and IPython, you can enable the autoreload extension so that any ad-hoc changes are reloaded automatically without re-importing:
%load_ext autoreload
%autoreload 2
When the project is initialized, chitaxi can bind itself to a local folder (the workspace), so that any input and output files are looked up directly in that folder.
This can be done manually by editing the config.yaml file, or with the CLI command:
chitaxi --config-data PATH_TO_WORKPLACE
Options:
--config-reset
--config-data TEXT
--clean-taxi TEXT
--help Show this message and exit.
The original data is huge, around 6-10 GB per month, and can be downloaded from here: https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data
Put all downloaded CSV files into one clean folder. The CLI can then clean the files, aggregate them into a single HDF5 dataset, and put the resulting h5 file into the workspace.
Note:
- The dataset is large; the expected final output is above 20 GB
- If RAM is limited, split the data into CSV files that are as small as possible. This can be done with the date filters on the City of Chicago website
chitaxi --clean-taxi PATH_TO_CSV_FOLDER
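If a large CSV has already been downloaded, it can also be split locally instead of re-downloading it with date filters. The following is only a minimal sketch and is not part of chitaxi: `split_csv_by_month` is a hypothetical helper, and the column name `Trip Start Timestamp` and its timestamp format are assumptions about the portal's CSV export that should be checked against your own files. It streams the file row by row, so only one row is held in memory at a time.

```python
import csv
from datetime import datetime
from pathlib import Path


def split_csv_by_month(src, out_dir, ts_column="Trip Start Timestamp",
                       ts_format="%m/%d/%Y %I:%M:%S %p"):
    """Stream src row by row and write one smaller CSV per calendar month.

    ts_column and ts_format are assumptions about the export layout.
    Returns the list of files written, sorted by name.
    """
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles, writers = {}, {}
    try:
        with src.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                # Bucket each trip by the year and month of its start time.
                month = datetime.strptime(row[ts_column], ts_format).strftime("%Y%m")
                if month not in writers:
                    path = out_dir / f"{src.stem}_{month}.csv"
                    handles[month] = path.open("w", newline="", encoding="utf-8")
                    writers[month] = csv.DictWriter(
                        handles[month], fieldnames=reader.fieldnames)
                    writers[month].writeheader()
                writers[month].writerow(row)
    finally:
        # Close every output file even if a row fails to parse.
        for handle in handles.values():
            handle.close()
    return sorted(out_dir / f"{src.stem}_{m}.csv" for m in handles)
```

Write the split files to a separate folder and pass that folder to `--clean-taxi`, so the original large file is not picked up as well.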
Alternatively, you can directly download the pre-cleaned h5 file here:
Under your workspace:
.
├── chitaxi.h5
You can then load a slice of the data in any Python kernel:
from chitaxi.datasets.loader import get_data_taxi
# Get data within a range
df = get_data_taxi(start='20140101', end='20140201')
# Get data for a specific year
df = get_data_taxi(year=2015)
Make sure you have the correct HDF-format dataset in your CONFIG-DATA folder.
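A quick way to check this before loading is to confirm that the file exists and starts with the 8-byte HDF5 signature. This is only a sketch: `looks_like_hdf5` is a hypothetical helper, not part of chitaxi. It only inspects offset 0, so a valid HDF5 file that begins with a user block would be reported as not matching.

```python
from pathlib import Path

# 8-byte magic number that begins an HDF5 file without a user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if path is an existing file starting with the HDF5 signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

For example, `looks_like_hdf5(Path(workspace) / "chitaxi.h5")`, where `workspace` stands for the folder you passed to `--config-data`.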
The feature and label data can be directly downloaded from here