This repository contains the source code and data for a project that aims to predict taxi demand around Central Park using time series data. The project focuses on converting the time series data into a tabular format and utilizing it to predict the number of pickups for a given hour based on the previous hour's data.
The data folder contains two subdirectories: raw
and transformed
.
-
raw
: This directory contains raw data files in Parquet format representing taxi rides for each month of the year 2022. The files are named as follows:rides_2022-MM.parquet
, whereMM
represents the month. -
transformed
: This directory contains transformed data files in Parquet format, including the final tabular data used for modeling (tabular_data.parquet
). Additionally, there are intermediate files generated during the data transformation process. -
taxi_zones.csv
: is required for the geolocation of the taxi zones.
The notebooks
directory consists of Jupyter notebooks used for different stages of the project:
-
00_functions.ipynb
: A notebook containing custom functions used throughout the project. -
01_load_and_validate_raw_data.ipynb
: A notebook for loading and validating the raw data. -
02_transform_raw_data_to_time_series.ipynb
: A notebook that transforms raw data into time series format. -
03_time_series_data.ipynb
: A notebook exploring and analyzing time series data. -
04_transform_raw_data_into_features_and_targets.ipynb
: A notebook responsible for feature engineering. -
05_visualize_training_data.ipynb
: A notebook used to visualize the training data. -
06_baseline_model.ipynb
: A notebook implementing a baseline model for prediction. -
07_XGBoost_model.ipynb
: A notebook presenting an XGBoost model for prediction. -
08_catboost.ipynb
: A notebook demonstrating the CatBoost model. -
09_catboost_model_with_feature_engineering.ipynb
: A notebook combining CatBoost with feature engineering. -
10_catboost_with_hyperparameter_tuning.ipynb
: A notebook for hyperparameter tuning of CatBoost.
The catboost_info
directory stores additional files related to CatBoost training.
To set up the environment for this project, you can use poetry
to install the required dependencies. Use the provided pyproject.toml
and poetry.lock
files to manage dependencies. Run the following command to create the environment:
poetry install
After installing the required dependencies, you can use the Jupyter notebooks in the notebooks
directory to explore the project and run the code cells sequentially.
Contributions to this project are welcome! If you find any issues or have ideas for improvements, please open an issue or submit a pull request to contribute.