"NYC Taxi Trip Duration Prediction Using Machine Learning"
This project aims to accurately predict taxi trip durations in New York City using a variety of machine learning techniques. By analyzing a comprehensive dataset of taxi trips, we develop models that consider factors such as pickup and dropoff locations, trip distances, time of day, and traffic conditions. Our goal is to enhance ride-sharing efficiency and improve urban mobility planning.
- NYC Taxi Trip Records: Detailed trip data including pickup and dropoff coordinates, trip distances, and durations.
- OSRM (Open Source Routing Machine): Routing engine built on OpenStreetMap road network data, used for calculating route distances and expected travel times.
- NYC Weather Data: Historical weather information to examine its impact on trip durations.
- Data Preprocessing: Cleaning, feature extraction, and normalization of taxi trip and external datasets.
- Feature Engineering: Creating new features like trip distance from coordinates, time of day, day of the week, and weather conditions.
- Exploratory Data Analysis (EDA): Analyzing the datasets to uncover patterns and relationships that inform our modeling strategy.
- Model Development: Training and evaluating several models, including Decision Trees, Random Forest, Gradient Boosting, and XGBoost.
- Model Tuning: Hyperparameter optimization to improve model performance.
- Evaluation: Using Root Mean Squared Logarithmic Error (RMSLE) to assess model accuracy.
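
To make the feature engineering and evaluation steps above concrete, here is a minimal sketch rather than the project's actual code. It derives a straight-line (haversine) trip distance from coordinates, extracts hour and day of week from the pickup timestamp, and computes RMSLE. The column names (`pickup_datetime`, `pickup_latitude`, and so on) and the helper names `haversine_km`, `add_features`, and `rmsle` are assumptions chosen for illustration. The sample coordinates are made up.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def add_features(df):
    """Return a copy of df with distance and time features added.

    Column names are assumed, not taken from the project.
    """
    out = df.copy()
    out["distance_km"] = haversine_km(
        out["pickup_latitude"], out["pickup_longitude"],
        out["dropoff_latitude"], out["dropoff_longitude"],
    )
    pickup = pd.to_datetime(out["pickup_datetime"])
    out["hour"] = pickup.dt.hour
    out["day_of_week"] = pickup.dt.dayofweek  # Monday = 0
    return out


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Illustrative rows only; not real trip records.
trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
    "pickup_latitude": [40.7679, 40.7386],
    "pickup_longitude": [-73.9822, -73.9804],
    "dropoff_latitude": [40.7656, 40.7312],
    "dropoff_longitude": [-73.9646, -73.9995],
})
features = add_features(trips)
```

RMSLE penalizes relative rather than absolute error, so a 60-second miss on a 5-minute trip counts for more than the same miss on a 40-minute trip. The haversine distance ignores the street grid, which is why a routed distance from OSRM can be a useful complement.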
- Python: Main programming language for data processing and modeling.
- Pandas & NumPy: For data manipulation and numerical calculations.
- Scikit-learn: For machine learning model implementation and evaluation.
- XGBoost: For gradient-boosted tree models.
- Matplotlib & Seaborn: For data visualization.
Discussion of the best performing models and their practical implications for taxi companies and city transportation planning.
Instructions on setting up the project environment, including required libraries and how to run the scripts.
Examples of how to execute the modeling pipeline, from data preprocessing to making predictions.
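
The project's own scripts are not shown here, so the following is only a hedged sketch of what an end-to-end run could look like, assuming scikit-learn is installed. It uses synthetic data in place of the real trip records, and the feature set, the rush-hour speed rule, and the model settings are all assumptions made for illustration. The model is trained on `log1p(duration)`, so the RMSE on that scale is the RMSLE on the original scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for engineered features (not real taxi data).
rng = np.random.default_rng(0)
n = 2000
distance_km = rng.uniform(0.5, 20.0, n)
hour = rng.integers(0, 24, n)

# Assumed rule: slower average speed during daytime hours, plus noise.
speed_kmh = np.where((hour >= 7) & (hour <= 19), 15.0, 25.0)
duration_s = distance_km / speed_kmh * 3600 * rng.lognormal(0.0, 0.2, n)

X = np.column_stack([distance_km, hour])
y_log = np.log1p(duration_s)  # train on the log scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# RMSE on log1p targets equals RMSLE on the original durations.
pred_log = model.predict(X_test)
rmsle_score = float(np.sqrt(mean_squared_error(y_test, pred_log)))

# Baseline for comparison: always predict the mean log duration.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmsle = float(np.sqrt(mean_squared_error(y_test, baseline_pred)))

# Convert predictions back to seconds.
pred_seconds = np.expm1(pred_log)
print(f"model RMSLE: {rmsle_score:.3f}, baseline RMSLE: {baseline_rmsle:.3f}")
```

Comparing against a constant baseline is a quick sanity check that the features carry signal before spending time on hyperparameter tuning.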
Guidelines for contributing to the project, including how to propose improvements and submit pull requests.
The project is distributed under the MIT License. You may freely use, modify, and distribute this code for personal and commercial purposes, provided the original copyright and license notice, which credits the author, is retained in all copies.
Credits to data providers, contributors, and any references used in the development of this project.