Speed Dating Match Prediction

This project focuses on predicting the outcome of speed dating sessions based on the profiles of two individuals. The goal is to develop a recommendation system that can better match people in speed dating events. The task is a binary classification problem, where we aim to predict the probability (ranging from 0 to 1) that a dating session will lead to a successful match.

Dataset

The dataset used in this project is clean but contains a significant amount of missing values. The strategy for handling missing values needs to be tuned. Additionally, the dataset is highly unbalanced, with a majority of the samples being unmatched.

Workflow

The project treats the complete workflow, from data preprocessing to model training, as a single pipeline. The main steps involved in the workflow are:

Data Preprocessing: Handle missing values, perform feature engineering, and address class imbalance.
Model Training: Train and tune machine learning models using various hyperparameters and configurations.
Evaluation: Assess model performance using appropriate evaluation metrics.
Model Selection: Select the best-performing model based on evaluation results.

Code Documentation

The project code includes detailed comments to explain each line of code and demonstrate a thorough understanding of the implementation. The code documentation also includes descriptions of the experimental protocol used, preprocessing steps applied, and any additional techniques used for feature engineering or data handling.

Model Tuning and Documentation

To improve the model's performance, the project follows the data science life-cycle for tuning. Different features, hyperparameters, configurations, and even models are explored. Each trial is documented, including the reason for making a specific change, the expected outcome, observed performance, and thoughts on the results.

The project includes at least six different trials, which vary in terms of feature sets, preprocessing techniques, models, and tuning methods. The requirements include covering one grid search trial, one random search trial, and one Bayesian search trial.

Deliverable