- Nasko Apostolov
- Audrey Bertin
- Nelson Evbarunegbe
- Raymond Li
Hyperopt and Optuna are two Python-based tools designed to help with the hyperparameter optimization step of a machine learning pipeline. Both have UI features that allow users to compare the performance of different parameter combinations, and both integrate with standard Python ML libraries, including scikit-learn and TensorFlow.
Official resources for Hyperopt and Optuna are provided below, and more information on these tools (and hyperparameter tuning in general) can be found in the README within the `/experiment` folder of this project.
Hyperopt:
- GitHub Repository: https://github.com/hyperopt/hyperopt
- Documentation: http://hyperopt.github.io/hyperopt/

Optuna:
- GitHub Repository: https://github.com/optuna/optuna
- Website (includes documentation and tutorials): https://optuna.org
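
To give a feel for the two APIs before diving into the resources above, here is a minimal sketch of tuning a scikit-learn random forest with each tool. The data, search ranges, and trial counts are illustrative assumptions, not the settings used in our experiment:

```python
import optuna
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the actual experiment uses the Kaggle datasets in /data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def score(n_estimators, max_depth):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    return cross_val_score(model, X, y, cv=3).mean()


# Optuna: the search space is defined inside the objective ("define-by-run").
def optuna_objective(trial):
    return score(
        trial.suggest_int("n_estimators", 50, 300),
        trial.suggest_int("max_depth", 2, 16),
    )


study = optuna.create_study(direction="maximize")
study.optimize(optuna_objective, n_trials=20)
print("Optuna best:", study.best_params)

# Hyperopt: the search space is declared up front, and fmin minimizes,
# so we negate the cross-validation score.
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 1),
    "max_depth": hp.quniform("max_depth", 2, 16, 1),
}
best = fmin(
    fn=lambda p: -score(int(p["n_estimators"]), int(p["max_depth"])),
    space=space,
    algo=tpe.suggest,
    max_evals=20,
)
print("Hyperopt best:", best)
```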
In this project, we compare the two tools both to one another and to the way many people might perform hyperparameter optimization without a dedicated tool: grid search.
There are three different parts to this comparison:
- A code review of the official repositories
- A UI review where we look at the UI features available in both tools
- An experiment where we run a full machine learning pipeline using each approach (Grid Search, Hyperopt, and Optuna) to compare algorithm speed and performance in terms of improving model scores; a baseline sketch follows this list. For each approach, we test it on two separate datasets, one for classification and one for regression.
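
For reference, the grid search baseline can be expressed with scikit-learn's `GridSearchCV`. The sketch below is illustrative only, using synthetic stand-in data and a hypothetical grid rather than this project's actual experiment code:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data; the experiment itself runs on the cleaned Kaggle datasets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Grid search exhaustively evaluates every combination in the grid, so its
# cost grows multiplicatively with each hyperparameter added, whereas
# Hyperopt and Optuna sample the space adaptively under a fixed trial budget.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```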
Note: In the main folder, there is a `requirements.txt` file that contains a list of all of the packages we installed in the virtual environment used to generate and run the code for this project. You can recreate the environment by installing these packages with `pip install -r requirements.txt`.
The `/data` folder contains the data used in the experiment. There is one dataset for classification, under the `/classification` subfolder, and one dataset for regression, under the `/regression` subfolder.
Both datasets come from Kaggle, at the following links:
- Classification (predicting whether a patient will be readmitted to the hospital based on data from their last hospital visit): https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission
- Regression (predicting CO2 emissions from vehicles based on features of the car, such as engine size and fuel type): https://www.kaggle.com/datasets/debajyotipodder/co2-emission-by-vehicles
In each subfolder, there is a raw version of the data as directly downloaded from Kaggle, as well as a cleaned version, which has been reformatted and simplified into a format that can be processed by a random forest model. There is also a Jupyter notebook that includes all of the code used to convert the data from the original format to the cleaned format.
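
As a rough illustration of what that cleaning involves (the notebook in each subfolder contains the actual steps), a random forest needs numeric inputs, so categorical columns are typically one-hot encoded. The column names below are hypothetical:

```python
import pandas as pd

# Hypothetical raw columns, for illustration only; see the Jupyter notebook
# in each subfolder for the actual cleaning steps applied to the Kaggle data.
raw = pd.DataFrame(
    {
        "fuel_type": ["Z", "D", "Z"],
        "engine_size": [2.0, 3.5, 1.6],
        "co2_emissions": [196, 255, 182],
    }
)

# Random forests in scikit-learn require numeric inputs, so categorical
# columns are one-hot encoded and rows with missing values are dropped.
cleaned = pd.get_dummies(raw, columns=["fuel_type"]).dropna()
print(cleaned)
```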
In the `/experiment` folder, we store all of the code used to run the machine learning pipelines for each of the experimental conditions: Grid Search, Hyperopt, and Optuna.
In the `/report` subfolder, we store all of the files used to generate the midpoint and final reports, including the files used to write the documents, any embedded images, and the PDF outputs.
In addition to this repository, we also have a project Google Drive folder that stores additional items, including the slides shared at the midpoint and final presentations as well as other media such as recordings: https://drive.google.com/drive/folders/1knJQ2KCqIkrWExxXIYlPUy4Rd-sNyZU0?usp=share_link