/Diabetes-Supervised-Machine-Learning-Analysis-And-Prediction

Comparing logistic regression, decision tree, random forest, k-nearest neighbors, and SVMs on binary classification performance metrics.

COGS118A Final Project

Binary Classification: Diabetes Prediction

Comparing logistic regression, decision tree, random forest, k-nearest neighbors, and SVMs

Abstract:

This project addresses the difficulty of accurately diagnosing diabetes in patients. An incorrect diagnosis can have dire consequences, such as additional health complications or even death. Our goal is to train machine learning models that accurately predict whether a patient has diabetes. Our data comprises eight features (age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level) along with each patient's diabetes status: positive or negative. These electronic health records are collected by healthcare providers in hospitals or clinics through surveys, medical records, and laboratory tests. With this data, we will train multiple binary classification algorithms and select the one with the highest sensitivity. We will compare the performance of logistic regression, decision tree, random forest, k-nearest neighbors, and support vector machines (SVMs) to see which best suits our needs. We will measure performance using sensitivity (recall), precision, specificity, ROC-AUC, and precision-recall curves, with a heavy emphasis on high recall, since it is important to detect all positive diabetes cases so that treatment can begin immediately.
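To make the selection criterion concrete, here is a minimal sketch of how sensitivity, precision, and specificity follow from a confusion matrix, and how a model could be picked by highest sensitivity. It is not the project's notebook code: the function names (`confusion_counts`, `binary_metrics`, `select_by_sensitivity`), the model names, and the toy labels are all hypothetical and invented for illustration only.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, TN, FN for binary labels (1 = diabetes positive, 0 = negative)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute sensitivity (recall), precision, and specificity."""
    c = confusion_counts(y_true, y_pred)

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(c["tp"], c["tp"] + c["fn"]),  # share of true positives caught
        "precision": ratio(c["tp"], c["tp"] + c["fp"]),    # share of positive calls that are right
        "specificity": ratio(c["tn"], c["tn"] + c["fp"]),  # share of true negatives caught
    }


def select_by_sensitivity(y_true: Sequence[int], predictions: Dict[str, Sequence[int]]) -> str:
    """Return the name of the model whose predictions have the highest sensitivity."""
    return max(
        predictions,
        key=lambda name: binary_metrics(y_true, predictions[name])["sensitivity"],
    )


# Toy labels and predictions, invented purely for illustration (not project results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_predictions = {
    "model_a": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # conservative: misses two positives
    "model_b": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],  # catches more positives, one false alarm
}

for name, y_pred in toy_predictions.items():
    print(name, binary_metrics(y_true, y_pred))
print("selected:", select_by_sensitivity(y_true, toy_predictions))
```

In this toy case the second model is selected even though its precision is lower, which mirrors the trade-off the abstract accepts: a missed positive case is treated as more costly than a false alarm.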