diabetes_diagnosis

Create a predictive model for diagnosing diabetes using random forest algorithms

Primary language: Jupyter Notebook. License: MIT.

🧑‍⚕️ Development of a Diabetes Diagnosis Algorithm

Introduction to Data Science Course, supervised by Professor Anthony Christidis
November 2023

Overview

This repository contains the project "Development of a Diabetes Diagnosis Algorithm", developed for the Introduction to Data Science course under the supervision of Professor Anthony Christidis. The goal of the project is to build a predictive model for diagnosing diabetes using random forests.


Features

🔍 Predictive Model Development

  • Random Forest: Trained a random forest classifier to predict diabetes diagnosis.
  • Variable Selection: Analyzed the predictors to identify and select the variables most relevant to the diagnosis.
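As a rough illustration of the two steps above, the following sketch trains a random forest and ranks variables by importance. It is not the project's notebook code: it uses Python with scikit-learn as a stand-in, the data is synthetic, the column names are hypothetical, and keeping the top 4 variables is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (the real dataset is not shown here).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
feature_names = [f"var_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the forest and measure accuracy on the held-out split.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)

# Rank variables by impurity-based importance (the importances sum to 1).
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(feature_names[i], float(forest.feature_importances_[i])) for i in order]

# Keep the top-k variables; k = 4 is an arbitrary choice for this sketch.
selected = [name for name, _ in ranked[:4]]
print(f"test accuracy: {test_accuracy:.3f}")
print("selected variables:", selected)
```

Impurity-based importance is only one possible selection criterion; permutation importance is a common alternative when predictors are correlated.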

📉 Dimensionality Reduction

  • PCA (Principal Component Analysis): Applied PCA to reduce the dimensionality of the dataset before modeling, with the aim of improving the model's performance.
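A minimal sketch of this kind of reduction, assuming Python with scikit-learn and synthetic data in place of the project's dataset. The 95% variance threshold is an assumption made for the example, not a value taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some redundant (linearly dependent) columns.
X, _ = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=3, random_state=0
)

# Standardize first: PCA is sensitive to the scale of each variable.
# n_components=0.95 keeps the fewest components explaining at least 95% of the variance.
reducer = make_pipeline(
    StandardScaler(), PCA(n_components=0.95, svd_solver="full")
)
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative variance explained:", cumulative.round(3))
```

Whether the reduced representation actually helps a random forest is an empirical question, so it is worth comparing cross-validated scores with and without the PCA step.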

🛠 Hyperparameter Tuning

  • Hyperparameters Explored: Tuned mtry (number of variables randomly sampled as candidates at each split) and min_n (minimum size of terminal nodes) to improve model accuracy.
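The tuning step can be sketched as a cross-validated grid search. This example is an illustration in Python with scikit-learn, not the project's code. It assumes that mtry corresponds to `max_features` and that min_n, as described above, corresponds to `min_samples_leaf`. The grid values and the synthetic data are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the diabetes data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Assumed mapping: mtry -> max_features, min_n -> min_samples_leaf.
param_grid = {
    "max_features": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

# Evaluate every combination with 5-fold cross-validation on accuracy.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best = search.best_params_
print("best parameters:", best)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
```

A small grid keeps the search cheap; for wider ranges, a randomized search over the same two parameters is a common alternative.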

📊 Visualization

  • Decision Tree Models: Visualized decision trees to identify the variables with the greatest influence on the predicted diagnosis.
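One lightweight way to inspect a tree is to print its split rules as text. The sketch below assumes Python with scikit-learn, synthetic data, and hypothetical variable names. The depth limit of 3 is chosen only to keep the output readable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the diabetes data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"var_{i}" for i in range(X.shape[1])]  # hypothetical names

# A shallow tree is easier to read than a fully grown one.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the tree: one line per split or leaf.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# The variable used at the root split is the first one the tree relies on.
root_variable = feature_names[tree.tree_.feature[0]]
print("root split variable:", root_variable)
```

Variables that appear near the root partition the most samples, which is why they are usually read as the most influential in a single tree.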

Project Structure

1. Data Collection and Preparation

  • Data Source: Used a diabetes dataset from a medical data repository.
  • Preprocessing: Cleaned and preprocessed the data so that the model receives consistent, usable inputs.
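The exact cleaning steps are not described here, so the following is only a generic sketch of a typical preprocessing pipeline, assuming Python with scikit-learn. The small table is made up for illustration, and median imputation followed by standardization is an assumed choice rather than the project's documented procedure.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up table with missing entries (np.nan); not the project's data.
raw = np.array([
    [148.0, 33.6, 50.0],
    [85.0, np.nan, 31.0],
    [np.nan, 23.3, 32.0],
    [89.0, 28.1, 21.0],
])

# Fill each missing value with its column median, then standardize each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
clean = preprocess.fit_transform(raw)
print(clean.round(2))
```

Fitting the pipeline on the training split only, and then applying it to the test split, avoids leaking information from the test data into the imputed values.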

2. Model Development

  • Random Forest Implementation: Developed the predictive model using random forest algorithms.
  • Variable Selection and PCA: Selected important variables and applied PCA for dimensionality reduction.

3. Hyperparameter Tuning

  • Optimal Values for mtry and min_n: Searched over candidate values of mtry and min_n to find the combination that gives the best model performance.

4. Visualization and Analysis

  • Decision Tree Visualization: Visualized decision trees to identify and understand key variables affecting the diagnosis.

License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

  • Professor Anthony Christidis for his guidance and supervision.
  • The Introduction to Data Science course for the opportunity to develop this project.