/Thyroid_disease_Prediction_ML

This project develops an advanced predictive model to identify thyroid disease recurrence using machine learning algorithms. We used a detailed dataset with demographic, medical, and clinical features, and implemented Logistic Regression, Decision Tree, Random Forest, and CatBoost Classifier. Rigorous preprocessing and EDA were performed.

Primary LanguageJupyter Notebook

Title

Advanced Predictive Modeling for Thyroid Disease Recurrence Using Machine Learning Algorithms

Description

This project entails the development of an advanced predictive model to identify the recurrence of thyroid disease using state-of-the-art machine learning algorithms. Leveraging a detailed thyroid dataset comprising demographic information, medical history, and clinical features, we implemented multiple machine learning techniques, including Logistic Regression, Decision Tree, Random Forest, and CatBoost Classifier. The dataset underwent rigorous preprocessing, which involved label encoding binary categorical variables and ordinal encoding other categorical features. After preprocessing, the data was split into training and testing subsets to facilitate model training and evaluation. Each model was trained on the training data to predict the 'Recurred' column and evaluated based on precision and recall scores. The CatBoost Classifier demonstrated superior performance with a notable accuracy score of 97.27%. Furthermore, comprehensive Exploratory Data Analysis (EDA) was conducted to gain insights into the data distribution, feature importance, and potential correlations.

Objective

The primary objective of this project is to construct a reliable and accurate predictive model for detecting the recurrence of thyroid disease in patients. The specific goals are as follows:

  1. Data Preprocessing:

    • Label Encoding: Convert binary categorical variables such as 'Smoking', 'Radiotherapy History', and 'Recurred' from Yes/No to 0/1.
    • Ordinal Encoding: Apply ordinal encoding to ordered categorical variables to make them suitable for machine learning algorithms.
    • Feature Selection: Identify and select the most relevant features for model training.
  2. Exploratory Data Analysis (EDA):

    • Descriptive Statistics: Calculate summary statistics to understand the central tendency, dispersion, and distribution of features.
    • Visualization: Use histograms, box plots, and correlation matrices to visualize data distributions and relationships between features.
    • Feature Importance: Determine the importance of various features using techniques like mutual information scores and correlation analysis.
  3. Model Training and Evaluation:

    • Logistic Regression: Implement logistic regression to establish a baseline model for binary classification.
    • Decision Tree: Develop a decision tree model to capture non-linear relationships in the data.
    • Random Forest: Utilize the ensemble learning approach of random forests to improve prediction accuracy and reduce overfitting.
    • CatBoost Classifier: Apply CatBoost, an advanced gradient boosting algorithm, known for its superior handling of categorical features and high accuracy.
  4. Performance Metrics:

    • Accuracy: Measure the overall correctness of the model's predictions.
    • Precision: Calculate the ratio of true positive predictions to the sum of true positive and false positive predictions.
    • Recall: Calculate the ratio of true positive predictions to the sum of true positive and false negative predictions.
    • F1 Score: Compute the harmonic mean of precision and recall to balance both metrics.
  5. Model Comparison and Selection:

    • Compare the performance of all implemented models using precision, recall, and F1 score.
    • Select the CatBoost Classifier as the final model based on its highest accuracy score of 97.27% and superior performance in other evaluation metrics.
  6. Final Model Deployment:

    • Prepare the CatBoost Classifier for deployment by validating it on unseen test data and ensuring its robustness and reliability in predicting thyroid disease recurrence.

This detailed and technical approach ensures a comprehensive understanding and application of machine learning techniques to predict thyroid disease recurrence effectively.