/sc1015_mini_proj

SC1015 Mini-Project (AY21/22 Semester 2)

This project was completed in partial fulfilment of the module SC1015 Introduction to Data Science & Artificial Intelligence.

It was done by SC8 Team 04, which consists of:

  1. Choo Jin Cheng (U2121190C)
  2. Chua Min Min (U2121126G)
  3. Poh Shi Qian (U2122452J)

Date completed: 24 April 2022

Below is just a summary of our project. For more information, please read mini_project.ipynb.


Real-life Problem

Strokes are often caused by an unhealthy lifestyle and other health problems. Are there any unconventional causes?

According to the World Stroke Organisation (n.d.), stroke is a "leading cause of death and disability globally". In 2019 alone, 6.6 million people died from strokes of varying severity (American Heart Association, 2021).

While age and chronic health conditions such as heart disease are commonly known to increase a person's chance of having a stroke, there might be unconventional factors that lead an otherwise healthy person to have one. Hence, this project aims to uncover correlations, if any, between unconventional factors such as marital status and a person's chance of having a stroke.


Data Science Question

Do unconventional features help to better predict whether a person will have, or has already had, a stroke?

This is a Classification problem. Our goal is to find out if there is any unconventional feature that makes one more likely to get a stroke.


Dataset

The dataset was obtained from Kaggle. It has the following fields:

  1. id: unique identifier
  2. gender: "Male", "Female" or "Other"
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: "No" or "Yes"
  7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
  8. Residence_type: "Rural" or "Urban"
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
  12. stroke: 1 if the patient had a stroke or 0 if not
    *Note: "Unknown" in smoking_status means that the information is unavailable for this patient
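
As a rough illustration of the field list above, the sketch below models one row as a typed record and parses it from CSV text. The class name `PatientRecord`, the helper `parse_row`, and the sample rows are hypothetical and are not taken from the notebook or the dataset. Treating an empty string or "N/A" as a missing bmi is also an assumption.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientRecord:
    """One row of the dataset, typed according to the field list above."""
    id: int
    gender: str
    age: float
    hypertension: int        # 0 or 1
    heart_disease: int       # 0 or 1
    ever_married: str        # "No" or "Yes"
    work_type: str
    Residence_type: str      # "Rural" or "Urban"
    avg_glucose_level: float
    bmi: Optional[float]     # None when the value is missing (assumption)
    smoking_status: str      # may be "Unknown"
    stroke: int              # 1 if the patient had a stroke, else 0


def parse_row(row: dict) -> PatientRecord:
    """Convert one CSV row (all values are strings) into a typed record."""
    bmi_text = row["bmi"].strip()
    # Assumption: a missing BMI appears as an empty string or "N/A".
    bmi = None if bmi_text in ("", "N/A") else float(bmi_text)
    return PatientRecord(
        id=int(row["id"]),
        gender=row["gender"],
        age=float(row["age"]),
        hypertension=int(row["hypertension"]),
        heart_disease=int(row["heart_disease"]),
        ever_married=row["ever_married"],
        work_type=row["work_type"],
        Residence_type=row["Residence_type"],
        avg_glucose_level=float(row["avg_glucose_level"]),
        bmi=bmi,
        smoking_status=row["smoking_status"],
        stroke=int(row["stroke"]),
    )


# Made-up sample rows, for illustration only (not real patients).
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,Female,61,0,0,Yes,Self-employed,Rural,120.5,N/A,never smoked,0
2,Male,45,1,0,No,Private,Urban,98.2,27.4,Unknown,0
"""

records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0].bmi, records[1].work_type)
```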

Unconventional variables

  • ever_married
  • work_type
  • Residence_type

What did we do in this project?

  1. Exploratory Data Analysis on the features
     • Plotting of graphs
     • Statistical summaries
     • Simple calculations
     • Correlation checks
  2. Data Cleaning and Preparation
     • Removal of rows/columns
     • Replacement of values
     • Encoding (Label/One-Hot)
  3. Post-cleaning Work
     • Correlation checks
     • Feature Selection (SelectKBest)
  4. Machine Learning
     • k-Nearest Neighbors
     • XGBoost
     • Artificial Neural Network
     • Naive Bayes
  5. Conclusion
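
To make the encoding step in the list above concrete, here is a minimal, dependency-free sketch of label encoding and one-hot encoding. The helper names `label_encode` and `one_hot_encode` are hypothetical, the sample values are made up, and the notebook itself may well rely on library routines instead (for example `pandas.get_dummies` or scikit-learn's encoders).

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


# Binary feature: a single integer column is enough.
ever_married = ["Yes", "No", "Yes"]
codes, mapping = label_encode(ever_married)
print(codes, mapping)

# Non-binary feature: one indicator column per category avoids
# implying an order between the work types.
work_type = ["Private", "Self-employed", "children", "Private"]
rows, categories = one_hot_encode(work_type)
print(categories)
for row in rows:
    print(row)
```

Label encoding suits binary features such as ever_married, while one-hot encoding is the usual choice for features such as work_type, where integer codes would suggest a ranking that does not exist.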


New things we tried!

  1. Chi-square test - to check for association between categorical features
  2. One-Hot encoding - to encode categorical features with more than two categories
  3. SelectKBest feature selection - to provide insight into variable importance
  4. Synthetic Minority Over-sampling Technique (SMOTE) - to compensate for our heavily imbalanced data
  5. k-Nearest Neighbors - model
  6. XGBoost - model
  7. Artificial Neural Network - model
  8. Naive Bayes - model
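
As an illustration of the first item in the list above, the sketch below computes a chi-square test of independence by hand for a 2x2 contingency table. The counts are invented for the example and do not come from the dataset. The function names are hypothetical, and the notebook may use a library routine such as `scipy.stats.chi2_contingency` instead.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a contingency table.

    `table` is a list of rows of observed counts, e.g. rows are the
    categories of one feature and columns the categories of another.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two features were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: rows = ever_married (No, Yes), columns = stroke (0, 1).
observed = [
    [90, 10],
    [60, 40],
]

statistic, dof = chi_square_independence(observed)
p_value = p_value_one_dof(statistic)  # valid here because dof == 1
print(statistic, dof, p_value < 0.05)
```

A small p-value suggests the two categorical features are not independent. This hand-written version applies no continuity correction, so on 2x2 tables it can differ from library routines that apply one by default.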

Conclusion

  • Naive Bayes is the most suitable model for this dataset
  • Unconventional features can help to better predict whether a person will have, or has already had, a stroke
  • 'work_type' is the most significant unconventional feature, followed by 'ever_married' and 'Residence_type'

References: