/cardioML

Machine Learning and Data Science project on Cardio dataset from Kaggle: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset


In-Depth Analysis Using Multiple ML Models of the Likelihood of an Individual Developing Cardiovascular Disease

The dataset was obtained from Kaggle: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

The notebook can be viewed online at https://nbviewer.org/github/garrysjh/IE0005_Final_Project/blob/main/Cardio_DSAI2.ipynb

Prompt:

The number of deaths from cardiovascular disease (CVD) increased by 42.4% from 1990 to 2015. In 2017, CVD led to over 17 million deaths, 330 million years of life lost, and 35.6 million years lived with disability worldwide. CVD is also projected to cause more than 23 million deaths worldwide in 2030 (Maedeh, Farid and Masoud, 2021).

Source: https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-10429-0

Hence, our group chose a dataset on cardiovascular disease. We use machine learning models to find out which factors are strongly associated with the presence of cardiovascular disease, and then use those models to identify people at early risk so that they can seek treatment early.

The dataset we used is the Cardiovascular Disease Dataset from Kaggle.

Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset?datasetId=107706&sortBy=voteCount

We came up with 2 questions to answer in our project, namely:

How does each of the variables (risk factors) affect the likelihood of having CVD?

Based on a person’s health profile, can we predict whether that person is likely to have CVD?

Along the way, we also examine relationships between individual variables.
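As a rough illustration of the first question, one simple starting point is to rank each variable by its correlation with the CVD label. The sketch below is not taken from the notebook: it runs on made-up synthetic rows, the helper names `make_synthetic_cardio` and `rank_by_correlation` are hypothetical, and the column names (`age_years`, `ap_hi`, `cholesterol`, `smoke`, `cardio`) are assumptions chosen to resemble the Kaggle data.

```python
import numpy as np
import pandas as pd


def make_synthetic_cardio(n=2000, seed=0):
    """Generate made-up rows for illustration only; this is NOT the Kaggle data."""
    rng = np.random.default_rng(seed)
    age_years = rng.uniform(30, 65, n)
    ap_hi = rng.normal(125, 15, n)          # assumed systolic blood pressure column
    cholesterol = rng.integers(1, 4, n)     # assumed ordinal levels 1..3
    smoke = rng.integers(0, 2, n)           # binary flag with no effect in this toy model
    # Toy relationship (an assumption): risk rises with age, blood pressure, cholesterol.
    logit = 0.05 * (age_years - 47.5) + 0.04 * (ap_hi - 125) + 0.3 * (cholesterol - 2)
    prob = 1.0 / (1.0 + np.exp(-logit))
    cardio = (rng.random(n) < prob).astype(int)
    return pd.DataFrame({
        "age_years": age_years,
        "ap_hi": ap_hi,
        "cholesterol": cholesterol,
        "smoke": smoke,
        "cardio": cardio,
    })


def rank_by_correlation(df, target="cardio"):
    """Return each feature's Pearson correlation with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


if __name__ == "__main__":
    print(rank_by_correlation(make_synthetic_cardio()))
```

Correlation only captures linear, one-variable-at-a-time association, so it is a first look rather than an answer; the models listed below can capture interactions between variables.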

Machine Learning Models Used:

3.1 Logistic Regression

3.2 Decision Tree

3.3 Random Forest

3.4 XGBoost

3.5 Naive Bayes

3.6 KNN
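The model list above could be compared along the following lines. This is a minimal sketch using scikit-learn on synthetic data from `make_classification`, not the notebook's actual code or results; the hyperparameters, the helper names `build_models` and `compare_models`, and the choice of accuracy as the metric are all assumptions. XGBoost is included only if the `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=0):
    """Return one estimator per model family; hyperparameters are illustrative guesses."""
    models = {
        # Scaling matters for logistic regression and KNN, so wrap them in pipelines.
        "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Naive Bayes": GaussianNB(),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    }
    try:
        from xgboost import XGBClassifier  # optional dependency
        models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=seed)
    except ImportError:
        pass  # skip XGBoost when the package is not available
    return models


def compare_models(X, y, seed=0):
    """Fit every model on one train split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the real feature matrix and CVD label.
X, y = make_classification(n_samples=1000, n_features=11, n_informative=5, random_state=0)
scores = compare_models(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

A single train/test split is used here for brevity; cross-validation and metrics such as recall may be more appropriate when the goal is to avoid missing people at risk.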