Author: Nicholas Orgel
Description: This dataset compiles records of patients of various ages, along with several health and lifestyle variables and whether each patient has had a stroke.
Problem: Find the best model and methods for predicting a patient's stroke likelihood from the given conditions, including age, work status, residency, body mass index (BMI), average glucose level, smoking habits, marital status, and history of hypertension.
# Exploratory Data Visualizations
Pictured above is a correlation heatmap of all of the features relevant to the prediction of a stroke. The darker the shade of green, the stronger the correlation.

The features that correlate most strongly with other features, or with the stroke outcome itself, are:
- Age
- Ever Married
- BMI
- Smoking Status
- Heart Disease
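
To make the heatmap values concrete, here is a minimal sketch of how a pairwise correlation matrix can be computed. It assumes Pearson correlation and numerically encoded columns; the column names and the toy rows are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy rows, for illustration only (not real patient data).
toy = {
    "age": [25, 40, 55, 70, 80],
    "ever_married": [0, 1, 1, 1, 1],
    "stroke": [0, 0, 0, 1, 1],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of a heatmap like the one above corresponds to one entry of such a matrix. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.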
## Correlation Visualizations
- There is a moderate correlation between a patient's age and their marital status. The line plot above shows, for each age, the average percentage of patients in the dataset who are married.
- Many patients start to get married in their 20s and 30s.
- Nearly half of the patients in the dataset are married by the time they reach 30 years of age.
- By age 65, the average percentage of patients who are married is close to 100%, followed by a slight decrease at age 70.
- The percentage then falls to around 85%, where one can assume that people in this age group no longer wish to be married or have never been married at all.
The next line plot shows the relationship between a patient's age and their probability of having a stroke.
- As a person gets older, their likelihood of having a stroke increases.
- However, in the early 60s the average percentage drops to almost 0%. The assumption here is that, for a short time, patients of this age pay closer attention to their overall health.
- The age at which a stroke is most likely is around 80, with a probability of about 25%.
- While the line plot suggests that infants are at high risk of a stroke, this comes from only 13 values in the age column that are below 20 years of age. It does not indicate a significant risk at early developmental stages.
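
The averages in these line plots can be sketched as a grouped mean of a 0/1 column, reported together with the number of patients in each group. This is a minimal illustration, assuming 10-year age bins (the original plots may group differently) and hypothetical toy values. Keeping the count next to each rate makes it easier to spot groups, such as the youngest patients, where a handful of rows drives the percentage.

```python
from collections import defaultdict


def rate_by_age_bin(ages, flags, bin_width=10):
    """Return {bin_start: (mean_of_flag, n_patients)} for each age bin.

    `flags` holds 0/1 values, e.g. stroke or ever-married indicators.
    """
    totals = defaultdict(lambda: [0, 0])  # bin_start -> [sum of flags, count]
    for age, flag in zip(ages, flags):
        bin_start = int(age // bin_width) * bin_width
        totals[bin_start][0] += flag
        totals[bin_start][1] += 1
    return {b: (s / n, n) for b, (s, n) in sorted(totals.items())}


# Hypothetical toy values, for illustration only (not real patient data).
ages = [1, 34, 38, 52, 61, 67, 78, 79, 81, 82]
stroke = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]

summary = rate_by_age_bin(ages, stroke)
for bin_start, (rate, n) in summary.items():
    print(f"age {bin_start}-{bin_start + 9}: rate={rate:.0%}, patients={n}")
```

In this toy data the 0-9 bin shows a 100% rate, but it contains a single patient, so the rate says little about the real risk for that age group.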
The line plot above compares patients' age and BMI against the likelihood of having a stroke. There seems to be no major relationship, as only a small minority of patients in the 60-80 age group have had a stroke.
Heart disease seems to have no major impact on the relationship between a patient's age and their likelihood of having a stroke.
## Preprocessing and Model Tuning
When trying to predict the likelihood of a patient having a stroke, various models can be compared to determine which is the most accurate and whether it is useful enough to put into production for the business.
When choosing a model, the focus should be on the True Positives and True Negatives in a Confusion Matrix Display, which takes the shape of a 2x2 grid. True Negatives are displayed in the Top Left and True Positives in the Bottom Right. The other numbers on display are False Positives (Top Right), where the model predicts that a patient has had a stroke when in actuality they have not, and the inverse, False Negatives (Bottom Left), where the model predicts no stroke for a patient who has had one.
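
The four cells of the grid can be counted directly from the true and predicted labels. The sketch below is a minimal illustration, assuming labels are encoded as 0 (no stroke) and 1 (stroke) and assuming the layout used by scikit-learn's confusion matrix with the negative class first: `[[TN, FP], [FN, TP]]`. The label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return the 2x2 grid [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # stroke correctly predicted
        elif actual == 0 and predicted == 0:
            tn += 1  # no stroke correctly predicted
        elif actual == 0 and predicted == 1:
            fp += 1  # stroke predicted, but the patient had none
        else:
            fn += 1  # no stroke predicted, but the patient had one
    return [[tn, fp], [fn, tp]]


# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1]

grid = confusion_counts(y_true, y_pred)
print(grid[0])  # top row:    [TN, FP]
print(grid[1])  # bottom row: [FN, TP]
```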
- Models are also chosen using a classification report, which summarizes how well the model fits the problem it is presented with.
- Recall is especially important in this case. It is the percentage of actual stroke cases that the model correctly identifies: True Positives divided by the sum of True Positives and False Negatives. The higher the recall, the fewer strokes the model misses.
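
Under the standard definition, recall is TP / (TP + FN), while accuracy is (TP + TN) divided by all predictions. The sketch below contrasts the two on hypothetical counts for a model that never predicts a stroke; the numbers are invented for illustration and are not results from this project.

```python
def recall(tp, fn):
    """Share of actual positives that the model caught: TP / (TP + FN)."""
    positives = tp + fn
    return tp / positives if positives else 0.0


def accuracy(tn, fp, fn, tp):
    """Share of all predictions that were correct: (TP + TN) / total."""
    total = tn + fp + fn + tp
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 100 patients, 5 of whom had a stroke,
# and a model that predicts "no stroke" for everyone.
tn, fp, fn, tp = 95, 0, 5, 0

acc = accuracy(tn, fp, fn, tp)
rec = recall(tp, fn)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

When stroke cases are a small minority, accuracy can look high even if every stroke is missed, which is why recall is worth checking alongside it.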
On first inspection, two models stood out because they had the lowest number of False Positives and False Negatives and the highest recall scores: Logistic Regression and AdaBoost.
- These models can later be adjusted or "tuned" to be even more accurate. Once this is done, a decision is made on the best model and whether it is viable for production.
## Best Tuned Model For Recommendation
Both the Logistic Regression and AdaBoost models were "tuned" with the goal of increasing the accuracy, or 'accuracy_score'.
The model that ultimately performed best was the tuned AdaBoost model. It was the most accurate and led to the lowest number of False Positives and False Negatives, each of which would mean a misdiagnosis for the patient, and therefore to the least chance of error. This would be the best model to recommend to a business such as a hospital to better guide patients on how to maintain a healthy lifestyle as they get older. Doctors could use it to make patients more aware of how their heart condition, BMI, and other factors contribute to the possibility of having a stroke.