Brain_Stroke_Prediction_Using_Machine_Learning

Stroke is a disease that affects the arteries leading to and within the brain. A stroke occurs when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot or ruptures. According to the WHO, stroke is the 2nd leading cause of death worldwide. Globally, subarachnoid hemorrhage accounts for about 3% of strokes, intracerebral hemorrhage for about 10%, and ischemic stroke for the majority at 87%. Around 80% of strokes are preventable, so proper education on the signs of stroke is very important. Early detection of stroke is a crucial step for efficient treatment, and Machine Learning (ML) can be of great value in this process, helping health professionals make clinical decisions and predictions. Over the past few decades, several studies have been conducted on improving the accuracy and speed of stroke diagnosis using ML. However, existing research is limited both in predicting whether a stroke will occur and in identifying the risk factors associated with the various types of stroke. In this work, Machine Learning techniques including Random Forest, KNN, XGBoost, CatBoost and Naive Bayes have been used for prediction. Our work also determines the importance of the features available in the dataset. Our contribution can help predict early signs of this deadly disease and support its prevention.


Brain Stroke Prediction

Kaggle dataset link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset


Libraries used for dataset processing

  • Numpy
  • Pandas

Libraries used for graphical representation

  • Matplotlib
  • Seaborn

Libraries used for Scaling and Oversampling

  • sklearn.preprocessing
  • imblearn
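
The exact imports are not listed here, so the following is a minimal sketch of the imports assumed by the steps below; StandardScaler is only one possible choice of scaler from sklearn.preprocessing, as the actual scaler used is not specified.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler   # assumed scaler; the notebook may use a different one
from imblearn.over_sampling import RandomOverSampler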

PREPROCESSING


  • Removed the id column to reduce dimensionality, since it did not add any insight to the data analysis.
df = df.drop(['id'], axis=1)
  • Checked the count of NULL values among the attributes of the dataset.
print(df.isna().sum())
  • Only the BMI attribute had NULL values.
  • Plotted the BMI value distribution; it looked skewed, so the missing values were imputed with the median.
  • The records with missing BMI were not dropped, because the dataset is highly skewed on the target attribute (stroke) and a good portion of the missing BMI values belonged to positive stroke records.
  • The dataset was skewed because only a few records had a positive value for the stroke target attribute.

  • In the gender attribute there were three categories: Male, Female and Other. There was only one record of type "Other", so it was converted to the majority category to keep the encoding compact.

  • Most of the attributes in the dataset held binary values; the numeric binary values were converted into string values for dummy encoding.

    • Dummy encoding is similar to one-hot encoding: the values in the encoded columns are 1/0, and additional attributes/columns are created.
  • Random oversampling was done on the dataset to balance the skew in the target attribute (a combined preprocessing sketch follows this list).

    • This boosts the number of records in the minority class.
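
A minimal sketch of the preprocessing steps above. The CSV file name and the column names (gender, bmi, hypertension, heart_disease, stroke) are assumptions based on the linked Kaggle dataset, and Female is assumed to be the majority gender class.
df = pd.read_csv('healthcare-dataset-stroke-data.csv')                        # hypothetical file name
df = df.drop(['id'], axis=1)                                                  # remove the id column
print(df.isna().sum())                                                        # count NULL values per attribute
df['bmi'] = df['bmi'].fillna(df['bmi'].median())                              # median imputation for the skewed BMI distribution
df['gender'] = df['gender'].replace('Other', 'Female')                        # merge the single 'Other' record into the majority class
df[['hypertension', 'heart_disease']] = df[['hypertension', 'heart_disease']].astype(str)   # numeric bins to string bins
df = pd.get_dummies(df, drop_first=True)                                      # dummy encoding of the categorical attributes
X, y = df.drop('stroke', axis=1), df['stroke']
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)                                         # random oversampling of the minority class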

EDA - Exploratory Data Analysis


  • Plotted each attribute (pie charts and histograms) to analyse any trends.
  • Plotted the target attribute against the other attributes to look for correlations.
  • Plotted a heatmap of the correlation between the attributes (a minimal plotting sketch follows this list).
    • The heatmap showed very little correlation between the attribute values.
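
A minimal plotting sketch for the EDA described above, assuming the dataframe from the preprocessing sketch; the columns shown are only examples.
df['stroke'].value_counts().plot(kind='pie', autopct='%1.1f%%')               # pie plot of the target attribute
plt.show()
df['bmi'].plot(kind='hist', bins=30)                                          # histogram of an example attribute
plt.show()
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')                      # correlation heatmap between attributes
plt.show()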

MODEL BUILDING


  • Created an 80-20 train and test split of the oversampled dataset, as sketched below.
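
A sketch of the split, assuming the oversampled X_res/y_res from the preprocessing sketch.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)   # 80-20 split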

Applied various machine learning models for predictive analysis (a fitting sketch follows the list):

  1. Decision tree
  2. KNN
  3. XG-Boost
  4. Random forest
  5. Logistic regression
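
A minimal fitting sketch for the five models; the hyperparameters shown are assumptions, not the values used in the notebook (XGBClassifier requires the xgboost package).
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    'Decision tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'XG-Boost': XGBClassifier(),
    'Random forest': RandomForestClassifier(random_state=42),
    'Logistic regression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)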

Analysed the results using the confusion matrix (accuracy, precision and recall), plotted the ROC curves and computed the AUC scores.
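
A sketch of the evaluation loop, assuming the fitted models dictionary from the sketch above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, roc_curve, roc_auc_score)

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))                                   # confusion matrix on the test split
    print('accuracy :', accuracy_score(y_test, y_pred))
    print('precision:', precision_score(y_test, y_pred))
    print('recall   :', recall_score(y_test, y_pred))
    y_prob = model.predict_proba(X_test)[:, 1]                                # probability of the positive (stroke) class
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, y_prob):.3f})")
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()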

Accuracies calculated:

  1. Decision tree : 97.89%
  2. KNN : 97.22%
  3. XG-Boost : 97.48%
  4. Random forest : 99.48%
  5. Logistic regression : 76.34%

Chosen model - RANDOM FOREST

Results were validated using k-fold cross-validation (20 splits) to check for overfitting; a minimal sketch follows the result below.

  • Random Forest accuracy: 95.01%
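
A minimal k-fold validation sketch for the chosen Random Forest model, assuming 20 splits on the oversampled data.
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=20, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X_res, y_res, cv=cv, scoring='accuracy')
print(scores.mean())                                                          # mean cross-validated accuracy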