/Heart-Failure-Prediction

Study of 12 clinical features for predicting death events


Heart Failure Prediction - Statistical Study

Author: Gabriel Espinola Lincoln Ferreira dos Santos

Heart failure (HF), also known as congestive heart failure (CHF), decompensatio cordis (DC), and congestive cardiac failure (CCF), occurs when the heart is unable to pump sufficiently to maintain the blood flow the body needs. This study performs a statistical analysis of heart failure on a dataset containing 12 features that can be used to predict mortality, and compares the findings with published articles.

Definition of done: Create a model for predicting mortality caused by Heart Failure.

References (accessed through Periódicos CAPES):

  1. MCMURRAY, John; PONIKOWSKI, Piotr. Heart Failure Not Enough Pump Iron? Glasgow, Scotland, United Kingdom and Wroclaw, Poland.

  2. Am Heart J. Clinical predictors of heart failure in patients with first acute myocardial infarction.

  3. ALI, Abbas S. Clinical predictors of heart failure in patients with first acute myocardial infarction.

  4. GOMES, Marilia B. Impact of Diabetes on Cardiovascular Disease: An Update.

  5. Creatine phosphokinase test: https://www.mountsinai.org/health-library/tests/creatine-phosphokinase-test

  6. AKHTER, Mohammed W. Effect of Elevated Admission Serum Creatinine and Its Worsening on Outcome in Hospitalized Patients With Decompensated Heart Failure.

  7. MATHEUS, Alessandra. Impact of Diabetes on Cardiovascular Disease: An Update.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('Dataset.csv')
df.head(15)
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
0 75.0 0 582 0 20 1 265000.00 1.9 130 1 0 4 1
1 55.0 0 7861 0 38 0 263358.03 1.1 136 1 0 6 1
2 65.0 0 146 0 20 0 162000.00 1.3 129 1 1 7 1
3 50.0 1 111 0 20 0 210000.00 1.9 137 1 0 7 1
4 65.0 1 160 1 20 0 327000.00 2.7 116 0 0 8 1
5 90.0 1 47 0 40 1 204000.00 2.1 132 1 1 8 1
6 75.0 1 246 0 15 0 127000.00 1.2 137 1 0 10 1
7 60.0 1 315 1 60 0 454000.00 1.1 131 1 1 10 1
8 65.0 0 157 0 65 0 263358.03 1.5 138 0 0 10 1
9 80.0 1 123 0 35 1 388000.00 9.4 133 1 1 10 1
10 75.0 1 81 0 38 1 368000.00 4.0 131 1 1 10 1
11 62.0 0 231 0 25 1 253000.00 0.9 140 1 1 10 1
12 45.0 1 981 0 30 0 136000.00 1.1 137 1 0 11 1
13 50.0 1 168 0 38 1 276000.00 1.1 137 1 0 11 1
14 49.0 1 80 0 30 1 427000.00 1.0 138 0 0 12 0
df.isnull().values.any()
False
total = df.shape[0]
print("total of patients %s"%(total))
total of patients 299
total_death = df[df['DEATH_EVENT'] == 1].shape[0]

The dataset has 13 columns (12 features plus the DEATH_EVENT target) for 299 patients.

df.describe()
age anaemia creatinine_phosphokinase diabetes ejection_fraction high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time DEATH_EVENT
count 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.00000 299.000000 299.000000 299.00000 299.000000 299.00000
mean 60.833893 0.431438 581.839465 0.418060 38.083612 0.351171 263358.029264 1.39388 136.625418 0.648829 0.32107 130.260870 0.32107
std 11.894809 0.496107 970.287881 0.494067 11.834841 0.478136 97804.236869 1.03451 4.412477 0.478136 0.46767 77.614208 0.46767
min 40.000000 0.000000 23.000000 0.000000 14.000000 0.000000 25100.000000 0.50000 113.000000 0.000000 0.00000 4.000000 0.00000
25% 51.000000 0.000000 116.500000 0.000000 30.000000 0.000000 212500.000000 0.90000 134.000000 0.000000 0.00000 73.000000 0.00000
50% 60.000000 0.000000 250.000000 0.000000 38.000000 0.000000 262000.000000 1.10000 137.000000 1.000000 0.00000 115.000000 0.00000
75% 70.000000 1.000000 582.000000 1.000000 45.000000 1.000000 303500.000000 1.40000 140.000000 1.000000 1.00000 203.000000 1.00000
max 95.000000 1.000000 7861.000000 1.000000 80.000000 1.000000 850000.000000 9.40000 148.000000 1.000000 1.00000 285.000000 1.00000
df.mean()
age                             60.833893
anaemia                          0.431438
creatinine_phosphokinase       581.839465
diabetes                         0.418060
ejection_fraction               38.083612
high_blood_pressure              0.351171
platelets                   263358.029264
serum_creatinine                 1.393880
serum_sodium                   136.625418
sex                              0.648829
smoking                          0.321070
time                           130.260870
DEATH_EVENT                      0.321070
dtype: float64
df.median()
age                             60.0
anaemia                          0.0
creatinine_phosphokinase       250.0
diabetes                         0.0
ejection_fraction               38.0
high_blood_pressure              0.0
platelets                   262000.0
serum_creatinine                 1.1
serum_sodium                   137.0
sex                              1.0
smoking                          0.0
time                           115.0
DEATH_EVENT                      0.0
dtype: float64
df.max()
age                             95.0
anaemia                          1.0
creatinine_phosphokinase      7861.0
diabetes                         1.0
ejection_fraction               80.0
high_blood_pressure              1.0
platelets                   850000.0
serum_creatinine                 9.4
serum_sodium                   148.0
sex                              1.0
smoking                          1.0
time                           285.0
DEATH_EVENT                      1.0
dtype: float64

CORRELATION ANALYSIS TO UNDERSTAND THE INFLUENCE OF EACH FEATURE

f,ax = plt.subplots(figsize=(15, 15))
sns.heatmap(df.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()

[Figure: correlation heatmap of all dataset columns]
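The heatmap is dense, so a ranked list of each column's correlation with the target can be easier to read. This is a minimal sketch, assuming a DataFrame shaped like `df`; the small `toy` frame and its values are made up for illustration only.

```python
import pandas as pd

# Made-up stand-in for df: a few rows with the same kind of columns
toy = pd.DataFrame({
    "age":               [75, 55, 65, 50, 65, 90, 45, 60],
    "ejection_fraction": [20, 38, 20, 60, 50, 25, 60, 35],
    "time":              [4, 6, 7, 200, 210, 8, 250, 180],
    "DEATH_EVENT":       [1, 1, 1, 0, 0, 1, 0, 0],
})

# Pearson correlation of every column with the target, strongest first
corr_with_target = (
    toy.corr()["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

On the real data the same expression would be applied to `df` instead of `toy`.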

Hypotheses and application from articles

Clinical predictors of heart failure in patients with first acute myocardial infarction - "Predictors of early heart failure include previous medical conditions and age. The second peak occurrence can be predicted by similar characteristics in addition to increased peak creatine phosphokinase level, decreased left ventricular ejection fraction, and increased heart rate" (Am Heart J 1999;138:1133-9.)

Considerations:

Normal values:

  • Creatine phosphokinase: 2 - 210 mcg/L

  • Ejection fraction: >= 50 %

  • Medical follow-up: >= 60 days

For this dataset, do creatine phosphokinase (increased) and ejection fraction (decreased) deviate from healthy values in the way the article describes?
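One way to make the three conditions explicit is to turn each threshold into a boolean flag before filtering. This is a minimal sketch, not the notebook's own code: the constant names and the `toy` rows are hypothetical, and the thresholds are the ones listed in this section rather than clinical guidance.

```python
import pandas as pd

# Thresholds as stated in this section (not clinical guidance)
CPK_UPPER = 210      # creatine phosphokinase upper normal bound, mcg/L
EF_LOWER = 50        # ejection fraction lower normal bound, %
FOLLOW_UP_MIN = 60   # follow-up cut-off, days

# Made-up rows for illustration only
toy = pd.DataFrame({
    "creatinine_phosphokinase": [582, 111, 7861, 146],
    "ejection_fraction":        [20, 55, 38, 60],
    "time":                     [4, 120, 6, 90],
})

flags = pd.DataFrame({
    "cpk_high":        toy["creatinine_phosphokinase"] > CPK_UPPER,
    "ef_low":          toy["ejection_fraction"] < EF_LOWER,
    "short_follow_up": toy["time"] < FOLLOW_UP_MIN,
})
# A patient matches the hypothesis only if all three conditions hold
flags["all_three"] = flags.all(axis=1)
print(flags)
```

Named flags make it harder to mix up the direction of a comparison when the same filter is repeated for several groups.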

import seaborn as sns
g1 = sns.pairplot(df,vars= ['creatinine_phosphokinase','ejection_fraction','time'], hue= 'DEATH_EVENT',markers=["o", "X" ],palette='dark')

[Figure: pair plot of creatinine_phosphokinase, ejection_fraction and time, colored by DEATH_EVENT]

% of death events among patients with creatine phosphokinase above normal, ejection fraction below normal and medical follow-up under the standard 60 days:

H1 = df[['creatinine_phosphokinase', 'ejection_fraction','DEATH_EVENT','time']][(df['creatinine_phosphokinase'] > 210) & (df['ejection_fraction'] < 50) & (df['time'] < 60)]
H1['DEATH_EVENT'].mean()
0.92

% of death events among patients with creatine phosphokinase above normal, ejection fraction below normal and medical follow-up of at least 60 days:

H1 = df[['creatinine_phosphokinase', 'ejection_fraction','DEATH_EVENT','time']][(df['creatinine_phosphokinase'] > 210) & (df['ejection_fraction'] < 50) & (df['time'] >= 60)]
H1['DEATH_EVENT'].mean()
0.21367521367521367

Going a little deeper...

H1 = df[['creatinine_phosphokinase', 'ejection_fraction','DEATH_EVENT']][(df['creatinine_phosphokinase'] > 210) & (df['ejection_fraction'] < 50) & (df['DEATH_EVENT'] == 1)]
deathCr = H1['DEATH_EVENT'].count()
H1 = df[['creatinine_phosphokinase', 'ejection_fraction','DEATH_EVENT']][(df['creatinine_phosphokinase'] > 210) & (df['ejection_fraction'] < 50)]

print('%s patients of 299 had abnormal values for both features, representing %s percent of patients and %s percent of total deaths.'%(H1.shape[0],round(((H1.shape[0]*100)/total),2), (deathCr*100)/total_death ))
142 patients of 299 had abnormal values for both features, representing 47.49 percent of patients and 50.0 percent of total deaths.

CONCLUSION

In this dataset, creatine phosphokinase and ejection fraction are clearly associated with death events, and a long-term medical follow-up appears to go with a drastically lower chance of death. Among patients with creatine phosphokinase above normal, ejection fraction below normal and less than 60 days of medical follow-up, 92% died. Among patients with the same creatine phosphokinase and ejection fraction profile but 60 or more days of follow-up, only 21% died.

Patients with both abnormal values make up 47.5% of all patients and account for 50% of all deaths in the whole dataset.
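Whether the gap between the two follow-up groups could be due to chance can be checked with a one-sided Fisher's exact test. The sketch below is an addition, not part of the original notebook. The counts are back-calculated from the outputs above (0.92 of 25 patients is 23 deaths; 0.2137 of 117 patients is 25 deaths), which is an assumption worth re-checking against the data.

```python
from math import comb

def fisher_one_sided(deaths_a, n_a, deaths_b, n_b):
    """P(at least deaths_a deaths in group A) under the null hypothesis that
    deaths are spread over both groups at random (hypergeometric tail)."""
    total = n_a + n_b
    total_deaths = deaths_a + deaths_b
    upper = min(n_a, total_deaths)
    tail = sum(
        comb(total_deaths, k) * comb(total - total_deaths, n_a - k)
        for k in range(deaths_a, upper + 1)
    )
    return tail / comb(total, n_a)

# Group A: follow-up < 60 days; group B: follow-up >= 60 days
p_value = fisher_one_sided(deaths_a=23, n_a=25, deaths_b=25, n_b=117)
print(f"one-sided p-value: {p_value:.2e}")
```

A small p-value only says the two groups differ; it does not say why. Follow-up time in this dataset also ends when a patient dies, so a short follow-up can be a consequence of death rather than a cause.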

Role of Diabetes in Congestive Heart Failure - "Men aged 45 to 74 years had more than twice the frequency of congestive heart failure as their nondiabetic cohorts, and diabetic women had a fivefold increased risk."

For this dataset, how much more frequent are death events among diabetic men aged 45 to 74 years than among non-diabetic men? What does the same analysis show for women aged 30 to 62 years?

man   =  df['sex'][(df['sex'] == 1)]
woman =  df['sex'][(df['sex'] == 0)]

m = (man.count()*100)/total
w = (woman.count()*100)/total


import matplotlib.pyplot as plt

labels = 'Men','Women' 
sizes  = [m,w]
explode = (0, 0)  # no slice is offset


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("% Men & Women", bbox={'facecolor':'0.8', 'pad':5})
plt.show()

AM = df[['age','sex']][(df['sex'] == 1) & (df['age'] >= 45) & (df['age'] <= 74)]

labels = 'Hypothesis age (45-74 years)','Outside hypothesis age' 
sizes  = [(AM.count()[0]*100)/man.count(),((man.count()-AM.count()[0])*100)/man.count()]
explode = (0.1, 0)  # offset the first slice


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("% Men in hypothesis age range", bbox={'facecolor':'0.8', 'pad':5})
plt.show()

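[Figure: pie chart of the share of men and women]

[Figure: pie chart of men inside and outside the 45-74 age range]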

g2 = sns.pairplot(df,vars= ['diabetes','age','sex','time'], hue= 'DEATH_EVENT',markers=["o", "X" ],palette='dark')

[Figure: pair plot of diabetes, age, sex and time, colored by DEATH_EVENT]

Number of death events among diabetic men aged 45 to 74 years

H2 = df[['diabetes','age','sex','time','DEATH_EVENT']][(df['diabetes'] == 1) & (df['sex'] == 1) & (df['age'] >=45) & (df['age'] <= 74) & (df['DEATH_EVENT'] == 1)]
diab = H2['DEATH_EVENT'].count()
diab
16

Number of death events among non-diabetic men aged 45 to 74 years

H2 = df[['diabetes','age','sex','time','DEATH_EVENT']][(df['diabetes'] == 0) & (df['sex'] == 1) & (df['age'] >=45) & (df['age'] <= 74) & (df['DEATH_EVENT'] == 1)]
nodiab = H2['DEATH_EVENT'].count()
nodiab
26

Ratio of death counts: diabetic men / non-diabetic men

diab/nodiab
0.6153846153846154
AW = df[['age','sex']][(df['sex'] == 0) & (df['age'] >= 30) & (df['age'] <= 62)]

labels = 'Hypothesis age (30-62 years)','Outside hypothesis age' 
sizes  = [(AW.count()[0]*100)/woman.count(),((woman.count()-AW.count()[0])*100)/woman.count()]
explode = (0, 0.1)  # offset the second slice

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("% Women in hypothesis age range", bbox={'facecolor':'0.8', 'pad':5})
plt.show()

[Figure: pie chart of women inside and outside the 30-62 age range]

Number of death events among diabetic women aged 30 to 62 years

H2 = df[['diabetes','age','sex','time','DEATH_EVENT']][(df['diabetes'] == 1) & (df['sex'] == 0) & (df['age'] >=30) & (df['age'] <= 62) & (df['DEATH_EVENT'] == 1)]
diab = H2['DEATH_EVENT'].count()
diab
14

Number of death events among non-diabetic women aged 30 to 62 years

H2 = df[['diabetes','age','sex','time','DEATH_EVENT']][(df['diabetes'] == 0) & (df['sex'] == 0) & (df['age'] >=30) & (df['age'] <= 62) & (df['DEATH_EVENT'] == 1)]
nodiab = H2['DEATH_EVENT'].count()
nodiab
6

Ratio of death counts: diabetic women / non-diabetic women

diab/nodiab
2.3333333333333335

CONCLUSION

For this dataset, diabetic men aged 45 to 74 years died less frequently (0.9 times) than non-diabetic men, while diabetic women aged 30 to 62 years died more frequently (1.5 times) than non-diabetic women.

The hypothesis does not match the expected values: only diabetic women showed an increase in death events, and it was well below the fivefold frequency reported in the article.
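Note that `diab/nodiab` divides raw death counts, so the result also depends on how many diabetic and non-diabetic patients fall in each age range. A comparison closer to the quoted article divides death rates instead. This is a minimal sketch on made-up rows; the `toy` frame is hypothetical and is not the study data.

```python
import pandas as pd

# Made-up rows for illustration only; columns follow the dataset's naming
toy = pd.DataFrame({
    "diabetes":    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "DEATH_EVENT": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 column is the share of ones, i.e. the death rate per group
death_rate = toy.groupby("diabetes")["DEATH_EVENT"].mean()
rate_ratio = death_rate.loc[1] / death_rate.loc[0]

# Ratio of raw death counts, the quantity diab/nodiab computes
deaths = toy.groupby("diabetes")["DEATH_EVENT"].sum()
count_ratio = deaths.loc[1] / deaths.loc[0]

print(f"rate ratio:  {rate_ratio:.2f}")
print(f"count ratio: {count_ratio:.2f}")
```

In this toy example the two ratios differ (3.0 versus 2.0) only because the groups have different sizes, which is why the denominator matters when comparing against a published frequency.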

Effect of Elevated Admission Serum Creatinine and Its Worsening on Outcome in Hospitalized Patients With Decompensated Heart Failure - "Renal insufficiency (RI), as represented by elevated serum creatinine (>1.5 mg/dl) on admission, is common and found in almost half of patients hospitalized with decompensated heart failure"

Considerations:

Normal values:

  • Serum creatinine: <= 1.5 mg/dL

  • Medical follow-up: >= 60 days

For this dataset, what is the relation between serum creatinine and death events? Is the follow-up time relevant?

g3 = sns.pairplot(df,vars= ['serum_creatinine','time'], hue= 'DEATH_EVENT',markers=["o", "X" ],palette='dark')

[Figure: pair plot of serum_creatinine and time, colored by DEATH_EVENT]

H3 = df[['serum_creatinine','time','DEATH_EVENT']][(df['serum_creatinine'] > 1.5) & (df['time'] < 60) ]
deathCre = H3[(H3['DEATH_EVENT'] == 1)].count()
noDeathCre = H3[(H3['DEATH_EVENT'] == 0)].count()

total3 = H3.count()
#deathCre[0], noDeathCre[0],total3[0]

Total number of patients with serum creatinine above the normal value and follow-up under 60 days:

total3[0]
23

Number of death events among these patients:

deathCre[0] 
21

% of death events in this group:

deathCre[0]*100/total3[0]
91.30434782608695

% of death events for follow-up of 60 days or more:

H3 = df[['serum_creatinine','time','DEATH_EVENT']][(df['serum_creatinine'] > 1.5) & (df['time'] >= 60) ]
deathCre = H3[(H3['DEATH_EVENT'] == 1)].count()
noDeathCre = H3[(H3['DEATH_EVENT'] == 0)].count()

total3 = H3.count()
#deathCre[0], noDeathCre[0],total3[0]
deathCre[0]*100/total3[0]
50.0

Disregarding time, % of death events and total number of patients with serum creatinine > 1.5 mg/dL:

H3 = df[['serum_creatinine','DEATH_EVENT']][(df['serum_creatinine'] > 1.5)]
deathCre = H3[(H3['DEATH_EVENT'] == 1)].count()
noDeathCre = H3[(H3['DEATH_EVENT'] == 0)].count()

total3 = H3.count()
#deathCre[0], noDeathCre[0],total3[0]
deathCre[0]*100/total3[0], total3[0]
(64.17910447761194, 67)

CONCLUSION

Serum creatinine (SC) is clearly relevant for predicting death events in this dataset. With a follow-up time under 60 days, 91.3% (21 of 23) of patients with SC above the normal value died. For patients whose follow-up was 60 days or longer, the death rate drops to 50% (22 of 44 patients). Of all 67 patients with elevated SC, 64% died.
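The groups here are small (23 and 44 patients), so the percentages carry wide uncertainty. One way to show that is a Wilson score interval for each proportion. This is an added sketch, not part of the original analysis; the counts come from the outputs above, and z = 1.96 assumes a 95% level.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

for label, deaths, n in [("follow-up < 60 days", 21, 23),
                         ("follow-up >= 60 days", 22, 44)]:
    low, high = wilson_interval(deaths, n)
    print(f"{label}: {deaths}/{n} = {deaths/n:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

If the two intervals do not overlap, the difference between the groups is unlikely to be explained by sample size alone.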

Anaemia is an independent predictor of poor outcome in patients with chronic heart failure - "Mild anaemia is a significant and independent predictor of poor outcome in unselected patients with CHF."

What is the relevance of anaemia to heart failure?

H4 = df[['anaemia','DEATH_EVENT','time']]

Number of patients with anaemia vs. without anaemia

H4[(H4['anaemia']==1)].count()[0], H4[(H4['anaemia']==0)].count()[0]
(129, 170)

Number of patients with anaemia who died

H4[(H4['anaemia']==1) & (H4['DEATH_EVENT']==1)].count()[0]
46

% of deaths among anaemic patients (deaths / total anaemic patients)

H4[(H4['anaemia']==1) & (H4['DEATH_EVENT']==1)].count()[0]*100/H4[(H4['anaemia']==1)].count()[0]
35.65891472868217
H4[(H4['anaemia']==1) & (H4['time']<= 60)].count()[0], H4[(H4['anaemia']==0) & (H4['time']<= 60)].count()[0]
(33, 30)
H4[(H4['anaemia']==1) & (H4['DEATH_EVENT']==1) & (H4['time']<= 60)].count()[0]
28

GENERAL OVERVIEW

Death by gender

G1 = df[['sex','DEATH_EVENT']]
m   = G1[(G1['sex']==1) & (G1['DEATH_EVENT']==1)]
mdp = m['sex'].count()*100/G1[(G1['sex']==1)].count()[0]

w   = G1[(G1['sex']==0) & (G1['DEATH_EVENT']==1)]
wdp = w['sex'].count()*100/G1[(G1['sex']==0)].count()[0]
labels = 'Death','No-Death' 
sizes  = [mdp,100-mdp]
explode = (0.1, 0)  # offset the first slice ('Death')


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("% Death in Men Sample", bbox={'facecolor':'0.8', 'pad':5})
plt.show()


labels = 'Death','No-Death' 
sizes  = [wdp,100-wdp]
explode = (0.1, 0)  # offset the first slice ('Death')


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("%Death in Women Sample ", bbox={'facecolor':'0.8', 'pad':5})
plt.show()

[Figure: pie chart "% Death in Men Sample"]

[Figure: pie chart "% Death in Women Sample"]

Death by Age


Deaths among smokers vs. non-smokers

G3 = df[['smoking','DEATH_EVENT']]
s   = G3[(G3['smoking']==1) & (G3['DEATH_EVENT']==1)]
ns  = G3[(G3['smoking']==0) & (G3['DEATH_EVENT']==1)]                              
      
Ts   = G3[(G3['smoking']==1) & (G3['DEATH_EVENT']==0)]
Tns  = G3[(G3['smoking']==0) & (G3['DEATH_EVENT']==0)] 

#s.count()[0], Ts.count()[0], ns.count()[0], Tns.count()[0]
labels = 'Death','No-Death' 
# Death rate among smokers: deaths / (deaths + survivors) where smoking == 1
sizes  = [len(s)*100/(len(s)+len(Ts)), len(Ts)*100/(len(s)+len(Ts))]
explode = (0.1, 0)  # offset the first slice ('Death')


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("% Death in Smokers Sample", bbox={'facecolor':'0.8', 'pad':5})
plt.show()


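labels = 'Death','No-Death' 
# Death rate among non-smokers: deaths / (deaths + survivors) where smoking == 0
sizes  = [len(ns)*100/(len(ns)+len(Tns)), len(Tns)*100/(len(ns)+len(Tns))]
explode = (0.1, 0)  # offset the first slice ('Death')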


fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("%Death in No-Smokers Sample ", bbox={'facecolor':'0.8', 'pad':5})
plt.show()

[Figure: pie chart "% Death in Smokers Sample"]

[Figure: pie chart "% Death in No-Smokers Sample"]

Deaths among patients with high blood pressure vs. without

Applying Machine Learning - Considering Time

import sys
import warnings

import numpy as np
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier, XGBRegressor

# Silence library warnings unless warning options were passed on the command line
if not sys.warnoptions:
    warnings.simplefilter("ignore")
X = df.drop('DEATH_EVENT',axis=1)
y = df['DEATH_EVENT']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
def cross_valid (model,name,X = X_train , y= y_train):
  
  modelo = model()
  cv = 5  # number of cross-validation folds
  scoring = 'neg_mean_squared_error' # negated MSE scorer used by cross_val_score
  n_jobs = -1
  
  
  score = cross_val_score(modelo,X ,y , cv= cv , scoring= scoring, n_jobs= n_jobs)

  RMSE = np.sqrt(- score.mean())


  # lower is better
  print('RMSE',name,RMSE)
cross_valid(RandomForestRegressor,'random_forest')
RMSE random_forest 0.3310877086934085
cross_valid(XGBRegressor,'xg_boost')
RMSE xg_boost 0.3555536909123532
cross_valid(Lasso,'lasso')
RMSE lasso 0.3838246741795928
cross_valid(Ridge,'ridge')
RMSE ridge 0.3536173277735311
cross_valid(SVR,'SVR')
RMSE SVR 0.48558580486692
cross_valid(LinearSVR,'Linear_SVR')
RMSE Linear_SVR 0.6336108098091258
cross_valid(XGBClassifier,'XGBClassifier')
RMSE XGBClassifier 0.3849672294000657
cross_valid(DecisionTreeClassifier,'Decision_tree')
RMSE Decision_tree 0.42044166811032346
cross_valid(RandomForestClassifier,'RandomForestClassifier')
RMSE RandomForestClassifier 0.3656251706175088
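`DEATH_EVENT` is a binary label, so RMSE values from regressors (continuous outputs) and from classifiers (0/1 outputs) are not directly comparable; for a 0/1 prediction, MSE is simply the error rate. A possible alternative is to score classifiers with stratified folds and a classification metric such as ROC AUC. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train`/`y_train`; the sample size and class weights are assumptions chosen to roughly mimic the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 300 rows, 12 features, roughly one third positives
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5,
    weights=[0.68, 0.32], random_state=42,
)

# Stratified folds keep the positive/negative ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", auc_scores.round(3))
print("mean ROC AUC:", auc_scores.mean().round(3))
```

Fixing `random_state` on both the folds and the model makes repeated runs comparable, which the default-constructed models in `cross_valid` do not guarantee.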
X = df.drop('DEATH_EVENT',axis=1)
y = df['DEATH_EVENT']
plt.figure(figsize=(13,6))
dt=RandomForestClassifier()
dt.fit(X,y)
feat_importances1 = pd.Series(dt.feature_importances_, index=X.columns)
feat_importances1.sort_values(ascending=True).plot(kind='barh')
plt.show()

[Figure: horizontal bar chart of random forest feature importances, including time]

feat_importances1.head(14)
age                         0.083395
anaemia                     0.013054
creatinine_phosphokinase    0.078637
diabetes                    0.015090
ejection_fraction           0.120798
high_blood_pressure         0.010296
platelets                   0.071771
serum_creatinine            0.130493
serum_sodium                0.069820
sex                         0.012747
smoking                     0.013187
time                        0.380713
dtype: float64
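`feature_importances_` from a random forest is impurity-based and computed on the data the model was fitted on, and it can favour features with many distinct values. Permutation importance on a held-out split is one possible cross-check. The sketch below runs on synthetic data, not on the heart-failure features; the dataset shape and all parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-failure features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out split and measure the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

If a feature ranks high in both views, its importance is less likely to be an artifact of how the trees split.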
param_grid = { 
            "n_estimators"      : [10,50,100,200,700],
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_split" : [2,4,8],
            "bootstrap": [True, False],
            }
grid =GridSearchCV(RandomForestClassifier(), param_grid, cv=5, 
                   n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
Fitting 5 folds for each of 90 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  60 tasks      | elapsed:   13.1s
[Parallel(n_jobs=-1)]: Done 210 tasks      | elapsed:   51.3s
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:  1.7min finished





GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'min_samples_split': [2, 4, 8],
                         'n_estimators': [10, 50, 100, 200, 700]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)
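# With scoring=None, GridSearchCV scores a classifier by mean cross-validated accuracy,
# so best_score_ is an accuracy and this value is sqrt(accuracy), not an RMSE.
RMSE_grid_param = np.sqrt(grid.best_score_)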

RMSE_grid_param
0.9384050025956979
grid.best_params_
{'bootstrap': True,
 'max_features': 'auto',
 'min_samples_split': 4,
 'n_estimators': 100}
pred = grid.predict(X_test)
pred
array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0])
score = r2_score(y_test,pred)
score
0.08210096889342178
MSE = mean_squared_error(y_test,pred)
MSE
0.2222222222222222
RMSE_pred = np.sqrt(MSE)
RMSE_pred
0.4714045207910317
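Because both `y_test` and `pred` contain only 0 and 1, each squared error is either 0 or 1, so the MSE above is the misclassification rate and 1 - MSE is the accuracy (here 1 - 0.2222 = 0.7778). A small sketch with made-up labels illustrates the identity:

```python
# Made-up binary labels for illustration only
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

n = len(y_true)
# Squared error of a 0/1 prediction is 1 when wrong and 0 when right
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(f"MSE = {mse:.2f}, accuracy = {accuracy:.2f}, MSE + accuracy = {mse + accuracy:.2f}")
```

The same reasoning explains why R² is hard to interpret for class predictions: it was designed for continuous targets.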
pred2 = grid.predict(X_train)
MSE2 = mean_squared_error(y_train,pred2)
RMSE2_pred = np.sqrt(MSE2)
RMSE2_pred
0.0
# Cross-tabulate actual vs. predicted labels on the test set
dft = pd.DataFrame({'y_Actual': y_test, 'y_Predicted': pred})
conf_mat = pd.crosstab(dft['y_Actual'], dft['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])

sns.heatmap(conf_mat, annot=True)
plt.show()

[Figure: confusion matrix heatmap, actual vs. predicted, model with time]
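The confusion matrix gives raw counts; for an imbalanced outcome like this one, precision and recall for the death class are often more informative than a single score. This is a hedged sketch with made-up labels, not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only (1 = death event)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a mortality model, recall on the death class shows how many actual deaths were caught, which accuracy alone can hide when most patients survive.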


Applying Machine Learning - Without Time

X = df.drop(['DEATH_EVENT', 'time'],axis=1)
y = df['DEATH_EVENT']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
def cross_valid (model,name,X = X_train , y= y_train):
  
  modelo = model()
  cv = 5  # number of cross-validation folds
  scoring = 'neg_mean_squared_error' # negated MSE scorer used by cross_val_score
  n_jobs = -1
  
  
  score = cross_val_score(modelo,X ,y , cv= cv , scoring= scoring, n_jobs= n_jobs)

  RMSE = np.sqrt(- score.mean())


  # lower is better
  print('RMSE',name,RMSE)
cross_valid(RandomForestClassifier,'RandomForestClassifier')
RMSE RandomForestClassifier 0.4637873676759594
cross_valid(RandomForestRegressor,'random_forest')
RMSE random_forest 0.3977204207745
cross_valid(XGBRegressor,'xg_boost')
RMSE xg_boost 0.40968810731337457
cross_valid(Lasso,'lasso')
RMSE lasso 0.4512461871781548
cross_valid(Ridge,'ridge')
RMSE ridge 0.41835811411245727
cross_valid(SVR,'SVR')
RMSE SVR 0.4855765692487356
cross_valid(LinearSVR,'Linear_SVR')
RMSE Linear_SVR 0.9000051885995444
cross_valid(XGBClassifier,'XGBClassifier')
RMSE XGBClassifier 0.4985460859015062
cross_valid(DecisionTreeClassifier,'Decision_tree')
RMSE Decision_tree 0.5312532024908597
cross_valid(RandomForestClassifier,'RandomForestClassifier')
RMSE RandomForestClassifier 0.45887809458619494
plt.figure(figsize=(13,6))
dt=RandomForestRegressor()
dt.fit(X,y)
feat_importances1 = pd.Series(dt.feature_importances_, index=X.columns)
feat_importances1.sort_values(ascending=True).plot(kind='barh')
plt.show()

[Figure: horizontal bar chart of random forest feature importances, without time]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
param_grid = { 
            "n_estimators"      : [10,50,100,200],
            "max_features"      : ["auto", "sqrt", "log2"],
            "min_samples_split" : [2,4,8],
            "bootstrap": [True, False],
            }
grid =GridSearchCV(RandomForestClassifier(), param_grid, cv=5, 
                   n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 164 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   36.0s finished





GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'min_samples_split': [2, 4, 8],
                         'n_estimators': [10, 50, 100, 200]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)
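# With scoring=None, best_score_ is the mean cross-validated accuracy,
# so this value is sqrt(accuracy), not an RMSE.
RMSE_grid_param = np.sqrt(grid.best_score_)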
RMSE_grid_param
0.8939725901110832
grid.best_params_
{'bootstrap': True,
 'max_features': 'log2',
 'min_samples_split': 4,
 'n_estimators': 200}
pred = grid.predict(X_test)
pred
array([1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0])
score = r2_score(y_test,pred)
score
-0.4227434982151961
MSE = mean_squared_error(y_test,pred)
MSE
0.34444444444444444
RMSE_pred = np.sqrt(MSE)
RMSE_pred
0.5868938953886337
# Cross-tabulate actual vs. predicted labels on the test set
dft = pd.DataFrame({'y_Actual': y_test, 'y_Predicted': pred})
conf_mat = pd.crosstab(dft['y_Actual'], dft['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])

sns.heatmap(conf_mat, annot=True)
plt.show()

[Figure: confusion matrix heatmap, actual vs. predicted, model without time]