This research presents a data analytics approach to, and a breakdown of, a dataset of heart disease patient symptoms. Our analysis explores heart disease among a population of males and females between 29 and 77 years of age, using risk factors that determine its prevalence. We used popular Python libraries to explore and interpret the data. We plotted the dataset in various forms to show the audience which symptoms were most important. We also leveraged TensorFlow and other Python machine learning packages to evaluate overall precision on the dataset.
- pandas
  import pandas as pd
- numpy
  import numpy as np
- matplotlib.pyplot
  import matplotlib.pyplot as plt
- tensorflow
  import tensorflow as tf
- seaborn
  import seaborn as sns
- sklearn:
  - sklearn.metrics:
    - f1_score
    - precision_score
    - recall_score
    - confusion_matrix
  - sklearn.linear_model: LogisticRegression
  - sklearn.model_selection: train_test_split
  - sklearn.pipeline: Pipeline
  - sklearn.ensemble: RandomForestClassifier
  - sklearn.decomposition: PCA
  - sklearn.preprocessing: StandardScaler
  from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import Pipeline
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA
- outliers(data, column, outliers)
  Removes outliers from a specific column in data. The outliers argument is a list, for example [25, 75]; any data outside this range is removed.
- clean_data(data)
  Removes NaNs from data and replaces the values 1, 2, 3, 4 in the "target" column with 1 (heart disease).
- discretize(data, column, threshold)
  Discretizes a specific column in a pandas DataFrame (data) according to a threshold value (double).
- max_HR_percent(data, percent = 0.85)
  Creates a new column in a pandas DataFrame (data). The new column contains 1 for patients who did not reach at least 85% of their target heart rate and 0 otherwise. Target heart rate is calculated by the formula: Target Heart Rate = 220 - age
- heatmap_cor(dataset, plot_title, method = "spearman")
  Plots the correlation table as a heat map.
- run_random_forest(x_train, x_test, y_train, y_test, estimator = 10)
  Fits a random forest to the training data and returns the model.
- plot_confusion_matrix(y_test, x_pred, plot_title)
  Plots a confusion matrix based on the test data and predictions.
- ml_train_test_split(x, y, size = 0.20, rs = 42)
  Splits data into training and test sets. The size of the test set is defined by "size", and rs is the random_state argument of sklearn's train_test_split method.
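As a rough illustration of the data-preparation helpers listed above, here is a minimal sketch of clean_data, discretize, and max_HR_percent. The function bodies are assumptions based only on the descriptions above, not the notebook's actual code. In particular, the column names "age" and "thalach" (maximum heart rate achieved, as named in the UCI heart disease data), the new column name "below_target_hr", the extra keyword parameters, and the "above the threshold maps to 1" rule in discretize are hypothetical choices made for this example.

```python
import pandas as pd


def clean_data(data):
    """Drop rows containing NaNs and map target values 1-4 to 1 (heart disease)."""
    cleaned = data.dropna().copy()
    cleaned.loc[cleaned["target"].isin([1, 2, 3, 4]), "target"] = 1
    return cleaned


def discretize(data, column, threshold):
    """Binarize a column. Assumption: values above the threshold become 1, others 0."""
    result = data.copy()
    result[column] = (result[column] > threshold).astype(int)
    return result


def max_HR_percent(data, percent=0.85, age_col="age", hr_col="thalach",
                   new_col="below_target_hr"):
    """Flag patients whose maximum heart rate is below percent * (220 - age).

    age_col, hr_col and new_col are hypothetical parameters added for this sketch.
    """
    result = data.copy()
    target_hr = 220 - result[age_col]  # Target Heart Rate = 220 - age
    result[new_col] = (result[hr_col] < percent * target_hr).astype(int)
    return result


# Tiny made-up frame, for illustration only (not real patient data).
df = pd.DataFrame({
    "age": [50, 60, 40, 70],
    "thalach": [150.0, 120.0, float("nan"), 140.0],
    "target": [0, 3, 1, 2],
})

prepared = max_HR_percent(clean_data(df))
print(prepared)
```

In this toy frame the row with the missing heart rate is dropped, the remaining target values become 0, 1, 1, and only the 60-year-old patient (120 bpm against a threshold of 0.85 x 160 = 136) is flagged in the new column.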
Clone the repo using the following command in terminal:
git clone https://github.com/avivfaraj/DSCI521-project.git
After cloning the repo, open hd_analysis.ipynb and run the cells one at a time in the order in which they are presented. Alternatively, run the whole notebook in a single step from the menu: Cell -> Run All.
The first two sections, packages and functions, are required for the rest of the code to run. Make sure to run those two sections before running the analysis.
Creators:
- Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779