
Random Forest

Using Random Forest to predict the presence of heart disease. See RandomForestClassifier documentation from scikit-learn.

About

With Random Forest, it is important to avoid overfitting and maximize out-of-sample accuracy by optimizing the model's hyperparameters. The following parameters are optimized in this example:

  • n_estimators
  • max_depth
  • max_features

The following optimization techniques are used:

  • OOB error reduction
  • grid search with 10-fold cross-validation
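The two tuning techniques above can be sketched as follows. This is a minimal illustration on synthetic data (the actual notebook uses the Kaggle heart disease CSV); the parameter grids shown are examples, not the exact grids from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the heart disease data.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# OOB error reduction: track out-of-bag error as n_estimators grows
# and pick the forest size where the error levels off.
oob_errors = {}
for n in (10, 50, 90):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                bootstrap=True, random_state=0)
    rf.fit(X, y)
    oob_errors[n] = 1.0 - rf.oob_score_  # OOB error = 1 - OOB accuracy

# Grid search with 10-fold cross-validation over the three hyperparameters.
param_grid = {
    "n_estimators": [50, 90],
    "max_depth": [2, 4],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="f1")
search.fit(X, y)
print(oob_errors)
print(search.best_params_)
```

The OOB estimate comes free with bagging (each tree is scored on the samples it never saw), so it is a cheap complement to the full cross-validated grid search.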

The following feature importance techniques are used:

  • Gini impurity
  • Permutation feature importance
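Both importance measures are available in scikit-learn. A minimal sketch, again on synthetic stand-in data rather than the actual dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=90, max_depth=2,
                            max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# Gini (impurity-based) importance: mean decrease in impurity across trees,
# computed from the training data; values sum to 1.
gini_importance = rf.feature_importances_

# Permutation importance: drop in score when each feature is shuffled,
# evaluated on held-out data, so it is less biased toward high-cardinality features.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean
```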

Installation

Data

The heart disease dataset from Kaggle is used.

Results

Using the following parameters:

  • n_estimators = 90
  • max_depth = 2
  • max_features = 'sqrt'

An in-sample F1 score of 88.81% and an out-of-sample F1 score of 88% are obtained. With any machine learning model, you ideally want an in-sample accuracy only slightly better than the out-of-sample accuracy: a large gap signals overfitting, while in-sample performance below out-of-sample performance suggests underfitting. By these criteria, the modeling project was a success.
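The evaluation with the tuned hyperparameters can be sketched as below. This uses synthetic stand-in data, so the F1 scores will differ from the 88.81%/88% reported above for the actual dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit with the tuned hyperparameters from the grid search.
rf = RandomForestClassifier(n_estimators=90, max_depth=2,
                            max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

f1_in = f1_score(y_train, rf.predict(X_train))   # in-sample F1
f1_out = f1_score(y_test, rf.predict(X_test))    # out-of-sample F1
print(f"in-sample F1: {f1_in:.4f}, out-of-sample F1: {f1_out:.4f}")
```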