Decision-tree

A decision tree algorithm built from scratch. The algorithm follows the same structural layout as scikit-learn's decision tree, exposing the familiar fit/predict interface.
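
At the heart of such a learner is a splitting criterion. Below is a minimal sketch of the entropy and information-gain computations an ID3-style tree typically uses; the function names are illustrative and not necessarily those defined in decision_tree.py.

import numpy as np
import pandas as pd

def entropy(y: pd.Series) -> float:
    # Shannon entropy (in bits) of the label distribution
    p = y.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(X: pd.DataFrame, y: pd.Series, attr: str) -> float:
    # Entropy reduction achieved by splitting on a categorical attribute
    remainder = 0.0
    for value in X[attr].unique():
        mask = X[attr] == value
        remainder += mask.mean() * entropy(y[mask])
    return entropy(y) - remainder

At each node the learner would greedily split on the attribute with the highest gain, recursing until the labels are pure or no attributes remain.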


Decision Tree

%load_ext autoreload
%autoreload

import numpy as np 
import pandas as pd 
import decision_tree as dt

First Dataset

data_1 = pd.read_csv('data_1.csv')
data_1
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No

Fit and Evaluate Model

# Separate independent (X) and dependent (y) variables
X = data_1.drop(columns=['Play Tennis'])
y = data_1['Play Tennis']

# Create and fit a Decision Tree classifier
model_1 = dt.DecisionTree()
model_1.fit(X, y)

# Verify that it perfectly fits the training set
print(f'Accuracy: {dt.accuracy(y_true=y, y_pred=model_1.predict(X)) * 100 :.1f}%')
Accuracy: 100.0%
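
For reference, the accuracy helper needs to be nothing more than the fraction of matching labels. A possible implementation of dt.accuracy (an assumption, not necessarily the repository's actual code):

import numpy as np

def accuracy(y_true, y_pred) -> float:
    # Fraction of predictions that equal the ground-truth labels
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))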

Inspect Classification Rules

A big advantage of decision trees is that they are relatively transparent learners: it is easy for an outside observer to analyse and understand how the model makes its decisions. The challenge of explaining how a machine learning model reaches its conclusions is studied under the name Explainable AI, and this kind of interpretability is often a desirable property of machine learning systems.

model_1.print_rules("Yes")
❌ Outlook=Sunny ∩ Humidity=High => No
✅ Outlook=Sunny ∩ Humidity=Normal => Yes
✅ Outlook=Overcast => Yes
✅ Outlook=Rain ∩ Wind=Weak => Yes
❌ Outlook=Rain ∩ Wind=Strong => No
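
A rule listing like the one above can be produced by a depth-first walk from the root to every leaf, collecting one attribute=value test per edge along the way. The sketch below shows one way to implement this, assuming a simple Node structure; the names are hypothetical and need not match those used in decision_tree.py.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    attribute: Optional[str] = None                 # attribute tested at this node (None at leaves)
    children: dict = field(default_factory=dict)    # maps attribute value -> child Node
    label: Optional[str] = None                     # predicted class at a leaf (None otherwise)

def collect_rules(node, conditions=()):
    # Yield one (conditions, label) pair per root-to-leaf path
    if node.label is not None:
        yield list(conditions), node.label
        return
    for value, child in node.children.items():
        yield from collect_rules(child, (*conditions, (node.attribute, value)))

def print_rules(root, positive_label):
    for conds, label in collect_rules(root):
        marker = '✅' if label == positive_label else '❌'
        print(f"{marker} {' ∩ '.join(f'{a}={v}' for a, v in conds)} => {label}")

The ✅ and ❌ markers simply flag whether a leaf predicts the chosen positive class.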

Second Dataset

data_2 = pd.read_csv('data_2.csv')
data_2 = data_2.drop(columns=['Founder Zodiac']) # Drop a column that only adds noise to the learning
data_2
Founder Experience Second Opinion Competitive Advantage Lucurative Market Outcome Split
0 moderate negative yes no success train
1 high positive yes no failure train
2 low negative no no failure train
3 low negative no no failure train
4 low positive yes yes success train
... ... ... ... ... ... ...
195 moderate positive no yes failure test
196 low negative no yes failure test
197 moderate negative no yes failure test
198 moderate negative no no failure test
199 moderate negative yes no success test

200 rows × 6 columns

Split Data

The data is split into three sets:

  • train contains 50 samples that you should use to generate the tree.
  • valid contains 50 samples that you can use to evaluate different preprocessing methods and variations of the tree-learning algorithm.
  • test contains 100 samples and should only be used to evaluate the final model once you're done experimenting.
data_2_train = data_2.query('Split == "train"')
data_2_valid = data_2.query('Split == "valid"')
data_2_test = data_2.query('Split == "test"')
X_train, y_train = data_2_train.drop(columns=['Outcome', 'Split']), data_2_train.Outcome
X_valid, y_valid = data_2_valid.drop(columns=['Outcome', 'Split']), data_2_valid.Outcome
X_test, y_test = data_2_test.drop(columns=['Outcome', 'Split']), data_2_test.Outcome
data_2.Split.value_counts()
test     100
train     50
valid     50
Name: Split, dtype: int64

Fit and Evaluate Model

# Fit model
model_2 = dt.DecisionTree()
model_2.fit(X_train, y_train)
print(f'Train: {dt.accuracy(y_train, model_2.predict(X_train)) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_2.predict(X_valid)) * 100 :.1f}%')
Train: 92.0%
Valid: 88.0%
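
When experimentation is finished, the held-out test set provides the final, unbiased estimate. Following the convention above, it would be evaluated once, like so:

print(f'Test:  {dt.accuracy(y_test, model_2.predict(X_test)) * 100 :.1f}%')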

Inspect Classification Rules

model_2.print_rules(outcome="success")
✅ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=no ∩ Second Opinion=negative => success
❌ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=yes => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=positive ∩ Lucurative Market=yes => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=negative => failure
❌ Founder Experience=high ∩ Lucurative Market=no => failure
✅ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=no ∩ Second Opinion=positive => success
❌ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=yes => failure
❌ Founder Experience=low ∩ Second Opinion=negative => failure
✅ Founder Experience=low ∩ Second Opinion=positive => success