%load_ext autoreload
%autoreload
import numpy as np
import pandas as pd
import decision_tree as dt
data_1 = pd.read_csv('data_1.csv')
data_1
| | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 0 | Sunny | Hot | High | Weak | No |
| 1 | Sunny | Hot | High | Strong | No |
| 2 | Overcast | Hot | High | Weak | Yes |
| 3 | Rain | Mild | High | Weak | Yes |
| 4 | Rain | Cool | Normal | Weak | Yes |
| 5 | Rain | Cool | Normal | Strong | No |
| 6 | Overcast | Cool | Normal | Strong | Yes |
| 7 | Sunny | Mild | High | Weak | No |
| 8 | Sunny | Cool | Normal | Weak | Yes |
| 9 | Rain | Mild | Normal | Weak | Yes |
| 10 | Sunny | Mild | Normal | Strong | Yes |
| 11 | Overcast | Mild | High | Strong | Yes |
| 12 | Overcast | Hot | Normal | Weak | Yes |
| 13 | Rain | Mild | High | Strong | No |
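The internals of `dt.DecisionTree` are not shown here, but a learner like this typically picks each split by information gain, as in ID3. Assuming that is the strategy, the following self-contained sketch (independent of the `dt` module) computes the gain of each attribute on the table above:

```python
# Hypothetical sketch of ID3-style information gain on the Play Tennis data.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Reduction in label entropy from splitting on one attribute."""
    n = len(labels)
    gain = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    for subset in by_value.values():
        gain -= len(subset) / n * entropy(subset)
    return gain

# The 14 rows above: (Outlook, Temperature, Humidity, Wind) -> Play Tennis
rows = [
    ('Sunny', 'Hot', 'High', 'Weak'), ('Sunny', 'Hot', 'High', 'Strong'),
    ('Overcast', 'Hot', 'High', 'Weak'), ('Rain', 'Mild', 'High', 'Weak'),
    ('Rain', 'Cool', 'Normal', 'Weak'), ('Rain', 'Cool', 'Normal', 'Strong'),
    ('Overcast', 'Cool', 'Normal', 'Strong'), ('Sunny', 'Mild', 'High', 'Weak'),
    ('Sunny', 'Cool', 'Normal', 'Weak'), ('Rain', 'Mild', 'Normal', 'Weak'),
    ('Sunny', 'Mild', 'Normal', 'Strong'), ('Overcast', 'Mild', 'High', 'Strong'),
    ('Overcast', 'Hot', 'Normal', 'Weak'), ('Rain', 'Mild', 'High', 'Strong'),
]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
          'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

for i, name in enumerate(['Outlook', 'Temperature', 'Humidity', 'Wind']):
    print(f'{name}: {information_gain(rows, labels, i):.3f}')
```

Outlook has the highest gain of the four attributes, which is why it ends up at the root of the tree learned below.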
# Separate independent (X) and dependent (y) variables
X = data_1.drop(columns=['Play Tennis'])
y = data_1['Play Tennis']
# Create and fit a Decision Tree classifier
model_1 = dt.DecisionTree()
model_1.fit(X, y)
# Verify that it perfectly fits the training set
print(f'Accuracy: {dt.accuracy(y_true=y, y_pred=model_1.predict(X)) * 100 :.1f}%')
Accuracy: 100.0%
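The `dt.accuracy` helper is not listed here; presumably it is just the fraction of predictions that match the true labels. A minimal equivalent sketch:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that agree with the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))
```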
A big advantage of decision trees is that they are relatively transparent learners: it is easy for an outside observer to analyse and understand how the model makes its decisions. The study of how to reason about a machine learning model's decision process is known as Explainable AI (XAI), and explainability is often a desirable property of machine learning systems.
model_1.print_rules("Yes")
❌ Outlook=Sunny ∩ Humidity=High => No
✅ Outlook=Sunny ∩ Humidity=Normal => Yes
✅ Outlook=Overcast => Yes
✅ Outlook=Rain ∩ Wind=Weak => Yes
❌ Outlook=Rain ∩ Wind=Strong => No
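These five rules describe the model completely. As an illustration of that transparency, they can be transcribed into plain Python by hand (this is a sketch, not part of the `dt` module) and checked against every row of the table above:

```python
def play_tennis(outlook, humidity, wind):
    """Hand-transcription of the five rules printed above."""
    if outlook == 'Sunny':
        return 'Yes' if humidity == 'Normal' else 'No'
    if outlook == 'Overcast':
        return 'Yes'
    # Remaining case: outlook == 'Rain'
    return 'Yes' if wind == 'Weak' else 'No'

# (Outlook, Humidity, Wind, expected Play Tennis) for all 14 rows
samples = [
    ('Sunny', 'High', 'Weak', 'No'), ('Sunny', 'High', 'Strong', 'No'),
    ('Overcast', 'High', 'Weak', 'Yes'), ('Rain', 'High', 'Weak', 'Yes'),
    ('Rain', 'Normal', 'Weak', 'Yes'), ('Rain', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Normal', 'Strong', 'Yes'), ('Sunny', 'High', 'Weak', 'No'),
    ('Sunny', 'Normal', 'Weak', 'Yes'), ('Rain', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Normal', 'Strong', 'Yes'), ('Overcast', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Normal', 'Weak', 'Yes'), ('Rain', 'High', 'Strong', 'No'),
]
print(all(play_tennis(o, h, w) == y for o, h, w, y in samples))  # True
```

Note that Temperature never appears in the rules: the tree achieves perfect training accuracy without it.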
data_2 = pd.read_csv('data_2.csv')
data_2 = data_2.drop(columns=['Founder Zodiac'])  # Drop the 'Founder Zodiac' column, which only adds noise to the learning
data_2
| | Founder Experience | Second Opinion | Competitive Advantage | Lucurative Market | Outcome | Split |
|---|---|---|---|---|---|---|
| 0 | moderate | negative | yes | no | success | train |
| 1 | high | positive | yes | no | failure | train |
| 2 | low | negative | no | no | failure | train |
| 3 | low | negative | no | no | failure | train |
| 4 | low | positive | yes | yes | success | train |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | moderate | positive | no | yes | failure | test |
| 196 | low | negative | no | yes | failure | test |
| 197 | moderate | negative | no | yes | failure | test |
| 198 | moderate | negative | no | no | failure | test |
| 199 | moderate | negative | yes | no | success | test |
200 rows × 6 columns
The data is split into three sets:

- `train` contains 50 samples that you should use to generate the tree.
- `valid` contains 50 samples that you can use to evaluate different preprocessing methods and variations to the tree-learning algorithm.
- `test` contains 100 samples and should only be used to evaluate the final model once you're done experimenting.
data_2_train = data_2.query('Split == "train"')
data_2_valid = data_2.query('Split == "valid"')
data_2_test = data_2.query('Split == "test"')
X_train, y_train = data_2_train.drop(columns=['Outcome', 'Split']), data_2_train.Outcome
X_valid, y_valid = data_2_valid.drop(columns=['Outcome', 'Split']), data_2_valid.Outcome
X_test, y_test = data_2_test.drop(columns=['Outcome', 'Split']), data_2_test.Outcome
data_2.Split.value_counts()
test 100
train 50
valid 50
Name: Split, dtype: int64
# Fit model
model_2 = dt.DecisionTree()
model_2.fit(X_train, y_train)
print(f'Train: {dt.accuracy(y_train, model_2.predict(X_train)) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_2.predict(X_valid)) * 100 :.1f}%')
Train: 92.0%
Valid: 88.0%
model_2.print_rules(outcome="success")
✅ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=no ∩ Second Opinion=positive => success
❌ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=yes => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=positive ∩ Lucurative Market=failure => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=negative => failure
❌ Founder Experience=high ∩ Lucurative Market=no => failure
✅ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=no ∩ Second Opinion=positive => success
❌ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=yes => failure
❌ Founder Experience=low ∩ Second Opinion=negative => failure
✅ Founder Experience=low ∩ Second Opinion=positive => success