Decision-tree

A decision tree algorithm built from scratch. The algorithm follows the same structural layout as scikit-learn's decision tree, exposing the familiar fit/predict interface.
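
At the heart of such a learner is a splitting criterion. Below is a minimal sketch of the entropy and information-gain computations an ID3-style tree typically uses; the function names are illustrative and not necessarily those defined in decision_tree.py.

import numpy as np
import pandas as pd

def entropy(y: pd.Series) -> float:
    # Shannon entropy (in bits) of the label distribution
    p = y.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def information_gain(X: pd.DataFrame, y: pd.Series, attr: str) -> float:
    # Entropy reduction achieved by splitting on a categorical attribute
    remainder = 0.0
    for value in X[attr].unique():
        mask = X[attr] == value
        remainder += mask.mean() * entropy(y[mask])
    return entropy(y) - remainder

At each node the learner would greedily split on the attribute with the highest gain, recursing until the labels are pure or no attributes remain.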


Decision Tree

%load_ext autoreload
%autoreload

import numpy as np 
import pandas as pd 
import decision_tree as dt

First Dataset

data_1 = pd.read_csv('data_1.csv')
data_1
Outlook Temperature Humidity Wind Play Tennis
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
5 Rain Cool Normal Strong No
6 Overcast Cool Normal Strong Yes
7 Sunny Mild High Weak No
8 Sunny Cool Normal Weak Yes
9 Rain Mild Normal Weak Yes
10 Sunny Mild Normal Strong Yes
11 Overcast Mild High Strong Yes
12 Overcast Hot Normal Weak Yes
13 Rain Mild High Strong No

Fit and Evaluate Model

# Separate independent (X) and dependent (y) variables
X = data_1.drop(columns=['Play Tennis'])
y = data_1['Play Tennis']

# Create and fit a Decision Tree classifier
model_1 = dt.DecisionTree()
model_1.fit(X, y)

# Verify that it perfectly fits the training set
print(f'Accuracy: {dt.accuracy(y_true=y, y_pred=model_1.predict(X)) * 100 :.1f}%')
Accuracy: 100.0%
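
For reference, the accuracy helper needs to be nothing more than the fraction of matching labels. A possible implementation of dt.accuracy (an assumption, not necessarily the repository's actual code):

import numpy as np

def accuracy(y_true, y_pred) -> float:
    # Fraction of predictions that equal the ground-truth labels
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))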

Inspect Classification Rules

A big advantage of decision trees is that they are relatively transparent learners: it is easy for an outside observer to analyse and understand how the model makes its decisions. The challenge of explaining how a machine learning model reaches its conclusions is studied under the name Explainable AI, and this kind of interpretability is often a desirable property of machine learning systems.

model_1.print_rules("Yes")
❌ Outlook=Sunny ∩ Humidity=High => No
✅ Outlook=Sunny ∩ Humidity=Normal => Yes
✅ Outlook=Overcast => Yes
✅ Outlook=Rain ∩ Wind=Weak => Yes
❌ Outlook=Rain ∩ Wind=Strong => No
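
A rule listing like the one above can be produced by a depth-first walk from the root to every leaf, collecting one attribute=value test per edge along the way. The sketch below shows one way to implement this, assuming a simple Node structure; the names are hypothetical and need not match those used in decision_tree.py.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    attribute: Optional[str] = None                 # attribute tested at this node (None at leaves)
    children: dict = field(default_factory=dict)    # maps attribute value -> child Node
    label: Optional[str] = None                     # predicted class at a leaf (None otherwise)

def collect_rules(node, conditions=()):
    # Yield one (conditions, label) pair per root-to-leaf path
    if node.label is not None:
        yield list(conditions), node.label
        return
    for value, child in node.children.items():
        yield from collect_rules(child, (*conditions, (node.attribute, value)))

def print_rules(root, positive_label):
    for conds, label in collect_rules(root):
        marker = '✅' if label == positive_label else '❌'
        print(f"{marker} {' ∩ '.join(f'{a}={v}' for a, v in conds)} => {label}")

The ✅ and ❌ markers simply flag whether a leaf predicts the chosen positive class.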

Second Dataset

data_2 = pd.read_csv('data_2.csv')
data_2 = data_2.drop(columns=['Founder Zodiac']) # Drop a column that only adds noise to the learning
data_2
Founder Experience Second Opinion Competitive Advantage Lucurative Market Outcome Split
0 moderate negative yes no success train
1 high positive yes no failure train
2 low negative no no failure train
3 low negative no no failure train
4 low positive yes yes success train
... ... ... ... ... ... ...
195 moderate positive no yes failure test
196 low negative no yes failure test
197 moderate negative no yes failure test
198 moderate negative no no failure test
199 moderate negative yes no success test

200 rows × 6 columns

Split Data

The data is split into three sets:

  • train contains 50 samples that you should use to generate the tree.
  • valid contains 50 samples that you can use to evaluate different preprocessing methods and variations of the tree-learning algorithm.
  • test contains 100 samples and should only be used to evaluate the final model once you're done experimenting.
data_2_train = data_2.query('Split == "train"')
data_2_valid = data_2.query('Split == "valid"')
data_2_test = data_2.query('Split == "test"')
X_train, y_train = data_2_train.drop(columns=['Outcome', 'Split']), data_2_train.Outcome
X_valid, y_valid = data_2_valid.drop(columns=['Outcome', 'Split']), data_2_valid.Outcome
X_test, y_test = data_2_test.drop(columns=['Outcome', 'Split']), data_2_test.Outcome
data_2.Split.value_counts()
test     100
train     50
valid     50
Name: Split, dtype: int64

Fit and Evaluate Model

# Fit model
model_2 = dt.DecisionTree()
model_2.fit(X_train, y_train)
print(f'Train: {dt.accuracy(y_train, model_2.predict(X_train)) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_2.predict(X_valid)) * 100 :.1f}%')
Train: 92.0%
Valid: 88.0%
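
When experimentation is finished, the held-out test set provides the final, unbiased estimate. Following the convention above, it would be evaluated once, like so:

print(f'Test:  {dt.accuracy(y_test, model_2.predict(X_test)) * 100 :.1f}%')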

Inspect Classification Rules

model_2.print_rules(outcome="success")
✅ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=no ∩ Second Opinion=negative => success
❌ Founder Experience=moderate ∩ Competitive Advantage=yes ∩ Lucurative Market=yes => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=positive ∩ Lucurative Market=yes => failure
❌ Founder Experience=moderate ∩ Competitive Advantage=no ∩ Second Opinion=negative => failure
❌ Founder Experience=high ∩ Lucurative Market=no => failure
✅ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=no ∩ Second Opinion=positive => success
❌ Founder Experience=high ∩ Lucurative Market=yes ∩ Competitive Advantage=yes => failure
❌ Founder Experience=low ∩ Second Opinion=negative => failure
✅ Founder Experience=low ∩ Second Opinion=positive => success