Introduction

This repository contains code for fitting a decision tree model from scikit-learn and predicting the classes of future text. It was inspired by an article.

The point of this notebook is to show what is going on under the hood of scikit-learn's model, especially how it weighs the inputs to the model. The Jupyter notebook outputs an image of the decision tree.
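As a rough illustration of what "weighing the inputs" can look like, here is a minimal sketch that fits a tree on a tiny made-up corpus and prints its rules and feature importances. The corpus, the labels, and the choice of `CountVectorizer` are assumptions made for this example, not the notebook's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy corpus (not taken from emails.csv); 1 = spam, 0 = not spam.
texts = [
    "win free prize NUMBER",
    "free money claim now",
    "meeting agenda for monday",
    "lunch on monday",
]
labels = [1, 1, 0, 0]

# Turn each text into a vector of token counts; keep case so "NUMBER" survives.
vectorizer = CountVectorizer(lowercase=False)
X = vectorizer.fit_transform(texts)

# Recover column names from the vocabulary mapping (token -> column index).
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, labels)

# Text rendering of the learned splits.
print(export_text(tree, feature_names=feature_names))

# Importance of each token; tokens the tree never splits on get 0.
importances = dict(zip(feature_names, tree.feature_importances_))
used = {name: weight for name, weight in importances.items() if weight > 0}
print(used)

# Classify an unseen text with the same vectorizer.
prediction = tree.predict(vectorizer.transform(["free prize inside"]))
print(prediction)
```

On a corpus this small the tree needs only a single split, so exactly one token ends up with a non-zero importance. Real email data produces deeper trees with importance spread over many tokens.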

Dataset

The dataset is an email dataset with labels 1 (spam) and 0 (not spam). The raw file is emails.csv. The processed emails are stemmed and tokenized, have some junk text removed, and have all numbers replaced with the string "NUMBER", along with a few other small changes.
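The preprocessing described above might look roughly like the following sketch. The `preprocess` and `crude_stem` functions are hypothetical helpers written for this example; in particular, `crude_stem` is a simplistic stand-in for a real stemmer, and the regular expressions are assumptions about what counts as a number or as junk text.

```python
import re


def crude_stem(token):
    # Stand-in for a real stemmer: strips a few common English suffixes,
    # keeping at least three characters of the original token.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Lowercase first so the uppercase placeholder stays distinguishable.
    text = text.lower()
    # Replace every number (including forms like 1,000 or 3.5) with "NUMBER".
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUMBER ", text)
    # Tokenize on runs of letters, which also drops punctuation and symbols.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Stem everything except the placeholder.
    return [t if t == "NUMBER" else crude_stem(t) for t in tokens]


print(preprocess("Claimed 3 prizes"))  # ['claim', 'NUMBER', 'prize']
```

Mapping all numbers to one token keeps the vocabulary small, so the tree can split on "contains a number" rather than on each individual value.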

Compatibility

The code has been updated to work with sklearn 0.22, but it will break with 0.23.

Required packages are

pydotplus
GraphViz
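These two packages are typically used together to turn a fitted tree into an image. The sketch below is one possible way to do that, assuming a toy dataset and made-up feature names; the `write_png` helper is hypothetical and is only defined here, not called, because it needs both pydotplus and a GraphViz installation.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy data: column 0 counts "free", column 1 counts "monday".
X = [[0, 1], [1, 0], [0, 0], [1, 1]]
y = [0, 1, 0, 1]  # 1 = spam, 0 = not spam

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=["free", "monday"],
    class_names=["not spam", "spam"],
    filled=True,
)


def write_png(dot_source, path):
    # Imported lazily so the rest of the sketch runs without pydotplus.
    import pydotplus

    graph = pydotplus.graph_from_dot_data(dot_source)
    graph.write_png(path)  # requires the GraphViz binaries to be installed


print(dot_data)
```

Calling `write_png(dot_data, "tree.png")` would then write the image to disk, provided GraphViz is available on the system path.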

AndreasLH/DecisionTreeVisualizationText
