An implementation of C4.5 in Python, including a basic implementation of the algorithm, pre-pruning, post-pruning, and visualization.

C4.5 Algorithm

Introduction

C4.5 is an algorithm developed by Ross Quinlan for generating decision trees. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 are used for classification tasks.

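C4.5 builds decision trees from a set of training data in the same way as ID3, but it selects the splitting attribute using the information gain ratio (information gain divided by the entropy of the split itself) instead of plain information gain.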
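
As a rough illustration of the gain ratio criterion, the following minimal sketch computes it for a single discrete attribute. The function names `entropy` and `gain_ratio` and the toy data are assumptions made for this example and are not taken from the notebook in this repository.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain ratio of splitting `labels` by a discrete attribute.

    values[i] is the attribute value of sample i; labels[i] is its class.
    """
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)

    # Information gain: parent entropy minus the weighted child entropies.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder

    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data, for illustration only.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play = ["no", "no", "yes", "yes", "no", "yes"]
print(round(gain_ratio(outlook, play), 3))
```

Dividing by the split information penalizes attributes with many distinct values, which plain information gain tends to favor.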

Improvements over the ID3 algorithm are as follows.
  • Handling both continuous and discrete attributes.
  • Handling training data with missing attribute values.
  • Pruning trees after creation.
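
One common way to handle a continuous attribute is to sort its values and test a binary split at each boundary between adjacent distinct values. The sketch below assumes midpoints as candidate thresholds and scores them by information gain; `best_threshold` and the toy data are hypothetical names for this example, and the notebook in this repository may choose thresholds differently.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best split `value <= threshold`.

    Returns (None, 0.0) when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    total = len(pairs)
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, total):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (base
                - len(left) / total * entropy(left)
                - len(right) / total * entropy(right))
        if gain > best_gain:
            # Assumption: use the midpoint between the two values.
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


# Hypothetical toy data, for illustration only.
temperature = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(temperature, play))
```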

Pseudocode

This algorithm has a few base cases:

  • All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree that says to choose that class.
  • None of the features provides any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
  • An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

The general algorithm for building a decision tree is:

1: Check for the above base cases.
2: For each attribute a,
	find the information gain ratio from splitting on a.
3: Let a_best be the attribute with the highest information gain ratio.
4: Create a decision node that splits on a_best.
5: Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
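
The steps above can be sketched in Python roughly as follows. This is a simplified illustration that assumes discrete attributes only, rows stored as dictionaries, and a majority-class leaf for the base cases. The names `build_tree` and `gain_ratio` and the nested-dict tree format are assumptions for this example, not the structure used in the notebook.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Information gain ratio of splitting on the discrete attribute `attr`."""
    total = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or {"attr": name, "children": {value: subtree}}."""
    majority = Counter(labels).most_common(1)[0][0]
    # Step 1: base cases (pure node, or no attributes left to split on).
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Steps 2-3: score every attribute and keep the best one.
    scores = {a: gain_ratio(rows, labels, a) for a in attrs}
    a_best = max(scores, key=scores.get)
    if scores[a_best] <= 1e-12:
        return majority  # no attribute is informative
    # Steps 4-5: create the decision node and recurse on each subset.
    node = {"attr": a_best, "children": {}}
    remaining = [a for a in attrs if a != a_best]
    for value in sorted({row[a_best] for row in rows}):
        subset = [(r, y) for r, y in zip(rows, labels) if r[a_best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node


# Hypothetical toy data, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes", "no", "yes"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data the root splits on `outlook`, and only the `rain` branch needs a further split on `windy`. Continuous attributes, missing values, and pruning are left out of this sketch.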

Requirements

  • Python 3
  • numpy
  • pandas
  • matplotlib

Configurations

  • IDE: Jupyter Notebook
  • OS: Windows 10 (64-bit)

Install

Install the dependencies using pip:

pip install numpy
pip install pandas
pip install matplotlib