random-forest

A random forest implementation in Python.

1. Dataset Details

Name: Gisette Data Set

Download link: https://archive.ics.uci.edu/ml/machine-learning-databases/gisette/

2. Dataset Description 

GISETTE is a handwritten digit recognition problem. The task is to
separate the highly confusable digits '4' and '9'. The digits are
size-normalized and centered in a fixed-size image of dimension 28x28. A
number of distractor features called 'probes', which have no predictive
power, are also present. The order of the features and patterns is
randomized. Attribute information is not provided, to avoid biasing the
feature selection process.

Data type: non-sparse

Number of features: 5000
		Real: 2500 
		Probes: 2500 
		Total: 5000 

Number of classes: 2

Number of examples:
     	Pos_ex	Neg_ex	Tot_ex
Train	 3000	 3000	 6000
Valid	  500	  500	 1000
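
The data files are plain whitespace-separated text, so they can be loaded
directly with numpy. A minimal loading sketch, assuming the standard UCI
file names from the download link above were saved into a local "data"
directory:

    import numpy as np

    # File names follow the UCI archive; the local "data/" path is an assumption.
    X_train = np.loadtxt("data/gisette_train.data")    # shape (6000, 5000)
    y_train = np.loadtxt("data/gisette_train.labels")  # labels are +1 / -1
    X_valid = np.loadtxt("data/gisette_valid.data")    # shape (1000, 5000)
    y_valid = np.loadtxt("data/gisette_valid.labels")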

3. Implementation Details

Random Forest classification is used.
The number of trees in the forest is varied from 1 to 100; each tree is a
binary tree.
The number of features selected at random for each tree is
sqrt(Number_of_features) = sqrt(5000) ~ 70.
For each tree, training examples are sampled with replacement from the
original training set (bagging); the bootstrap sample size equals the
training set size, 6000.
Gini impurity is used to measure node impurity when choosing splits (see
the sketch below).
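
A minimal sketch of the Gini computation and the two per-tree sampling
steps described above (the function names and the use of numpy's Generator
are illustrative, not the notebook's actual API):

    import numpy as np

    def gini_impurity(labels):
        """Gini impurity of a label array: 1 - sum_k p_k^2."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def sample_for_tree(X, y, m, rng):
        """Per-tree sampling: bootstrap the rows (with replacement) and
        pick m feature columns (without replacement)."""
        n_rows, n_feats = X.shape
        rows = rng.choice(n_rows, size=n_rows, replace=True)    # 6000 rows
        feats = rng.choice(n_feats, size=m, replace=False)      # m = 70 columns
        return X[np.ix_(rows, feats)], y[rows], feats

    rng = np.random.default_rng(0)   # seed is arbitrary
    m = int(np.sqrt(5000))           # 70 features per tree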

3.1. Algorithm

3.1.1. Training: Repeat the following size_of_random_forest times (a
Python sketch follows this list):
	1. Select m features at random without replacement, and n data
	points at random with replacement.
	2. Make a root node with the data selected in step 1.
	3. Choose the feature and threshold value for which the Gini
	impurity of the split is minimum.
	4. Divide the data according to the threshold into the left and
	right children of the root.
	5. Repeat recursively, treating the left and right children as
	roots with the data assigned to them in step 4.
	6. Base/stopping conditions:
		6.1. If all classes in the node's data are the same, mark
		it as a leaf labeled with that class.
		6.2. If all data points in the node are identical, mark it
		as a leaf labeled with the majority class.
		6.3. If the impurity of a node is below a threshold, mark
		it as a leaf labeled with the majority class.
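
As referenced above, a compact sketch of this recursive procedure. It
reuses gini_impurity from the sketch in section 3 and the Node class
sketched in section 3.2 below; the exhaustive threshold search and the
impurity cutoff value are illustrative choices, since the text does not
fix them:

    import numpy as np

    IMPURITY_CUTOFF = 1e-3   # stopping condition 6.3; the exact value is an assumption

    def best_split(X, y):
        """Step 3: exhaustive search for the (feature, threshold) pair that
        minimizes the weighted Gini impurity of the two children."""
        best_j, best_thr, best_score = None, None, np.inf
        n = len(y)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j])[:-1]:   # splitting at the max leaves a child empty
                left = X[:, j] <= thr
                score = (left.sum() * gini_impurity(y[left])
                         + (~left).sum() * gini_impurity(y[~left])) / n
                if score < best_score:
                    best_j, best_thr, best_score = j, thr, score
        return best_j, best_thr

    def majority_label(y):
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]

    def build_tree(X, y):
        """Steps 2-6: grow one binary tree on an already-sampled (X, y)."""
        # Stopping conditions 6.1 (pure node), 6.2 (identical points), 6.3 (low impurity).
        if (len(np.unique(y)) == 1 or np.all(X == X[0])
                or gini_impurity(y) < IMPURITY_CUTOFF):
            return Node(is_leaf=True, label=majority_label(y))
        j, thr = best_split(X, y)
        left = X[:, j] <= thr                      # step 4: split on the threshold
        return Node(split_var=j, split_val=thr,    # step 5: recurse into children
                    children=[build_tree(X[left], y[left]),
                              build_tree(X[~left], y[~left])])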
	
3.1.2. Classification (see the sketch after this list):
	1. For every test point, traverse each tree until a leaf is reached.
	2. Classify the point with the label of that leaf, once per tree.
	3. The label of the test point is the majority of the predicted
	labels.
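
As referenced above, a sketch of this majority vote, assuming each entry
of the forest stores a tree root together with the feature indices it was
trained on (the feats array from the sampling sketch in section 3):

    import numpy as np

    def predict_tree(node, x, feats):
        """Steps 1-2: walk one tree to a leaf; feats maps the tree's local
        feature indices back to the original 5000 columns."""
        while not node.is_leaf:
            left, right = node.children
            node = left if x[feats[node.split_var]] <= node.split_val else right
        return node.label

    def predict_forest(forest, x):
        """Step 3: majority vote over all trees in the forest, which is a
        list of (root, feats) pairs."""
        votes = [predict_tree(root, x, feats) for root, feats in forest]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]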

3.2. Data Structure

A node with the following attributes is used to build each tree (a sketch
follows the list):
	is_leaf: bool indicating whether the node is a leaf.
	split_var: splitting feature if not a leaf, else null.
	split_val: threshold value of the splitting feature if not a leaf,
	else null.
	label: class label if a leaf, else null.
	children: list containing the left and right child nodes if not a
	leaf.
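
A minimal sketch of this node as a Python dataclass; the field names
follow the list above, while the dataclass form itself is an assumption
about the notebook's actual code:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        is_leaf: bool = False
        split_var: Optional[int] = None    # splitting feature index; None for leaves
        split_val: Optional[float] = None  # threshold on that feature; None for leaves
        label: Optional[int] = None        # class label; None for internal nodes
        children: List["Node"] = field(default_factory=list)  # [left, right] if internal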
        
4. Results/Accuracy

The plot of accuracy vs. size_of_forest for the training and validation
data is in the folder named "results".
An accuracy of 100% is achieved on the training data and 96.4% on the
validation data.
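
For reference, such a curve can be reproduced with the pieces sketched
above by growing the forest one tree at a time and scoring after each
addition (the output file name and the seed are illustrative):

    import matplotlib.pyplot as plt
    import numpy as np

    def accuracy(forest, X, y):
        preds = np.array([predict_forest(forest, x) for x in X])
        return np.mean(preds == y)

    forest, train_acc, valid_acc = [], [], []
    for _ in range(100):                                  # forest sizes 1..100
        Xs, ys, feats = sample_for_tree(X_train, y_train, m=70, rng=rng)
        forest.append((build_tree(Xs, ys), feats))
        train_acc.append(accuracy(forest, X_train, y_train))
        valid_acc.append(accuracy(forest, X_valid, y_valid))

    plt.plot(range(1, 101), train_acc, label="train")
    plt.plot(range(1, 101), valid_acc, label="validation")
    plt.xlabel("size of forest (number of trees)")
    plt.ylabel("accuracy")
    plt.legend()
    plt.savefig("results/accuracy_vs_forest_size.png")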