1. Dataset Details Name: Gisette Data Set Download link: https://archive.ics.uci.edu/ml/machine-learning-databases/gisette/ 2. Dataset Description GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusible digits '4' and '9'. The digits are size -normalized and centered in a fixed-size image of dimension 28x28. A number of distractor features called 'probes' having no predictive power are also present. The order of the features and patterns are randomized. Attribute information is not provided to avoid biasing the feature selection process. Data type: non-sparse Number of features: 5000 Real: 2500 Probes: 2500 Total: 5000 Number of classes: 2 Number of examples: Pos_ex Neg_ex Tot_ex Train 3000 3000 6000 Valid 500 500 1000 3. Implementaion Details Random Forest classification is used. Total trees in forest is varied from 1 to 100. Each tree is a binary tree. Features selected at random for each tree= sqrt(Number_of_features) = 70 Training examples are sampled with replacement from original training dataset for each tree. Size of sampling = size of training set = 6000. Ginni impurity was used to measure entropy while splitting node. 3.1. Algorithm 3.1.1. Training: Repaat following for size_of_random_forest_times 1. Select random features, m without replacement and random data points of size n , with replacement. 2. Make root with the data selected in step 1. 3. Choose a threshold value of a feature for which entropy of splitting is minimum (Ginni). 4. Divide data as per the threshold in left and right child of root. 5. Repeat recursively with left and right child as root and data of each node as split in step 4. 6. Base condition/stopiing condition: 6.1. If all classes in node data are same, mark it as leaf with label as class label. 6.2. If all data points in node ar same, mark it leaf with label as majority class. 6.3. If entropy of a node is less than threshhold, mark it as leaf with label as majority class. 3.1.2. Classification: 1. For every test point, traverse every tree until leaf is reached. 2. Classify it as label of leaf with respect to each tree. 3. Label of test point is majority label predicted. 3.2. Data Structure Node with following attributes is used to build tree: is_leaf: bool to ckeck whther leaf or not split_var: splitting feature if not leaf else null. split_val: threshold value of splitting feature if not leaf, else null. label: class label if leaf else null. children: list containing left and right nodes if not leaf. 4. Resuls/Accuracy The plot of accuracy vs size_of_forest for training and validation data is under folder named "results". Broadly, accuracy of 100% is achieved in training data and 96.4% in validation data.