A Random Forest MapReduce implementation.
It uses the `DecisionTree` implementation from another repository of mine.
[NOTE]: Random Forest was never meant to fit the MapReduce framework and its design logic; this is just a for-fun project and is not optimized. It works, but the actual performance of this library may be very poor.
Command-line arguments:

```
[input training data folder] [output folder] [path to test data] [number of trees]
```

For example:

```
input output /path/to/test.csv 5
```
- Specifying the type of each attribute is required.
- Specifying the selected splitting attributes is required.
- After creating an instance of `RFMapReduce`, calling `setTrainSubsetFraction()` is required; `0.67` is the usual value.
- Call `RFDriver()` to execute (see the sketch after this list).
- (Optional) Calculate accuracy.
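A minimal sketch of these steps, with heavily hedged signatures: only the names `RFMapReduce`, `setTrainSubsetFraction()`, and `RFDriver()` appear in this README, and the two setters for attribute types and splitting attributes are invented here for illustration, not taken from the repository:

```java
// Sketch only: setTrainSubsetFraction() and RFDriver() are named in this
// README, but their exact signatures -- and the two setters below -- are
// assumptions, not this repository's actual API.
public class RunRandomForest {
    public static void main(String[] args) throws Exception {
        RFMapReduce rf = new RFMapReduce();

        // Required per the README: the type of each attribute and the
        // attributes considered for splitting (method names are hypothetical).
        rf.setAttributeTypes(new String[] {"numeric", "numeric", "categorical"});
        rf.setSplittingAttributes(new int[] {0, 1, 2});

        // Required: fraction of training rows sampled for each tree.
        rf.setTrainSubsetFraction(0.67);

        // args: [train folder] [output folder] [test csv] [number of trees]
        rf.RFDriver(args);
    }
}
```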
- Read the training data from a CSV file.
- Build n `InputSplit`s for n trees, where n is a command-line argument.
- Use a customized `InputFormat.getSplits()` to create the n `InputSplit`s, so the framework launches n mappers.
- Use a customized `RecordReader.nextKeyValue()` to create a 2/3 subset of the training data, sampled with replacement. When `Mapper.run()` calls `nextKeyValue()`, the method directly returns that 2/3 of the data. (Both customizations are sketched below.)
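A minimal sketch of those two customizations against the Hadoop 2.7.3 `org.apache.hadoop.mapreduce` API. The class names and the config keys `rf.num.trees` and `rf.train.csv` are assumptions; only the `getSplits()`/`nextKeyValue()` behavior comes from this README:

```java
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// One InputSplit per tree, so the framework schedules one mapper per tree.
public class TreeInputFormat extends InputFormat<NullWritable, Text> {

    // A split that carries no byte range: every mapper reads the whole CSV
    // and draws its own bootstrap sample in the RecordReader.
    public static class TreeSplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        // "rf.num.trees" is a hypothetical config key for the CLI argument.
        int numTrees = context.getConfiguration().getInt("rf.num.trees", 5);
        List<InputSplit> splits = new ArrayList<>();
        for (int i = 0; i < numTrees; i++) {
            splits.add(new TreeSplit());   // n identical splits -> n mappers
        }
        return splits;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new BootstrapRecordReader();
    }

    // Emits exactly one key/value pair: a 2/3 sample drawn with replacement.
    public static class BootstrapRecordReader
            extends RecordReader<NullWritable, Text> {
        private final Random random = new Random();
        private List<String> rows = new ArrayList<>();
        private Text sample;
        private boolean consumed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            // "rf.train.csv" is a hypothetical config key for the CSV path.
            Path path = new Path(context.getConfiguration().get("rf.train.csv"));
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) rows.add(line);
            }
        }

        @Override
        public boolean nextKeyValue() {
            if (consumed) return false;
            // 0.67 mirrors the suggested setTrainSubsetFraction() value.
            int n = (int) (rows.size() * 0.67);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(rows.get(random.nextInt(rows.size()))).append('\n');
            }
            sample = new Text(sb.toString());
            consumed = true;   // each mapper sees a single key/value pair
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public Text getCurrentValue() { return sample; }
        @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```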
- Each `InputSplit` is assigned to one mapper.
- After receiving its data, each mapper builds a tree and produces predictions for the test dataset. (Each mapper receives only one key/value pair from the `RecordReader`.)
- The test data and label are passed as key and value to the `Reducer`, which counts the majority label per key (see the sketch after this list).
- Write the results to the output file.
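A minimal sketch of the majority vote on the reduce side, assuming each mapper emits the test row as the key and a single tree's predicted label as the value; the class name and the `Text` key/value types are assumptions:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Majority vote: the key identifies a test row, the values are the labels
// predicted for that row by the individual trees.
public class MajorityVoteReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text testRow, Iterable<Text> predictedLabels,
                          Context context)
            throws IOException, InterruptedException {
        // Tally how many trees voted for each label.
        Map<String, Integer> counts = new HashMap<>();
        for (Text label : predictedLabels) {
            counts.merge(label.toString(), 1, Integer::sum);
        }
        // Pick the label with the highest vote count.
        String majority = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majority = e.getKey();
            }
        }
        context.write(testRow, new Text(majority));
    }
}
```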
- Use `process.py` to process the `smallerData.csv` file into an 80/20 train/test split (approximately label-balanced).
- Use all the jars in the `JARS` folder as this project's dependencies. (They are all from the Hadoop 2.7.3 framework.)