A Random Forest MapReduce implementation.
It uses the `DecisionTree` implementation from another repository of mine.
[NOTE]: Random Forest was never meant to fit the MapReduce framework and its design logic; this is just a for-fun project and is not optimized. It works, but the actual performance of this library may be very poor.
Command-line arguments:

```
[input training data folder] [output folder] [path to test data] [number of trees]
```

For example:

```
input output /path/to/test.csv 5
```
- Specifying the type of each attribute is required.
- Specifying the selected splitting attributes is required.
- After creating an instance of `RFMapReduce`, calling `setTrainSubsetFraction()` is required; `0.67` is the usual value.
- Call `RFDriver()` to execute (see the sketch after this list).
- (Optional) Calculate accuracy.
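A minimal sketch of these steps, with heavily hedged signatures: only the names `RFMapReduce`, `setTrainSubsetFraction()`, and `RFDriver()` appear in this README, and the two setters for attribute types and splitting attributes are invented here for illustration, not taken from the repository:

```java
// Sketch only: setTrainSubsetFraction() and RFDriver() are named in this
// README, but their exact signatures -- and the two setters below -- are
// assumptions, not this repository's actual API.
public class RunRandomForest {
    public static void main(String[] args) throws Exception {
        RFMapReduce rf = new RFMapReduce();

        // Required per the README: the type of each attribute and the
        // attributes considered for splitting (method names are hypothetical).
        rf.setAttributeTypes(new String[] {"numeric", "numeric", "categorical"});
        rf.setSplittingAttributes(new int[] {0, 1, 2});

        // Required: fraction of training rows sampled for each tree.
        rf.setTrainSubsetFraction(0.67);

        // args: [train folder] [output folder] [test csv] [number of trees]
        rf.RFDriver(args);
    }
}
```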
- Read the training data from a CSV file.
- Build n `InputSplit`s for n trees, where n is a command-line argument.
- Use a customized `InputFormat.getSplits()` to create the n `InputSplit`s, so the framework launches n mappers.
- Use a customized `RecordReader.nextKeyValue()` to create a 2/3 subset of the training data, sampled with replacement. When `Mapper.run()` calls `nextKeyValue()`, the method directly returns that 2/3 of the data. (Both customizations are sketched below.)
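A minimal sketch of those two customizations against the Hadoop 2.7.3 `org.apache.hadoop.mapreduce` API. The class names and the config keys `rf.num.trees` and `rf.train.csv` are assumptions; only the `getSplits()`/`nextKeyValue()` behavior comes from this README:

```java
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// One InputSplit per tree, so the framework schedules one mapper per tree.
public class TreeInputFormat extends InputFormat<NullWritable, Text> {

    // A split that carries no byte range: every mapper reads the whole CSV
    // and draws its own bootstrap sample in the RecordReader.
    public static class TreeSplit extends InputSplit implements Writable {
        @Override public long getLength() { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) { }
        @Override public void readFields(DataInput in) { }
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        // "rf.num.trees" is a hypothetical config key for the CLI argument.
        int numTrees = context.getConfiguration().getInt("rf.num.trees", 5);
        List<InputSplit> splits = new ArrayList<>();
        for (int i = 0; i < numTrees; i++) {
            splits.add(new TreeSplit());   // n identical splits -> n mappers
        }
        return splits;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new BootstrapRecordReader();
    }

    // Emits exactly one key/value pair: a 2/3 sample drawn with replacement.
    public static class BootstrapRecordReader
            extends RecordReader<NullWritable, Text> {
        private final Random random = new Random();
        private List<String> rows = new ArrayList<>();
        private Text sample;
        private boolean consumed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException {
            // "rf.train.csv" is a hypothetical config key for the CSV path.
            Path path = new Path(context.getConfiguration().get("rf.train.csv"));
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) rows.add(line);
            }
        }

        @Override
        public boolean nextKeyValue() {
            if (consumed) return false;
            // 0.67 mirrors the suggested setTrainSubsetFraction() value.
            int n = (int) (rows.size() * 0.67);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                sb.append(rows.get(random.nextInt(rows.size()))).append('\n');
            }
            sample = new Text(sb.toString());
            consumed = true;   // each mapper sees a single key/value pair
            return true;
        }

        @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
        @Override public Text getCurrentValue() { return sample; }
        @Override public float getProgress() { return consumed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```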
- Each `InputSplit` is assigned to one mapper.
- After receiving its data, each mapper builds a tree and produces predictions for the test dataset. (Each mapper receives only one key/value pair from the `RecordReader`.)
- The test data and label are passed as key and value to the `Reducer`, which counts the majority label per key (see the sketch after this list).
- Write the results to the output file.
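A minimal sketch of the majority vote on the reduce side, assuming each mapper emits the test row as the key and a single tree's predicted label as the value; the class name and the `Text` key/value types are assumptions:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Majority vote: the key identifies a test row, the values are the labels
// predicted for that row by the individual trees.
public class MajorityVoteReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text testRow, Iterable<Text> predictedLabels,
                          Context context)
            throws IOException, InterruptedException {
        // Tally how many trees voted for each label.
        Map<String, Integer> counts = new HashMap<>();
        for (Text label : predictedLabels) {
            counts.merge(label.toString(), 1, Integer::sum);
        }
        // Pick the label with the highest vote count.
        String majority = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majority = e.getKey();
            }
        }
        context.write(testRow, new Text(majority));
    }
}
```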
- Use `process.py` to process the `smallerData.csv` file into an 80/20 train/test split (approximately label-balanced).
- Use all the jars in the `JARS` folder as this project's dependencies. (They are all from the Hadoop 2.7.3 framework.)