/reforest

Random Forests in Apache Spark

Primary LanguageScalaApache License 2.0Apache-2.0

ReForeSt

Build Status Coverage Status license

ReForeSt is a distributed, scalable implementation of the RF learning algorithm which targets fast and memory efficient processing. ReForeSt main contributions are manifold: (i) it provides a novel approach for the RF implementation in a distributed environment targeting an in-memory efficient processing, (ii) it is faster and more memory efficient with respect to the de facto standard MLlib, (iii) the level of parallelism is self-configuring.

ReForeSt and its documentation have been designed for developers and data scientists which are familiar with the Spark Enviroment and the MLlib library. Consequently please refer first to those documentation before starting with ReForeSt

Additional details can be found at ReForeSt website: https://sites.google.com/view/reforest/home

Performance

Enreaching the comparison presented here we report a comparison between many Random Forest Libraries, in the same setting of the link reported above, and ReForeSt

Tool n Time (sec) RAM (GB) AUC
R 10K 50 10 68.2
. 100K 1200 35 71.2
. 1M crash
Python 10K 2 2 68.4
. 100K 50 5 71.4
. 1M 900 20 73.2
. 10M crash
H2O 10K 15 2 69.8
. 100K 150 4 72.5
. 1M 600 5 75.5
. 10M 4000 25 77.8
Spark 10K 50 10 69.1
. 100K 270 30 71.3
. 1M crash/2000 71.4
xgboost 10K 4 1 69.9
. 100K 20 1 73.2
. 1M 170 2 75.3
. 10M 3000 9 76.3
ReForeSt 10K 32 5 68.7
. 100K 123 40 72.1
. 1M 534 80 73.2
. 10M 2732 125 75.4

How to build

An already packaged ReForeSt in zip or tar.gz format can be found in the directory "resources/package". Otherwise it is possible to build ReForeSt using Maven:

mvn clean package

How to train a random forest with ReForeSt

import reforest.rf.{RFProperty, RFRunner}

// Create the ReForeSt configuration.
val property = RFParameterBuilder.apply
  .addParameter(RFParameterType.Dataset, "data/test10k-labels")
  .addParameter(RFParameterType.NumFeatures, 794)
  .addParameter(RFParameterType.NumClasses, 10)
  .addParameter(RFParameterType.NumTrees, 100)
  .addParameter(RFParameterType.Depth, 10)
  .addParameter(RFParameterType.BinNumber, 32)
  .addParameter(RFParameterType.SparkMaster, "local[4]")
  .addParameter(RFParameterType.SparkCoresMax, 4)
  .addParameter(RFParameterType.SparkPartition, 4 * 4)
  .addParameter(RFParameterType.SparkExecutorMemory, "4096m")
  .addParameter(RFParameterType.SparkExecutorInstances, 1)
  .build

val sc = CCUtil.getSparkContext(property)

// Create the Random Forest classifier.
val timeStart = System.currentTimeMillis()
val rfRunner = ReForeStTrainerBuilder.apply(property).build(sc)

// Train a Random Forest model.
val model = rfRunner.trainClassifier()
val timeEnd = System.currentTimeMillis()

// Evaluate model on test instances and compute test error
val labelAndPreds = rfRunner.getDataLoader.getTestingData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / rfRunner.getDataLoader.getTestingData.count()

println("Accuracy: "+(1 - testErr))
println("Time: " + (timeEnd - timeStart))
rfRunner.sparkStop()

Authors

ReForeSt has been developed at Smartlab - DIBRIS - University of Genoa.