PCARD Ensemble classifier for Big Data

Primary language: Scala. License: Apache License 2.0.

PCARD Ensemble

This package implements the PCARD ensemble algorithm, a distributed version of the method presented in [1]. For each member of the ensemble, the algorithm applies Random Discretization and Principal Component Analysis to the input data, joins the two resulting feature sets, and trains a decision tree on the combined data.
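The Random Discretization and join steps described above can be sketched in plain Scala, without Spark. This is a minimal illustration under stated assumptions, not the code used by this package: the object and function names (RandomDiscretizationSketch, sampleCuts, discretize, join) are hypothetical, thresholds are assumed to be drawn from observed feature values, a value equal to a threshold is assigned to the lower bin, and the PCA projection is replaced by the raw row to keep the example short.

```scala
import scala.util.Random

object RandomDiscretizationSketch {

  // For every feature, draw `nCuts` thresholds by sampling values
  // that actually occur in that feature's column (assumption).
  def sampleCuts(data: Array[Array[Double]], nCuts: Int, rng: Random): Array[Array[Double]] = {
    val nFeatures = data.head.length
    Array.tabulate(nFeatures) { j =>
      Array.fill(nCuts)(data(rng.nextInt(data.length))(j))
    }
  }

  // Replace each value by its bin index: the number of thresholds
  // of that feature that are strictly smaller than the value.
  def discretize(row: Array[Double], cuts: Array[Array[Double]]): Array[Double] =
    row.zip(cuts).map { case (value, featureCuts) => featureCuts.count(_ < value).toDouble }

  // Concatenate the discretized features with a second view of the same row.
  // In PCARD the second view would be the PCA projection.
  def join(discretized: Array[Double], projected: Array[Double]): Array[Double] =
    discretized ++ projected

  def main(args: Array[String]): Unit = {
    val data = Array(
      Array(0.1, 10.0),
      Array(0.4, 20.0),
      Array(0.7, 30.0),
      Array(0.9, 40.0)
    )
    val cuts = sampleCuts(data, nCuts = 2, rng = new Random(42))
    // The raw row stands in for the PCA output in this sketch.
    val joined = data.map(row => join(discretize(row, cuts), row))
    joined.foreach(row => println(row.mkString(", ")))
  }
}
```

Each tree in the ensemble would use its own random thresholds, which is what makes the members differ from one another; a decision tree would then be trained on the joined rows.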

This software has been tested on five large real-world datasets.

Brief benchmark results:

  • PCARD outperforms both the original proposal and the Random Forest implementation in MLlib on all datasets.
  • On the epsilon dataset, PCARD with just 10 trees achieves 5% less error than a Random Forest with up to 500 trees.

Example (ml)

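import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification._
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

// trainingData and testData are DataFrames with "label" and "features" columns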

val nTrees = 10
val nBins = 5

val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(trainingData)

val pcard = new PCARDClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setTrees(nTrees)
      .setCuts(nBins)

val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, pcard, labelConverter))

val model = pipeline.fit(trainingData)

val predictions = model.transform(testData)

Example (MLlib)

import org.apache.spark.mllib.tree._

val nTrees = 10
val nBins = 5

// Cache the training data beforehand (e.g. trainingData.cache()) for better performance

val pcardModel = PCARD.train(trainingData, // RDD[LabeledPoint]
                            nTrees, // number of trees in the ensemble
                            nBins) // number of thresholds per feature

val predicted = pcardModel.predict(testData) // RDD[LabeledPoint]

References

[1] A. Ahmad and G. Brown, "Random Projection Random Discretization Ensembles - Ensembles of Linear Multivariate Decision Trees", IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 1225–1239, May 2014.