/NoiseFramework

Noise Filters for Big Data

Primary LanguageScalaApache License 2.0Apache-2.0

NoiseFramework

This framework implements two Big Data preprocessing approaches to remove noisy examples: an homogeneous ensemble (HME-BD) and an heterogeneous ensemble (HTE-BD) filter, with special emphasis in their scalability and performance traits. A simple filtering approach based on similarities between instances (ENN-BD) is also implemented.

This software has been proved with four large real-world datasets such as:

Brief benchmark results:

  • HME-BD has shown to be the best noise filter algorithm, achieving the best accuracy.
  • HME-BD is also the most efficient in terms of computing time.
  • HTE-BD can outperform HME-BD for some datasets under low levels of noise.

Example (HME-BD)

import org.apache.spark.mllib.feature._

val nTrees = 100
val maxDepth = 10
val partitions = 4

// Data must be cached in order to improve the performance

val hme_bd_model = new HME_BD(trainingData, // RDD[LabeledPoint]
                              nTrees, // size of the Random Forests
                              partitions, // number of partitions
                              maxDepth, // depth of the Random Forests
                              seed) // seed for the Random Forests

val hme_bd = hme_bd_model.runFilter()

Example (HTE-BD)

import org.apache.spark.mllib.feature._

val nTrees = 100
val maxDepth = 10
val partitions = 4
val voting = 0 // 0 = majority, 1 = consensus

// Data must be cached in order to improve the performance

val hte_bd_model = new HTE_BD(trainingData, // RDD[LabeledPoint]
                              nTrees, // size of the Random Forests
                              partitions, // number of partitions
                              vote, // voting strategy
                              k, // number of neighbors
                              maxDepth, // depth of the Random Forests
                              seed) // seed for the Random Forests

val hte_bd = hte_bd_model.runFilter()

Example (ENN-BD)

import org.apache.spark.mllib.feature._

// Data must be cached in order to improve the performance

val enn_bd_model = new ENN_BD(trainingData, // RDD[LabeledPoint]
                              k) // number of neighbors

val enn_bd = enn_bd_model.runFilter()