# Brushfire
Brushfire is a framework for distributed supervised learning of decision tree ensemble models in Scala.
The basic approach to distributed tree learning is inspired by Google's PLANET, but considerably generalized thanks to Scala's type parameterization and Algebird's aggregation abstractions.
Brushfire currently supports:
- binary and multi-class classifiers
- numeric features (discrete and continuous)
- categorical features (including those with very high cardinality)
- k-fold cross validation and random forests
- chi-squared test as a measure of split quality
- feature importance and Brier scores
- Scalding/Hadoop as a distributed computing platform
In the future we plan to add support for:
- regression trees
- CHAID-like multi-way splits
- error-based pruning
- many more ways to evaluate splits and trees
- Spark and single-node in-memory platforms
## Authors
- Avi Bryant http://twitter.com/avibryant
Thanks for assistance and contributions:
- Steven Noble http://twitter.com/snoble
- Colin Marc http://twitter.com/colinmarc
- Dan Frank http://twitter.com/danielhfrank
## Quick start

```shell
mvn package
cd example
./iris
cat iris.output/step_03
```
If it worked, you should see a JSON representation of 4 versions of a decision tree for classifying irises.
To use Brushfire as a jar in your own project, add the following to your POM file:

```xml
<dependency>
  <groupId>com.stripe</groupId>
  <artifactId>brushfire</artifactId>
  <version>0.4.0</version>
</dependency>
```
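If you build with sbt instead of Maven, the equivalent dependency would be the following (assuming the artifact is published under the plain `brushfire` artifactId shown above, without a Scala-version suffix):

```scala
libraryDependencies += "com.stripe" % "brushfire" % "0.4.0"
```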
## Using Brushfire with Scalding
The only distributed computing platform that Brushfire currently supports is Scalding, version 0.12 or later.
The simplest way to use Brushfire with Scalding is by subclassing `TrainerJob` and overriding `trainer` to return an instance of `Trainer`. Example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  def trainer = ???
}
```
To construct a `Trainer`, you need to pass it training data as a Scalding `TypedPipe` of Brushfire [`Instance[K, V, T]`](http://stripe.github.io/brushfire/#com.stripe.brushfire.Instance) objects. `Instance` looks like this:

```scala
case class Instance[K, V, T](id: String, timestamp: Long, features: Map[K, V], target: T)
```
- The `id` should be unique for each instance.
- If there's an associated observation time, it should be the `timestamp`. (Otherwise `0L` is fine.)
- `features` is a `Map` from feature name (type `K`, usually `String`) to some value of type `V`. There's built-in implicit support for `Int`, `Double`, `Boolean`, and `String` types (with the assumption for `Int` and `String` that there is a small, finite number of possible values). If, as is common, you need to mix different feature types, see the section on `Dispatched` below.
- The only built-in support for `target` currently is for `Map[L, Long]`, where `L` represents some label type (for example `Boolean` for a binary classifier or `String` for multi-class). The `Long` values represent the weight for the instance, which is usually 1.
Example:

```scala
Instance("AS-2014", 1416168857L, Map("lat" -> 49.2, "long" -> 37.1, "altitude" -> 35000.0), Map(true -> 1L))
```
You also need to pass it a `Sampler`. Here are some samplers you might use:

- `SingleTreeSampler` will use the entirety of the training data to construct a single tree.
- `KFoldSampler(numTrees: Int)` will construct k different trees, each excluding a random 1/k of the data, for use in cross-validation.
- `RFSampler(numTrees: Int, featureRate: Double, samplingRate: Double)` will construct multiple trees, each using a separate bootstrap sample (using `samplingRate`, which defaults to `1.0`). Each node in the tree will also only consider a random `featureRate` sample of the available features. (This is the approach used for random forests.)
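The idea behind `KFoldSampler` can be sketched in plain Scala. This is a conceptual illustration only, not the library's actual implementation; the names `fold` and `inTrainingSet` are invented here:

```scala
// Sketch of k-fold assignment: each instance is deterministically mapped
// to one of k held-out folds based on its id, and is used to train every
// tree except the one whose index matches its fold.

// Non-negative bucket in [0, k), stable across runs for the same id.
def fold(id: String, k: Int): Int =
  ((id.hashCode % k) + k) % k

// Instance `id` belongs to tree `tree`'s training set unless it falls
// in that tree's held-out fold.
def inTrainingSet(id: String, tree: Int, k: Int): Boolean =
  fold(id, k) != tree
```

Because each instance is held out of exactly one of the k trees, every tree trains on roughly (k-1)/k of the data and can be validated on the remaining 1/k.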
Once you have constructed a `Trainer`, you most likely want to call `expandTimes(base: String, times: Int)`. This will build a new ensemble of trees from the training data and expand them `times` times, to a depth of `times + 1`. At each step, the trees will be serialized to a directory (on HDFS, unless you're running in local mode) under `base`.
Fuller example:

```scala
import com.stripe.brushfire._
import com.stripe.brushfire.scalding._
import com.twitter.scalding._

class MyJob(args: Args) extends TrainerJob(args) {
  def trainingData: TypedPipe[Instance[K, V, T]] = ???
  def trainer = Trainer(trainingData, KFoldSampler(4)).expandTimes(args("output"), 5)
}
```
### Dispatched
If you have mixed feature types, the recommended value type is `Dispatched[Int, String, Double, String]`, which requires each of your feature values to match one of these four cases:

- `Ordinal(v: Int)` for numeric features with a reasonably small number of possible values
- `Nominal(v: String)` for categorical features with a reasonably small number of possible values
- `Continuous(v: Double)` for numeric features with a large or infinite number of possible values
- `Sparse(v: String)` for categorical features with a large or infinite number of possible values
Note that using `Sparse` and especially `Continuous` features will currently slow learning down considerably. (But on the other hand, if you try to use `Ordinal` or `Nominal` with a feature that has hundreds of thousands of unique values, it will be even slower, and then fail.)
Example of a features map:

```scala
Map("age" -> Ordinal(35), "gender" -> Nominal("male"), "weight" -> Continuous(130.23), "name" -> Sparse("John"))
```
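One way to see why `Continuous` features cost more: a splitter for them must first derive candidate split thresholds from the observed values, whereas an `Ordinal` or `Nominal` feature's small value set serves directly as its candidates. Here is a standalone sketch of quantile-based threshold selection; this is illustrative only, and `quantileThresholds` is not a Brushfire API:

```scala
// Derive (bins - 1) candidate thresholds from observed values by taking
// evenly spaced quantile cut points of the sorted sample. A real splitter
// would do this per-feature over the whole (distributed) training set,
// which is the extra work that makes continuous features slower to learn.
def quantileThresholds(values: Seq[Double], bins: Int): Seq[Double] = {
  val sorted = values.sorted
  (1 until bins).map { i => sorted((i * sorted.size) / bins) }
}

val weights = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
val thresholds = quantileThresholds(weights, 4) // Seq(3.0, 5.0, 7.0)
```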
## Extending Brushfire
Brushfire is designed to be extremely pluggable. Some ways you might want to extend it, from simplest to most involved:

- Add a new sampling strategy, to get finer-grained control over how instances are allocated to trees, or between the training set and the test set: define a new `Sampler`.
- Add a new evaluation strategy (such as log-likelihood or entropy) or stopping criterion: define a new `Evaluator`.
- Add a new feature type, or a new way of binning an existing feature type (such as log-binning real numbers): define a new `Splitter`.
- Add a new target type (such as real-valued targets for regression trees): define a new `Evaluator`, and quite likely also a new `Splitter` for any continuous or sparse feature types you want to be able to use.
- Add a new distributed computing platform: define a new equivalent of `Trainer`, idiomatic to the platform you're using. (There's no specific interface this should implement.)
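To give a flavor of what an evaluation strategy computes, here is a standalone sketch of entropy-based split scoring over `Map[L, Long]` targets. This is illustrative only: it does not implement Brushfire's `Evaluator` trait, and the function names are invented here:

```scala
// Shannon entropy (in bits) of a target count distribution such as
// Map(true -> 4L, false -> 2L). Zero means the node is pure.
def entropy[L](counts: Map[L, Long]): Double = {
  val total = counts.values.sum.toDouble
  counts.values.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// Score a candidate split by the entropy of each resulting partition,
// weighted by partition size. A lower score means purer children,
// so an evaluator would prefer the split minimizing this value.
def splitScore[L](partitions: Seq[Map[L, Long]]): Double = {
  val total = partitions.map(_.values.sum).sum.toDouble
  partitions.map { part => (part.values.sum / total) * entropy(part) }.sum
}
```

A perfect split (each child containing a single class) scores 0, while leaving a 50/50 binary node unsplit scores 1 bit.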