Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.
Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each record. For example:
- Min-Max Scaler
  - Aggregation: global min & max
  - Mapping: scale each value to [min, max]
- One-Hot Encoder
  - Aggregation: distinct labels
  - Mapping: convert each label to a binary vector
We can implement this in a naive way using reduce and map.
case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

// Aggregation: global (min, max, distinct labels)
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))

// Mapping: scaled score followed by the one-hot encoded label
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}
But this is unmanageable for complex feature sets. The above logic can be easily expressed in Featran.
import com.spotify.featran._
import com.spotify.featran.transformers._
val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))
val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]
See Example.scala for more example usage. See transformers for a complete list of available feature transformers.
Featran also supports these additional features.
- Extract from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
- Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf and NumPy .npy file
- Import aggregation from a previous extraction for training, validation and test sets
- Compose feature specifications and separate outputs
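To illustrate why importing an aggregation from a previous extraction matters, here is a minimal plain-Scala sketch in the style of the naive example above. It computes the summary on the training set only and reuses it on held-out data, so both sets share the same scaling and the same one-hot columns. The names Aggregate, aggregate and transform are hypothetical and are not part of Featran's API; the handling of a zero range is also an assumption.

```scala
// Plain-Scala sketch; Aggregate, aggregate and transform are hypothetical
// names for illustration, not Featran's API.
case class Point(score: Double, label: String)

// Summary computed once, on the training set only.
case class Aggregate(min: Double, max: Double, labels: List[String])

def aggregate(data: Seq[Point]): Aggregate = {
  val scores = data.map(_.score)
  Aggregate(scores.min, scores.max, data.map(_.label).distinct.sorted.toList)
}

// Element-wise mapping that depends only on a previously computed summary.
def transform(agg: Aggregate)(p: Point): List[Double] = {
  val range = agg.max - agg.min
  // Assumption: a constant training column maps to 0.0 to avoid dividing by zero.
  val scaled = if (range == 0.0) 0.0 else (p.score - agg.min) / range
  scaled :: agg.labels.map(l => if (l == p.label) 1.0 else 0.0)
}

val train = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))
val test = Seq(Point(2.5, "b"), Point(4.0, "d"))

val agg = aggregate(train)                              // aggregate once
val trainFeatures = train.map(p => transform(agg)(p))   // map the training set
val testFeatures = test.map(p => transform(agg)(p))     // reuse the summary on the test set
```

In this sketch a test score outside the training range scales outside [0, 1], and a label not seen during training encodes to all zeros. Featran's own transformers may treat these cases differently, so check the transformer documentation before relying on either behavior.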
See ScalaDocs for current API documentation.
Featran includes the following artifacts:
- featran-core - core library, supports extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
- featran-flink - support for extraction from Flink DataSet
- featran-scalding - support for extraction from Scalding TypedPipe
- featran-scio - support for extraction from Scio SCollection
- featran-spark - support for extraction from Spark RDD
- featran-tensorflow - support for output as TensorFlow Example Protobuf
- featran-numpy - support for output as NumPy .npy file
Copyright 2016-2017 Spotify AB.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0