Spark logistic regression (for comparison)
szilard opened this issue · 3 comments
szilard commented
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Load the one-hot-encoded train/test sets and cache them in memory
val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())   // force materialization of the cache

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Time the training
val now = System.nanoTime
val model = pipeline.fit(d_train)
val elapsed = (System.nanoTime - now) / 1e9
elapsed

// Score the test set and compute AUC
val predictions = model.transform(d_test)
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probability")
  .setMetricName("areaUnderROC")
evaluator.evaluate(predictions)
```
szilard commented
szilard commented
compare to h2o:

```r
library(h2o)
h2o.init()

# Load train/test sets
dx_train <- h2o.importFile("train-10m.csv")
dx_test <- h2o.importFile("test.csv")

# All columns except the target are features
Xnames <- names(dx_train)[which(names(dx_train) != "dep_delayed_15min")]

# Time the training
system.time({
  md <- h2o.glm(x = Xnames, y = "dep_delayed_15min",
                training_frame = dx_train, family = "binomial")
})

# AUC on the test set
h2o.auc(h2o.performance(md, dx_test))
```
10M rows:
- data RAM: 4 GB
- training time: 6 s
- AUC: 0.7081992
- total RAM: 6 GB
100M rows (the 10M training set stacked ten times):

```r
dx_train0 <- h2o.importFile("train-10m.csv")
dx_train <- h2o.rbind(dx_train0, dx_train0, dx_train0, dx_train0, dx_train0,
                      dx_train0, dx_train0, dx_train0, dx_train0, dx_train0)
```

- data RAM: 6 GB
- training time: 36 s
- AUC: 0.7081992
- total RAM: 11 GB
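The ten-way `h2o.rbind` above simply stacks ten copies of the 10M-row frame to get 100M rows. A minimal pandas sketch of the same replication (a toy 3-row frame stands in for the real data; the column name is the target used above):

```python
import pandas as pd

# Toy stand-in for the 10M-row training frame (3 rows here)
d_small = pd.DataFrame({"dep_delayed_15min": ["N", "Y", "N"]})

# Stack ten copies, mirroring h2o.rbind(dx_train0, ..., dx_train0)
d_big = pd.concat([d_small] * 10, ignore_index=True)

print(len(d_big))  # 30, i.e. 10x the original row count
```

Because the rows are exact duplicates, the fitted model (and hence the AUC) is essentially unchanged; only the data volume grows.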
szilard commented
| | 10M | | 100M | |
|---|---|---|---|---|
| | Spark | h2o | Spark | h2o |
| time [s] | 20 | 6 | 155 | 36 |
| AUC | 0.709 | 0.708 | 0.709 | 0.708 |
| data RAM [GB] | 10 | 4 | 60 | 6 |
| data+train RAM [GB] | 22 | 6 | 110 | 11 |
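The relative speedups implied by the table are plain arithmetic on the timings above:

```python
# Training times from the table above [seconds]
spark_time = {"10M": 20, "100M": 155}
h2o_time = {"10M": 6, "100M": 36}

for size in ("10M", "100M"):
    ratio = spark_time[size] / h2o_time[size]
    print(f"{size}: h2o trains {ratio:.1f}x faster than Spark")
# 10M: h2o trains 3.3x faster than Spark
# 100M: h2o trains 4.3x faster than Spark
```

The memory gap is similar in direction: from the table, Spark uses roughly 2-10x the RAM of h2o at both scales, while the AUCs are effectively tied.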