spark-stratifier

When we first started working with Spark at HackerRank, we found that the outcome classes in our dataset varied considerably in size. This led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends the existing CrossValidator class in Spark.

Currently, the stratified cross validator works with binary classification problems using labels 0 and 1.
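
To make the idea concrete, here is a minimal sketch of stratified fold assignment for 0/1 labels in plain Python. It is an illustration only, not the code spark-stratifier runs: the function name stratified_folds, its parameters, and the toy label list are all hypothetical. Each class is shuffled separately and dealt round-robin into the folds, so every fold keeps roughly the same class ratio as the full dataset.

```python
import random
from collections import Counter


def stratified_folds(labels, num_folds, seed=0):
    """Return num_folds lists of row indices with a similar 0/1 ratio in each.

    Hypothetical helper for illustration; not part of spark-stratifier.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    for cls in (0, 1):
        # Collect and shuffle the rows of one class at a time.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the rows of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % num_folds].append(i)
    return folds


# Toy imbalanced dataset: 20 positives, 180 negatives.
labels = [1] * 20 + [0] * 180
folds = stratified_folds(labels, num_folds=4)

for k, fold in enumerate(folds):
    counts = Counter(labels[i] for i in fold)
    print("fold", k, "negatives:", counts[0], "positives:", counts[1])
```

With a plain random split, a small fold could by chance receive very few positives, which is the kind of inconsistency described above. Splitting per class avoids that.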

Requirements

This tool is 100% Python, and its primary requirements are numpy and pyspark.

Installation

$ pip install spark-stratifier

Example

You use this exactly as you would the Spark CrossValidator, except that your data will be stratified.

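from spark_stratifier import StratifiedCrossValidator

# pipeline, paramGrid and evaluator are the same objects you would pass
# to the Spark CrossValidator; they are assumed to be defined already.
scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8
)

# matrix is the training dataset, with labels 0 and 1.
model = scv.fit(matrix)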

Contributing

If you want to write some code and contribute to this project, go ahead and start a pull request. We hope this tool is useful for the community and we'd love to hear about how this helps solve your problems!

interviewstreet/spark-stratifier
