
Stratified Cross Validator for Spark


spark-stratifier

Start with Why

When we first started working with Spark at HackerRank, we realized that the sizes of the outcome sets in our dataset varied quite a bit. This led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends Spark's existing CrossValidator class.

Currently, the stratified cross validator works with binary classification problems using labels 0 and 1.
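
To make the idea of stratified folds concrete, here is a minimal pure-Python sketch. It is not spark-stratifier's implementation: the function name stratified_fold_ids and the round-robin assignment are assumptions chosen for illustration, and the real class works on Spark DataFrames rather than Python lists.

```python
import random
from collections import defaultdict


def stratified_fold_ids(labels, num_folds, seed=0):
    """Hypothetical helper: give each row a fold id so that every fold
    keeps roughly the same label mix as the full dataset."""
    rng = random.Random(seed)

    # Group row indices by label (0 or 1 in the binary case).
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled rows of this label round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % num_folds
    return fold_of


if __name__ == "__main__":
    labels = [1] * 8 + [0] * 32  # imbalanced toy data: 20% positives
    folds = stratified_fold_ids(labels, num_folds=4)
    for k in range(4):
        members = [labels[i] for i, f in enumerate(folds) if f == k]
        print(k, len(members), sum(members))
```

With this toy input, each of the 4 folds receives 10 rows, 2 of them positive, so every fold preserves the 20% positive rate. A plain random split gives no such guarantee and can leave a fold with very few positives.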

Read more at engineering.hackerrank.com

Requirements

This tool is pure Python, and its only requirements are numpy and pyspark.

Installation

$ pip install spark-stratifier

Example

Use it exactly as you would the Spark CrossValidator. The only difference is that your data will be stratified.

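from spark_stratifier import StratifiedCrossValidator

# pipeline, paramGrid, and evaluator are assumed to be built the same way
# as for the standard Spark CrossValidator.
scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8,
)

# matrix is the training data, with labels 0 and 1.
model = scv.fit(matrix)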

Contributing


If you want to write some code and contribute to this project, go ahead and start a pull request. We hope this tool is useful for the community and we'd love to hear about how this helps solve your problems!