Primary language: Python. License: MIT.

mockrdd

A Python3 module for testing PySpark code.

The mockrdd.MockRDD class behaves similarly to pyspark.RDD, with the following extra benefits:

  • Extensive sanity checks to identify invalid inputs
  • More meaningful error messages for debugging issues
  • Straightforward to run within pdb
  • Removes Spark dependencies from development and testing environments
  • No Spark overhead when running through a large test suite

See our blog post "Introducing MockRDD for testing PySpark code" for additional details.

Here's a simple example of using MockRDD in a test.

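from mockrdd import MockRDD

def job(rdd):
    # Double each element, then keep only the results greater than 3.
    return rdd.map(lambda x: x * 2).filter(lambda x: x > 3)

assert job(MockRDD.empty()).collect() == []   # no input, no output
assert job(MockRDD.of(1)).collect() == []     # 1 * 2 = 2 is filtered out
assert job(MockRDD.of(2)).collect() == [4]    # 2 * 2 = 4 passes the filter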

In a real project, you'd typically also write a main function that creates an RDD connected to the appropriate sources and sinks and passes it to the job. The tests would live in a separate file and use the unittest module to define test cases.
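
The layout described above might look roughly like the following sketch. It is an illustration only: the split into two files, the names main and JobTest, the app name, and the text-file source and sink are assumptions chosen for this example, not something mockrdd prescribes. The two parts are shown in one block for brevity, and the import guard simply skips the tests when mockrdd is not installed.

```python
import unittest

try:
    from mockrdd import MockRDD
except ImportError:  # lets the file load even if mockrdd is missing
    MockRDD = None


# --- job module (hypothetical file, e.g. job.py) ---

def job(rdd):
    # Pure transformation: works on any object exposing map/filter.
    return rdd.map(lambda x: x * 2).filter(lambda x: x > 3)


def main(input_path, output_path):
    # Assumed wiring: read integers from text files, write results as text.
    # PySpark is imported here so that the tests never need it.
    from pyspark import SparkContext

    sc = SparkContext(appName="mockrdd-example")
    try:
        numbers = sc.textFile(input_path).map(int)
        job(numbers).saveAsTextFile(output_path)
    finally:
        sc.stop()


# --- test module (hypothetical file, e.g. test_job.py) ---

@unittest.skipIf(MockRDD is None, "mockrdd is not installed")
class JobTest(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(job(MockRDD.empty()).collect(), [])

    def test_small_value_is_filtered_out(self):
        self.assertEqual(job(MockRDD.of(1)).collect(), [])

    def test_large_value_is_kept(self):
        self.assertEqual(job(MockRDD.of(2)).collect(), [4])
```

With the tests in their own file, they can be run with python -m unittest. Because only main touches PySpark, the test run itself does not start Spark.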

See the docstring of mockrdd.MockRDD for more information.