
Tips for developing Apache Spark, especially in IntelliJ IDEA

License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Spark Development Tips

PySpark setup in IDEA

First, follow Spark's documented setup instructions for IntelliJ IDEA.

Next, ensure you have a Python version that PySpark supports installed on your machine, and set up a Python SDK in IDEA that points to it. An interpreter managed by pyenv will work fine.

[Screenshot: Python SDK in IDEA]
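If it is unclear which interpreter an SDK actually resolves to (for example, with several pyenv versions installed), one option is to run a short script with that interpreter and compare the output against the SDK settings. This is only a minimal sketch; the function name `describe_interpreter` is made up for illustration and is not part of Spark or IDEA.

```python
import sys


def describe_interpreter():
    """Return the path and version of the interpreter running this script."""
    # sys.version_info[:3] is (major, minor, micro), e.g. (3, 11, 4).
    version = ".".join(str(part) for part in sys.version_info[:3])
    return {"executable": sys.executable, "version": version}


if __name__ == "__main__":
    info = describe_interpreter()
    print(f"executable: {info['executable']}")
    print(f"version:    {info['version']}")
```

The printed executable path should match the interpreter path shown for the SDK in IDEA.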

Then add the python directory as a module, using File/New/Module from Existing Sources. In the settings for the new module, associate it with the Python SDK created above.

[Screenshot: Associate SDK with module]

Now, do a full project build to ensure there are no errors.

Debugging PySpark tests

With all of the above done, you should be able to debug the tests under python/pyspark/tests once the debug configuration has the right working directory and environment. The easiest way to do this is to create a new debug configuration for one of the Python tests (the first run will probably fail), then edit it. You will need to set the following:

  • Working directory: /path/to/source/spark
  • Environment variables:
    • PYSPARK_PYTHON = /path/to/your/python (the same one the SDK points to)

[Screenshot: Set up debug configuration]
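If the configuration still fails, it can help to check the two settings above outside the IDE. The following is a rough sketch, not part of Spark: `check_debug_environment` is a hypothetical helper, and it assumes only what this section states, namely that the working directory is the Spark source root (so it contains python/pyspark/tests) and that PYSPARK_PYTHON names an executable interpreter.

```python
import os
import shutil
from pathlib import Path


def check_debug_environment(working_dir, environ):
    """Return a list of problems found; an empty list means the settings look usable."""
    problems = []

    # The working directory should be the Spark source root.
    tests_dir = Path(working_dir) / "python" / "pyspark" / "tests"
    if not tests_dir.is_dir():
        problems.append(f"no python/pyspark/tests under working directory: {working_dir}")

    # PYSPARK_PYTHON should point to an executable interpreter.
    interpreter = environ.get("PYSPARK_PYTHON")
    if not interpreter:
        problems.append("PYSPARK_PYTHON is not set")
    elif shutil.which(interpreter) is None:
        problems.append(f"PYSPARK_PYTHON is not an executable: {interpreter}")

    return problems


if __name__ == "__main__":
    # Run from the intended working directory with the intended environment.
    for problem in check_debug_environment(os.getcwd(), os.environ):
        print(problem)
```

Running the script from the Spark source root with PYSPARK_PYTHON exported should print nothing; any line it prints names a setting to revisit in the debug configuration.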

At this point, you should be able to re-run the debug configuration and hit breakpoints.