A collection of simple utility programs to benchmark different PySpark scenarios
Module to benchmark different scenarios related to getting or creating a PySpark session
- `spark-submit` adds a small overhead of about 2s
- `spark-submit` does not create a session in the background; it waits until the first `getOrCreate` call is made
- Even when run with `python get_or_create.py`, the total time equals the `spark-submit` overhead plus the first `getOrCreate` call
- The first `getOrCreate` call in a `spark-submit` run takes time; subsequent calls are instant
- Stopping the session and creating a new one (`stop`, then `getOrCreate`) takes approximately 80% of the time of the initial `getOrCreate`
- When running with more RAM, the initial call takes longer and the `spark-submit` overhead also increases, although this does not seem to affect subsequent `getOrCreate` calls once Spark is initialized
- Test in cluster mode
- Test starting in client mode, stopping, then starting in cluster mode, and vice versa
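
The measurements above could be reproduced with something like the following. This is a minimal sketch and not necessarily how `get_or_create.py` is written: the helper names `timed`, `run_benchmark`, and `_spark_get_or_create`, and the app name `get_or_create_benchmark`, are assumptions made for illustration. It times the first `getOrCreate`, a repeat call, a `stop`, and a `getOrCreate` after the stop.

```python
import time
from typing import Callable, List, Tuple


def timed(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def _spark_get_or_create():
    # Imported lazily so the timing helpers work without PySpark installed.
    from pyspark.sql import SparkSession

    # The app name is a hypothetical placeholder.
    return SparkSession.builder.appName("get_or_create_benchmark").getOrCreate()


def run_benchmark(
    get_session: Callable[[], object] = _spark_get_or_create,
) -> List[Tuple[str, float]]:
    """Time the first call, a repeat call, stop(), and a call after stop()."""
    timings: List[Tuple[str, float]] = []

    # First call: expected to pay the full session start-up cost.
    session, elapsed = timed(get_session)
    timings.append(("first getOrCreate", elapsed))

    # Second call: should return the already running session.
    session, elapsed = timed(get_session)
    timings.append(("second getOrCreate", elapsed))

    # Stop the session, then measure how long a fresh one takes to create.
    _, elapsed = timed(session.stop)
    timings.append(("stop", elapsed))

    session, elapsed = timed(get_session)
    timings.append(("getOrCreate after stop", elapsed))

    session.stop()  # clean up the last session
    return timings
```

To print a report, iterate over the result, for example `for label, seconds in run_benchmark(): print(f"{label}: {seconds:.2f}s")`. Running the same script once with `python` and once with `spark-submit`, and comparing wall-clock totals, gives the launcher overhead separately from the in-process timings.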