spark-pyrest

A simple Python package for querying Apache Spark's REST API:

  • simple interface
  • returns pandas DataFrames of task metrics for easy slicing and dicing
  • easy access to executor logs for post-processing

Installation and dependencies

spark-pyrest uses pandas.

To install:

$ git clone https://github.com/rokroskar/spark-pyrest.git
$ cd spark-pyrest
$ python setup.py install

Usage

Initialize the SparkPyRest object with a host address

from spark_pyrest import SparkPyRest

spr = SparkPyRest(host)   # host: address of the Spark instance whose REST API you want to query

Get the current app

spr.app
u'app-20170420091222-0000'
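
Under the hood this talks to Spark's monitoring REST API. As a rough sketch of the equivalent raw request (the /api/v1/applications endpoint is Spark's documented monitoring API; the port 4040 and the exact call spark-pyrest makes are assumptions here, not its actual internals):

# Hypothetical raw equivalent -- assumes the application UI is reachable on
# the default port 4040; spark-pyrest may do this differently.
import requests

apps = requests.get('http://{}:4040/api/v1/applications'.format(host)).json()
apps[0]['id']   # e.g. u'app-20170420091222-0000'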

Get the running/completed stages

spr.stages

[(2, u'groupByKey '),
 (1, u'partitionBy'),
 (0, u'partitionBy'),
 (4, u'runJob at PythonRDD.scala:441'),
 (3, u'distinct')]
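
Since spr.stages returns plain (stage id, name) tuples, ordinary list operations are enough to pick out the stages you care about. For example, to collect the ids of the partitionBy stages shown above (a small sketch):

partition_stages = [sid for sid, name in spr.stages if name.startswith(u'partitionBy')]
# [1, 0] for the stage list above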

Get task metrics for a stage

spr.tasks(1)

   bytesWritten  executorId  executorRunTime          host  localBytesRead  remoteBytesRead  taskId
0       3508265          33           354254  172.19.1.124               0              112   25604
1       3651003          56           353554   172.19.1.74               0              112   24029
2       3905955           9           347724   172.19.1.62               0              111   19719
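
The result is a regular pandas DataFrame, so the usual slicing and sorting applies. For example, to see which hosts ran the ten slowest tasks (a sketch based on the columns shown above):

tasks = spr.tasks(1)
slowest = tasks.sort_values('executorRunTime', ascending=False).head(10)
slowest['host'].value_counts()   # how many of the ten slowest tasks landed on each host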

Check some properties aggregated by host/executor

host_tasks = spr.tasks(1)[['host','bytesWritten','remoteBytesRead','executorRunTime']].groupby('host')

host_tasks.describe()
                     bytesWritten  executorRunTime  remoteBytesRead
host
172.19.1.100  count  1.270000e+02       127.000000     1.270000e+02
              mean   6.436659e+06    337587.913386     3.230165e+06
              std    1.744411e+06    114722.662028     5.590263e+05
              min    0.000000e+00         0.000000     1.098278e+06
              25%    6.426009e+06    292290.000000     2.967360e+06
              50%    6.805269e+06    348174.000000     3.315189e+06
              75%    7.160529e+06    420091.500000     3.582200e+06
              max    8.050828e+06    500831.000000     4.775107e+06
172.19.1.101  count  1.300000e+02       130.000000     1.300000e+02
              mean   6.470575e+06    335401.584615     3.178066e+06
              std    1.725755e+06    115268.286668     6.332040e+05
              min    0.000000e+00         0.000000     0.000000e+00
              25%    6.534020e+06    290462.750000     2.965009e+06
              50%    6.880597e+06    356160.000000     3.327600e+06
              75%    7.197086e+06    407338.750000     3.558479e+06
              max    7.983406e+06    515372.000000     4.526526e+06
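
The same pattern works per executor instead of per host, e.g. to rank executors by their mean task runtime (a sketch using the columns shown above):

exec_tasks = spr.tasks(1)[['executorId', 'executorRunTime', 'bytesWritten']].groupby('executorId')
exec_tasks.mean().sort_values('executorRunTime')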

Plot the data

mean_runtime = host_tasks.mean()['executorRunTime']

mean_runtime.plot(kind='bar')

(figure: bar chart of mean executorRunTime per host)
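
When running from a script rather than a notebook, the same plot can be written to a file with matplotlib (a sketch; pandas plotting already requires matplotlib):

import matplotlib.pyplot as plt

ax = mean_runtime.plot(kind='bar')
ax.set_ylabel('mean executorRunTime')
plt.tight_layout()
plt.savefig('mean_runtime_per_host.png')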

Grab the log of an executor

If your application prints its own metrics to stdout/stderr, you need the executor logs to see them. The logs live on the individual hosts of your cluster, so collecting them programmatically can be tedious; you can of course view them through the Spark web UI, but that is not practical for automated processing. The executor_log method of SparkPyRest returns the full contents of an executor's log for easy post-processing and data extraction.

log = spr.executor_log(0)
print(log)
==== Bytes 0-301839 of 301839 of /tmp/work/app-20170420091222-0000/0/stderr ====
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/04/20 09:12:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14086@x04y08
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for TERM
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for HUP
17/04/20 09:12:23 INFO SignalUtils: Registered signal handler for INT
17/04/20 09:12:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/20 09:12:23 INFO SecurityManager: Changing view acls to: roskar
17/04/20 09:12:23 INFO SecurityManager: Changing modify acls to: roskar
17/04/20 09:12:23 INFO SecurityManager: Changing view acls groups to: 
17/04/20 09:12:23 INFO SecurityManager: Changing modify acls groups to: 
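
With the full log available as a string, extracting your own metrics is ordinary string or regex work. For example, to pull out all WARN lines (a sketch; in practice the pattern would be whatever marker your application prints):

warn_lines = [line for line in log.splitlines() if ' WARN ' in line]
for line in warn_lines[:5]:
    print(line)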

Contributing

Please submit an issue if you discover a bug or have a feature request! Pull requests are also very welcome.