Yelp/mrjob

Spark harness is not populating counters when counter-output-dir is not an S3 path

88manpreet opened this issue · 1 comment

Hadoop counters, an integral feature of Hadoop MapReduce, provide a way to measure progress or count the operations that occur within a map/reduce job.
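For context, a Hadoop Streaming task updates a counter by writing a specially formatted line to stderr. The sketch below only illustrates that line format; the helper names `format_counter` and `increment_counter` are made up for this example and are not taken from the mrjob source.

```python
import sys


def format_counter(group, counter, amount=1):
    """Build the stderr line that Hadoop Streaming treats as a counter update.

    The three fields are comma-separated, so group and counter names
    should not contain commas themselves.
    """
    return "reporter:counter:{},{},{}".format(group, counter, amount)


def increment_counter(group, counter, amount=1, stream=sys.stderr):
    """Write one counter update to the given stream (stderr by default)."""
    stream.write(format_counter(group, counter, amount) + "\n")
    stream.flush()


if __name__ == "__main__":
    # Hypothetical group and counter names, for illustration only.
    increment_counter("word_stats", "lines_seen", 3)
```

Under Hadoop Streaming the framework aggregates these lines itself. On Spark nothing does, which is why the harness has to emulate the feature.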
The Spark harness runs a regular Hadoop Streaming job on Spark using the Spark runner.
It emulates the counters feature so that the same Hadoop Streaming job can run on Spark without any modifications to the job.
The harness script stores the calculated counter values in the given counter_output_dir (--spark-tmp-dir) using Spark's saveAsTextFile API. The Spark runner then reads those values back and reports them to the application user.
This logic works well if the path for --spark-tmp-dir is an S3 path. With a regular local file path (the default), saveAsTextFile creates the counter files (part-*) on the local filesystems of the Spark executors, not on the driver's, so the runner finds no counters unless the executors are running on the same host as the driver.
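One possible way around this, sketched here under assumptions and not necessarily what #2177 does, is to bring the counter values back to the driver first (for example via `collect()` or an accumulator) and write them with ordinary file I/O, so the part file always lands on the driver's filesystem. The function names and the JSON-per-line layout below are hypothetical, and the Spark step is left out: `counters` is assumed to be a plain Python object already on the driver.

```python
import json
import os


def write_counters_on_driver(counters, output_dir):
    """Write already-collected counters to a part file in a local directory.

    `counters` is assumed to be a JSON-serializable object that is already
    in driver memory, so no executor-side write is involved.
    """
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "part-00000")
    with open(path, "w") as f:
        json.dump(counters, f)
        f.write("\n")
    return path


def read_counters(output_dir):
    """Read every non-empty JSON line from the part-* files in output_dir."""
    results = []
    for name in sorted(os.listdir(output_dir)):
        if not name.startswith("part-"):
            continue
        with open(os.path.join(output_dir, name)) as f:
            for line in f:
                if line.strip():
                    results.append(json.loads(line))
    return results
```

This only makes sense for small values such as counters. For a remote counter_output_dir such as S3, saveAsTextFile already works as described above.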

Fixed by #2177.