Exception when using RDD.takeOrdered with BlueMix Apache Spark Service
billreed63 opened this issue · 1 comment
billreed63 commented
This code example runs on local Spark (even on a local Spark cluster):
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var session = spark.sql.SparkSession.builder()
    .appName("test")
    .getOrCreate();
var sc = session.sparkContext();
var rdd = sc.parallelize([1, 2, 3]);
rdd.takeOrdered(2, function (x) {
    return 0;
});
but throws an exception on the BlueMix Spark service:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: ClassNotFound with classloader: org.apache.spark.util.MutableURLClassLoader@de7ec20
StackTrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1461)
billreed63 commented
This seems to be an issue with the BlueMix service. I am able to reproduce the error by using a Scala Spark 2.0 notebook on the BlueMix service, so EclairJS is not in the mix:
var builder1 = org.apache.spark.sql.SparkSession.builder();
var sparkSession1 = builder1.getOrCreate();
var sparkContext1 = sparkSession1.sparkContext;
sparkContext1.version;
var javaSC = new org.apache.spark.api.java.JavaSparkContext(sparkContext1);
var rdd = javaSC.parallelizeDoubles(java.util.Arrays.asList(1.0, 2.0, 3.0, 4.0));
rdd.count;
class DoubleComparator extends java.util.Comparator[java.lang.Double] with java.io.Serializable {
  def compare(o1: java.lang.Double, o2: java.lang.Double) = o1.compareTo(o2)
}
var rdd2 = rdd.takeOrdered(2, new DoubleComparator());
The exception displayed in the notebook is:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: ClassNotFound with classloader: org.apache.spark.util.MutableURLClassLoader@de7ec20
StackTrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1461)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1449)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1448)
As a workaround, convert the RDD to a Dataset/DataFrame, use a sort to order the results, and then take.
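A minimal sketch of that workaround in the Scala notebook, assuming Spark 2.0 with the session's implicit encoders in scope. The sort relies on Spark's built-in ordering for doubles, so no user-defined comparator class has to be shipped to the executors, which is what triggers the ClassNotFound failure:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("workaround").getOrCreate()
// createDataset needs the session's implicit encoders in scope
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(3.0, 1.0, 2.0, 4.0))

// Instead of rdd.takeOrdered(2, new DoubleComparator()), which fails
// on the service, convert to a Dataset, sort, then take. For a
// Dataset[Double] the single column is named "value".
val smallestTwo = spark.createDataset(rdd).sort("value").take(2)
```

Locally this computes the same result as `Seq(3.0, 1.0, 2.0, 4.0).sorted.take(2)`; the difference is only where the ordering runs.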