Exception when using RDD.takeOrdered with BlueMix Apache Spark Service
billreed63 opened this issue · 1 comment
billreed63 commented
This code example runs on local Spark (even on a local Spark cluster):
var eclairjs = require('eclairjs');
var spark = new eclairjs();
var session = spark.sql.SparkSession.builder()
    .appName("test")
    .getOrCreate();
var sc = session.sparkContext();
var rdd = sc.parallelize([1, 2, 3]);
rdd.takeOrdered(2, function (x) {
    return 0;
});
but throws an exception on the BlueMix Spark service:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: ClassNotFound with classloader: org.apache.spark.util.MutableURLClassLoader@de7ec20
StackTrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1461)
billreed63 commented
This seems to be an issue with the BlueMix service. I am able to reproduce the error by using a Scala Spark 2.0 notebook on the BlueMix service, so EclairJS is not in the mix:
var builder1 = org.apache.spark.sql.SparkSession.builder();
var sparkSession1 = builder1.getOrCreate();
var sparkContext1 = sparkSession1.sparkContext;
sparkContext1.version;
var javaSC = new org.apache.spark.api.java.JavaSparkContext(sparkContext1);
var rdd = javaSC.parallelizeDoubles(java.util.Arrays.asList(1.0, 2.0, 3.0, 4.0));
rdd.count;
class DoubleComparator extends java.util.Comparator[java.lang.Double] with java.io.Serializable {
  def compare(o1: java.lang.Double, o2: java.lang.Double) = o1.compareTo(o2)
}
var rdd2 = rdd.takeOrdered(2, new DoubleComparator());
The exception displayed in the notebook is:
Name: org.apache.spark.SparkException
Message: Job aborted due to stage failure: ClassNotFound with classloader: org.apache.spark.util.MutableURLClassLoader@de7ec20
StackTrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1461)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1449)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1448)
As a workaround, convert the RDD to a Dataset/DataFrame, use a sort to order the results, and then take.
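A minimal sketch of that workaround in the Scala notebook, assuming Spark 2.0 with the session's implicit encoders in scope. The sort relies on Spark's built-in ordering for doubles, so no user-defined comparator class has to be shipped to the executors, which is what triggers the ClassNotFound failure:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("workaround").getOrCreate()
// createDataset needs the session's implicit encoders in scope
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(3.0, 1.0, 2.0, 4.0))

// Instead of rdd.takeOrdered(2, new DoubleComparator()), which fails
// on the service, convert to a Dataset, sort, then take. For a
// Dataset[Double] the single column is named "value".
val smallestTwo = spark.createDataset(rdd).sort("value").take(2)
```

Locally this computes the same result as `Seq(3.0, 1.0, 2.0, 4.0).sorted.take(2)`; the difference is only where the ordering runs.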