databricks/spark-sklearn

Crashing for larger data set

manjush3v opened this issue · 2 comments

I am running a SparkContext with the following configuration:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark-master-url")   # placeholder for the real spark:// master URL
        .setAppName("PySparkShell")
        .set("spark.executor.memory", "6800M"))
sc = SparkContext(conf=conf)

The program works fine when X_train has 5000 rows but fails when the size is increased to 12000.
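For reference, the actual spark-sklearn call is not shown above. A minimal sketch of how the library's GridSearchCV wrapper is typically used, with a hypothetical estimator and parameter grid and synthetic data standing in for the real X_train, would look roughly like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # distributed drop-in for sklearn's GridSearchCV

# Synthetic stand-in for the real training data; the failure appears
# once X_train grows from roughly 5000 to 12000 rows.
X_train, y_train = make_classification(n_samples=12000, n_features=50, random_state=0)

# Hypothetical estimator and parameter grid -- not from the original report.
param_grid = {"n_estimators": [20, 50], "max_depth": [4, 8]}
clf = GridSearchCV(sc, RandomForestClassifier(), param_grid=param_grid)
clf.fit(X_train, y_train)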

Spark keeps crashing with the following error:

 Lost task 13.0 in stage 1.0 (TID 109, 172.31.8.203, executor 1): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:230)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        ... 11 more
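The java.io.EOFException from PythonRunner.read generally means the Python worker process exited before the JVM finished reading its output, which on a memory-constrained executor often points to the worker running out of memory as the data grows. A hedged sketch of memory-related settings that may be worth experimenting with (the values are illustrative, not a confirmed fix):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark-master-url")
        .setAppName("PySparkShell")
        # Leave headroom on each worker machine for the Python processes,
        # which live outside the JVM heap sized by spark.executor.memory.
        .set("spark.executor.memory", "4g")
        # Memory each Python worker may use during aggregation before
        # spilling to disk (default 512m).
        .set("spark.python.worker.memory", "1g"))
sc = SparkContext(conf=conf)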

More details here

This is really more of a Spark issue. I don't see spark-sklearn being used anywhere in the snippet.