Pandas Dataframe with incorrect number of columns returned internally
lfdversluis opened this issue · 1 comment
lfdversluis commented
I am running a method using apply on various datasets; while running it on one of the datasets (https://zenodo.org/record/3254606), this error was thrown. Any idea what can cause this? It seems to come from internal code and to be unrelated to my PySpark script.
21/02/11 17:39:23 WARN TaskSetManager: Lost task 0.3 in stage 5.0 (TID 810, 10.149.0.63, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 88, in dump_stream
for batch in iterator:
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
for series in iterator:
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/worker.py", line 429, in mapper
return f(keys, vals)
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/worker.py", line 175, in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
File "/var/scratch/lvs215/big-data-frameworks/spark-3.0.0/python/lib/pyspark.zip/pyspark/worker.py", line 169, in wrapped
raise RuntimeError(
RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 4 Actual: 3
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:99)
at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:49)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithoutKey_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
21/02/11 17:39:23 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job
lfdversluis commented
I think this is a bug on my end.
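For anyone landing here from a search: the error comes from a width check PySpark's worker applies to the pandas.DataFrame returned by a grouped-map function (e.g. `applyInPandas`) before serializing it with Arrow. A minimal pandas-only sketch of that check, with hypothetical column names (the schema and the dropped column are illustrative, not from the dataset above):

```python
import pandas as pd

# Declared output schema has 4 columns (names are hypothetical).
expected_cols = ["job_id", "task_id", "runtime", "status"]

def check_returned_frame(pdf: pd.DataFrame, expected: list) -> pd.DataFrame:
    # Sketch of the validation in pyspark/worker.py: the returned DataFrame
    # must have exactly as many columns as the declared schema.
    if len(pdf.columns) != len(expected):
        raise RuntimeError(
            "Number of columns of the returned pandas.DataFrame doesn't match "
            f"specified schema. Expected: {len(expected)} Actual: {len(pdf.columns)}"
        )
    return pdf

# A UDF that forgets one schema column reproduces the error:
pdf = pd.DataFrame({"job_id": [1], "task_id": [2], "runtime": [3.5]})  # only 3 columns
```

A common way to hit this is a code path inside the UDF (e.g. an early return for empty or degenerate groups) that builds the DataFrame with fewer columns than the schema declares, so the failure only appears on certain partitions of certain datasets.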