databricks/spark-avro

Big performance regression when moving from 2.0.1 to 4.0.0 while loading a column of type ArrayType

arthurdk opened this issue · 4 comments

Hello,

We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.

We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:

df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- _0: string (nullable = true)
|    |    |-- _1: integer (nullable = true)
|    |    |-- _2: long (nullable = true)

In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.

At first I suspected our script and its user-defined functions of being the slow part. But then I reduced the script to a simple read and write of the file:

import com.databricks.spark.avro._

// Read the Avro file and immediately write it back, with no transformations in between
spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")

And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
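
For reference, a minimal sketch of one way to time that round trip (assuming Spark 2.1+, where SparkSession provides a time helper; the paths are placeholders):

import com.databricks.spark.avro._

// Prints the elapsed wall-clock time of the read-and-write round trip
spark.time {
  spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
}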

To give you more info on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:

import org.apache.spark.sql.functions._

val df = spark.read.avro("/path/to/file/in")
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
|        avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108|        1|   143220|
+-----------------+---------+---------+

Could you look into this? If you need any additional information feel free to ask!

The read and write paths are indeed slower in the current release.
In version 2.0.1:

read path:   Avro => Row
write path: Row => Avro

while in 4.0:

read path:   Avro => Row => InternalRow
write path:  InternalRow => Row => Avro

The conversion between Row and InternalRow is slow.
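
As a rough illustration of where the time goes, here is a sketch using CatalystTypeConverters (an internal Catalyst API, shown only for illustration and not necessarily the exact code path spark-avro takes):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types._

// Schema matching the reported array-of-struct column
val schema = StructType(Seq(
  StructField("field3", ArrayType(StructType(Seq(
    StructField("_0", StringType),
    StructField("_1", IntegerType),
    StructField("_2", LongType)))))))

// The converter walks every array element and every struct field, so a row
// whose array holds 143220 elements pays that traversal on read and again on write.
val toInternal = CatalystTypeConverters.createToCatalystConverter(schema)
val externalRow = Row(Seq(Row("a", 1, 2L)))
val internalRow = toInternal(externalRow) // per-element conversion happens here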

The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

In the next release, this problem should be fixed, as the paths will become:

read path:   Avro => InternalRow
write path:  InternalRow => Avro
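
For the curious, a minimal sketch of the direct path (hypothetical code, not the actual upcoming implementation): decoded Avro values are placed straight into a GenericInternalRow, using Catalyst's internal string representation:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.unsafe.types.UTF8String

// Hypothetical direct conversion: no intermediate external Row is built
def toInternalRow(field1: String, field2: String): InternalRow =
  new GenericInternalRow(Array[Any](
    UTF8String.fromString(field1), // InternalRow stores strings as UTF8String
    UTF8String.fromString(field2)))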

Many thanks for the explanation! Do you have an ETA for the next release?

There is no ETA yet. I will comment on this issue once it is fixed.

May I ask if this issue has already been fixed?

Our test Avro file still shows more than a 10% performance degradation compared to Spark 1.6.