Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType
arthurdk opened this issue · 4 comments
Hello,
We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.
We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:
df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _0: string (nullable = true)
| | |-- _1: integer (nullable = true)
| | |-- _2: long (nullable = true)
In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.
At first I suspected our script & user-defined functions of being quite slow. But then I reduced the script to a simple read & write of our file:
spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
To give you more info on the files we are loading: there are ~2,500,000 entries, and the number of struct elements in the array can be quite high:
val df = spark.read.avro("/path/to/file/in")
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
| avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108| 1| 143220|
+-----------------+---------+---------+
Could you look into this? If you need any additional information feel free to ask!
The read and write paths are indeed slower in the current release.
In the 2.0.1 version:
read path: Avro => Row
write path: Row => Avro
while in 4.0.0:
read path: Avro => Row => InternalRow
write path: InternalRow => Row => Avro
The conversion between Row and InternalRow is slow.
The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
In the next release, this problem should be fixed; the paths become:
read path: Avro => InternalRow
write path: InternalRow => Avro
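A minimal, self-contained sketch of why the extra hop matters. These are hypothetical stand-in types, not Spark's actual classes: every record pays for an additional object conversion on each path, and wide array columns multiply that cost per record.

```scala
// Hypothetical stand-ins for an Avro record, Spark's external Row, and
// Tungsten's InternalRow -- illustration only, not Spark's real internals.
final case class AvroRecord(fields: Vector[Any])
final case class Row(values: Vector[Any])
final case class InternalRow(values: Array[Any])

// spark-avro 2.0.1 read path: a single conversion, Avro => Row.
def avroToRow(r: AvroRecord): Row = Row(r.fields)

// spark-avro 4.0.0 read path: two conversions, Avro => Row => InternalRow.
// The intermediate Row (and the copy it implies) is the overhead described
// above; records with large arrays of structs pay it on every element.
def rowToInternal(r: Row): InternalRow = InternalRow(r.values.toArray)
def readPath400(r: AvroRecord): InternalRow = rowToInternal(avroToRow(r))

// The planned fix: convert directly, Avro => InternalRow, skipping Row.
def readPathNext(r: AvroRecord): InternalRow = InternalRow(r.fields.toArray)
```

Both paths produce the same data; the 4.0.0 path simply allocates and copies one extra intermediate object per record.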
Many thanks for the explanation! Do you have an ETA for the next release?
There is no ETA yet. I will comment on this issue once it is fixed.
May I ask if this issue has been fixed yet?
Our test Avro file still shows more than a 10% performance degradation compared to Spark 1.6.