Does withColumnRenamed have a hidden cost similar to the hidden cost of withColumn?
Closed this issue · 1 comments
MrPowers commented
@manuzhang has a great article on the hidden cost of withColumn.
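This code calls withColumn once per column, which the Catalyst optimizer doesn't fully optimize, so it isn't performant:
import org.apache.spark.sql.types.IntegerType

// One withColumn call (and one new DataFrame) per column.
df.columns.foldLeft(df) { case (tempDf, c) =>
  tempDf.withColumn(c, tempDf(c).cast(IntegerType))
}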
This code is better and runs faster:
df.select(df.columns.map { c =>
  df(c).cast(IntegerType)
}: _*)
Does withColumnRenamed have the same hidden cost?
Here's how spark-daria implements the renameColumns method:
def renameColumns(f: String => String): DataFrame =
  df.columns.foldLeft(df)((tempDf, c) => tempDf.withColumnRenamed(c, f(c)))
Should the renameColumns method be reimplemented with a select? cc: @nvander1 @snithish @gorros
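For reference, a select-based version might look like the minimal sketch below. The name renameColumnsWithSelect, the object wrapper, and the sample data are made up for illustration and are not part of spark-daria. It assumes column names are unique and contain no dots, since df(c) would otherwise be ambiguous or be parsed as a nested field reference.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameColumnsSketch {
  // Hypothetical select-based alternative to the foldLeft version above:
  // every column is aliased inside a single projection, so only one new
  // DataFrame is created no matter how many columns there are.
  def renameColumnsWithSelect(df: DataFrame)(f: String => String): DataFrame =
    df.select(df.columns.map(c => df(c).as(f(c))): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data, made up for this example.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    val renamed = renameColumnsWithSelect(df)(c => s"${c}_renamed")
    renamed.printSchema()

    spark.stop()
  }
}
```

One behavioral difference to keep in mind: withColumnRenamed silently does nothing when the column doesn't exist, while this sketch always aliases every column of the input.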
nvander1 commented
AFAIK it would have the same issue that withColumn does, since the analyzer has to run for each intermediate DataFrame along the way.
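One rough way to check this is to time how long it takes just to build each plan on a wide DataFrame, without running any action. This is only a sketch: the timed helper, the object name, the width of 500 columns, and the _r suffix are all arbitrary choices for illustration, and the numbers will vary by Spark version and machine.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object AnalyzerCostSketch {
  // Hypothetical helper: returns the block's result and the elapsed
  // wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("analyzer-cost-sketch")
      .getOrCreate()

    // A single-row DataFrame with many columns (width chosen arbitrarily).
    val width = 500
    val wide = spark.range(1).select((0 until width).map(i => lit(i).as(s"c$i")): _*)

    // foldLeft: one withColumnRenamed call, and one new DataFrame, per column.
    val (folded, foldMs) = timed {
      wide.columns.foldLeft(wide)((tmp, c) => tmp.withColumnRenamed(c, s"${c}_r"))
    }

    // select: all aliases in a single projection, so a single new DataFrame.
    val (selected, selectMs) = timed {
      wide.select(wide.columns.map(c => wide(c).as(s"${c}_r")): _*)
    }

    // Both approaches should produce the same column names.
    println(s"same columns: ${folded.columns.sameElements(selected.columns)}")
    println(s"foldLeft + withColumnRenamed: $foldMs ms")
    println(s"single select:                $selectMs ms")

    spark.stop()
  }
}
```

No action is triggered here, so any difference comes from constructing the DataFrames rather than from executing the query.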