mrpowers-io/spark-daria

Does withColumnRenamed have a hidden cost similar to the hidden cost of withColumn

Closed this issue · 1 comments

@manuzhang has a great article on the hidden cost of withColumn.

This code isn't fully optimized by the Catalyst optimizer and isn't performant.

df.columns.foldLeft(df) { case (df, col) =>
  df.withColumn(col, df(col).cast(IntegerType))
}

This code is better and runs faster:

df.select(df.columns.map { col =>
  df(col).cast(IntegerType)
}: _*)

Does withColumnRenamed have the same hidden cost?

Here's how spark-daria implements the renameColumns method:

def renameColumns(f: String => String): DataFrame =
  df.columns.foldLeft(df)((tempDf, c) => tempDf.withColumnRenamed(c, f(c))
)

Should the renameColumns method be reimplemented with a select? cc: @nvander1 @snithish @gorros

Afaik it would have the same issue that withColumn does since it has to run the analyzer for each dataframe along the way.