MrPowers/spark-daria

camelCaseToSnakeCaseColumns doesn't seem to work

Closed this issue · 4 comments

I have tried using the camelCaseToSnakeCaseColumns transform, but it doesn't seem to work when combined with other transformations.

Here is the production code:

def standardRefinedColumnCleaning()(df: Dataset[Row]): Dataset[Row] = {
    df.transform(snakeCaseColumns())
      .transform(camelCaseToSnakeCaseColumns())
      .transform(sortColumns())
  }

And the test case

it("should snake case columns for columns with spaces or camel cased") {
      val someDF = Seq(
        ("foo", "bar")
      ).toDF("SomeColumn", "A b C")

      val refinedDataSet = RefinedTransforms.standardRefinedColumnCleaning()(someDF)

      assert(refinedDataSet.columns.toSeq == Seq("a_b_c", "some_column"))
    }

The result is

Expected :Array("a_b_c", "some_column")
Actual   :Array("a_b_c", "somecolumn")

If I run only camelCaseToSnakeCaseColumns on its own, it works.
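
My guess is that this is an ordering problem: if snakeCaseColumns lowercases the names before camelCaseToSnakeCaseColumns runs, the camel-case word boundaries are already gone. A minimal sketch with plain string functions (these only approximate the two renames for illustration; they are not spark-daria's actual implementations):

```scala
// Rough approximations of the two column renames, for illustration only.
def snakeCase(s: String): String =
  s.toLowerCase.replace(" ", "_")

def camelToSnake(s: String): String =
  s.replaceAll("([a-z\\d])([A-Z])", "$1_$2").toLowerCase

// Lowercasing first destroys the camel-case boundaries:
camelToSnake(snakeCase("SomeColumn")) // "somecolumn"

// Handling camel case first preserves them:
snakeCase(camelToSnake("SomeColumn")) // "some_column"
```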

@darrenhaken - Thanks for opening this issue. I added a snakifyColumns() method that uses the snakify() method from the Lift web framework.

This should meet your needs. I'll do a release now.

@darrenhaken - I did a release that includes the snakifyColumns() method, see version 0.28.0.

Let me know how this works for you!

I still saw some weird behaviour when combining the transforms; snakifyColumns() didn't cover all of my use cases.

I managed to get it working by experimenting with the order of the transforms.
Here is what I ended up with, along with the test case, so you can see what I mean.

df
  .transform(sortColumns())
  .transform(snakifyColumns())
  .transform(camelCaseToSnakeCaseColumns())
  .transform(snakeCaseColumns())

it("should snake case columns for columns with spaces or camel cased") {
      val someDF = Seq(
        ("foo", "bar", "car")
      ).toDF("SomeColumn", "Another Column", "BAR_COLUMN")

      val refinedDataSet = RefinedTransforms.standardRefinedColumnCleaning()(someDF)

      assert(refinedDataSet.columns.toSeq == Seq("another_column", "bar_column", "some_column"))
    }

Does that help and make sense?

@darrenhaken - glad you were able to get this working!

@gorros created a cool renameColumns() DataFrame extension that can also be used to solve this problem.

The logic required here is a bit more involved, but it's easy to customize for the edge cases in your data.

val df = spark
  .createDF(
    List(
      ("foo", "bar", "car")
    ),
    List(
      ("SomeColumn", StringType, true),
      ("Another Column", StringType, true),
      ("BAR_COLUMN", StringType, true)
    )
  )
  .renameColumns(
    _.trim
      .replaceAll("[\\s-]+", "_")
      .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
      .replaceAll("([a-z\\d])([A-Z])", "$1_$2")
      .toLowerCase
  )

df.columns.toList ==> Seq(
  "some_column",
  "another_column",
  "bar_column"
)
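
The rename function itself is plain string manipulation, so you can sanity-check it without Spark. Here is the same regex chain as a standalone function (the normalize name is just for this sketch):

```scala
// The same regex chain as the renameColumns example above,
// as a standalone function ("normalize" is a made-up name for this sketch):
def normalize(s: String): String =
  s.trim
    .replaceAll("[\\s-]+", "_")                  // spaces/hyphens -> underscore
    .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2") // split runs of caps, e.g. "XMLParser"
    .replaceAll("([a-z\\d])([A-Z])", "$1_$2")    // split camelCase boundaries
    .toLowerCase

normalize("SomeColumn")     // "some_column"
normalize("Another Column") // "another_column"
normalize("BAR_COLUMN")     // "bar_column"
```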

Let me know what you think!