camelCaseToSnakeCaseColumns doesn't seem to work
Closed this issue · 4 comments
I have tried using the camelCaseToSnakeCaseColumns transform but it doesn't seem to be working when combined with other transformations.
Here is the production code:
def standardRefinedColumnCleaning()(df: Dataset[Row]): Dataset[Row] = {
  df.transform(snakeCaseColumns())
    .transform(camelCaseToSnakeCaseColumns())
    .transform(sortColumns())
}
And the test case
it("should snake case columns for columns with spaces or camel cased") {
  val someDF = Seq(
    ("foo", "bar")
  ).toDF("SomeColumn", "A b C")
  val refinedDataSet = RefinedTransforms.standardRefinedColumnCleaning()(someDF)
  assert(refinedDataSet.columns.toSeq == Seq("a_b_c", "some_column"))
}
The result is
Expected :Array("a_b_c", "some_column")
Actual :Array("a_b_c", "somecolumn")
If I run camelCaseToSnakeCaseColumns() on its own, it works.
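The ordering looks like the culprit here: if snakeCaseColumns() lowercases the names before camelCaseToSnakeCaseColumns() runs, the camelCase boundaries are already gone by the time the second transform sees them. A minimal plain-Scala sketch of the problem (snakeCase and camelToSnake below are my approximations of the two transforms' string logic, not the actual library code):

```scala
// Assumed approximations of the two column transforms, for illustration only:
// snakeCase replaces whitespace with underscores and lowercases;
// camelToSnake inserts an underscore at each lower-to-upper boundary.
def snakeCase(s: String): String =
  s.replaceAll("\\s+", "_").toLowerCase

def camelToSnake(s: String): String =
  s.replaceAll("([a-z\\d])([A-Z])", "$1_$2").toLowerCase

// Running snakeCase first destroys the camelCase boundary:
camelToSnake(snakeCase("SomeColumn")) // "somecolumn" -- boundary already lost
snakeCase(camelToSnake("SomeColumn")) // "some_column"
```

Under that assumption, swapping the order of the two transforms (camelCase splitting before lowercasing) would produce the expected "some_column".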
@darrenhaken - Thanks for opening this issue. I added a snakifyColumns() method that uses the snakify() method from the Lift web framework. This should meet your needs. I'll do a release now.
@darrenhaken - I did a release that includes the snakifyColumns() method; see version 0.28.0. Let me know how this works for you!
I still saw some odd behaviour when combining the transforms; snakifyColumns() didn't cover all the use cases on its own. I managed to get it working by experimenting with the order of the transforms. Here is what I ended up with, along with the test case, so you can see what I mean.
df
  .transform(sortColumns())
  .transform(snakifyColumns())
  .transform(camelCaseToSnakeCaseColumns())
  .transform(snakeCaseColumns())
it("should snake case columns for columns with spaces or camel cased") {
  val someDF = Seq(
    ("foo", "bar", "car")
  ).toDF("SomeColumn", "Another Column", "BAR_COLUMN")
  val refinedDataSet = RefinedTransforms.standardRefinedColumnCleaning()(someDF)
  assert(refinedDataSet.columns.toSeq == Seq("another_column", "bar_column", "some_column"))
}
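For context, Lift's snakify() can be approximated in plain Scala (this is a sketch modeled on Lift's StringHelpers.snakify, not the exact library code). Note that it leaves spaces untouched, which would explain why snakeCaseColumns() still has to run at the end of the chain above:

```scala
// Sketch of a snakify-style helper (assumed, modeled on Lift's snakify):
def snakify(name: String): String =
  name
    .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2") // split runs of capitals before a word
    .replaceAll("([a-z\\d])([A-Z])", "$1_$2")    // split lower-to-upper boundaries
    .toLowerCase

snakify("SomeColumn")     // "some_column"
snakify("BAR_COLUMN")     // "bar_column"
snakify("Another Column") // "another column" -- the space survives
```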
Does that help and make sense?
@darrenhaken - glad you were able to get this working!
@gorros created a cool renameColumns() DataFrame extension that can also be used to solve this problem.
In this example, the logic required is a bit complex, but it's easily customizable for all the edge cases in your data.
val df = spark
  .createDF(
    List(
      ("foo", "bar", "car")
    ),
    List(
      ("SomeColumn", StringType, true),
      ("Another Column", StringType, true),
      ("BAR_COLUMN", StringType, true)
    )
  )
  .renameColumns(
    _.trim
      .replaceAll("[\\s-]+", "_")
      .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
      .replaceAll("([a-z\\d])([A-Z])", "$1_$2")
      .toLowerCase
  )
df.columns.toList ==> Seq("some_column", "another_column", "bar_column")
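To make that rename logic easy to unit test without a SparkSession, the same function passed to renameColumns() can be pulled out as a plain string helper (toSnakeCase is a name I've chosen for illustration; it is not part of the library):

```scala
// Same rename logic as the renameColumns() example, as a standalone function:
// collapse whitespace/hyphens to underscores, split camelCase and acronym
// boundaries with underscores, then lowercase everything.
def toSnakeCase(name: String): String =
  name.trim
    .replaceAll("[\\s-]+", "_")
    .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
    .replaceAll("([a-z\\d])([A-Z])", "$1_$2")
    .toLowerCase

toSnakeCase("SomeColumn")     // "some_column"
toSnakeCase("Another Column") // "another_column"
toSnakeCase("BAR_COLUMN")     // "bar_column"
```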
Let me know what you think!