assert_df_equality throws SchemasNotEqualError when the dataframes are identical (except for the metadata)

Question

Closed this issue a year ago · 3 comments

I have a test that defines how the production table will be created. I add comments to the columns so that users who consume the table can understand what each column holds. The problem is that when I compare that table against a hand-built DataFrame in a test, chispa raises an exception because of a schema mismatch.

Example:

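import pyspark.sql.types as T
from chispa import assert_df_equality

# Assumes an active SparkSession bound to `spark`.
# Create the table with a column comment, which Spark stores as field metadata.
spark.sql("""
CREATE TABLE IF NOT EXISTS foo (
    id LONG COMMENT "a comment",
    value INT
)
""")
spark.sql("INSERT INTO foo VALUES (1, 1)")

# Actual: schema read back from the table, where `id` carries the comment.
df = spark.table("foo")

# Expected: the same columns and types, built by hand without metadata.
schema = T.StructType([
    T.StructField("id", T.LongType(), True),
    T.StructField("value", T.IntegerType(), True),
])
expected = spark.createDataFrame(data=[(1, 1)], schema=schema)

# Raises SchemasNotEqualError even though names, types, and nullability match.
assert_df_equality(df, expected)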

The assertion fails on the schema comparison. The output shows that value matches (it has no metadata) but id does not, even though the two id fields look identical in the printed output. If you remove the COMMENT clause from the table definition, the test passes. Being forced to add the metadata to the StructType is far more tedious. Could there be a boolean option (for example ignore_schema_metadata) to ignore the metadata?
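Until such an option is available, one possible workaround is to blank out the metadata on the actual schema before comparing. The sketch below is not part of chispa. It works on the plain dict form of a schema (the shape returned by PySpark's df.schema.jsonValue()), and the helper name strip_metadata is hypothetical. The sample dicts are written by hand to mirror the foo table above, on the assumption that the column comment is stored under the "comment" key of the field metadata.

```python
def strip_metadata(node):
    """Return a copy of a schema dict with every "metadata" entry emptied.

    Hypothetical helper, not a chispa or PySpark function. It walks nested
    structs, arrays, and maps generically by recursing into dicts and lists.
    """
    if isinstance(node, dict):
        return {
            key: ({} if key == "metadata" else strip_metadata(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [strip_metadata(item) for item in node]
    return node


# Hand-written stand-in for spark.table("foo").schema.jsonValue():
# the `id` field carries the column comment as metadata.
actual_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True,
         "metadata": {"comment": "a comment"}},
        {"name": "value", "type": "integer", "nullable": True, "metadata": {}},
    ],
}

# Hand-written stand-in for the expected schema, built without metadata.
expected_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "value", "type": "integer", "nullable": True, "metadata": {}},
    ],
}

# The raw schemas differ only in metadata; once stripped, they are equal.
schemas_differ_before = actual_schema != expected_schema
schemas_match_after = strip_metadata(actual_schema) == expected_schema
```

In a real test, the cleaned dict could presumably be turned back into a schema with T.StructType.fromJson(strip_metadata(df.schema.jsonValue())) and the DataFrame rebuilt with spark.createDataFrame(df.rdd, cleaned_schema) before calling assert_df_equality. That part is untested here and costs an extra pass over the data.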

Answer 1 · 2023-09-26T04:33:17.000Z

@fedemgp - thanks for reporting this. ignore_schema_metadata sounds like a good suggestion.

Answer 2 · 2023-10-01T03:19:08.000Z

Here's a PR to add this option: #74

Answer 3 · 2023-10-02T15:08:07.000Z

Awesome, thanks!! I will try it.