DataFramesNotEqualError when dataframes appear identical

Question

bweisberg opened this issue 3 years ago · 6 comments

I have two dataframes that appear identical, but assert_approx_df_equality throws a DataFramesNotEqualError. The failure may be intermittent: this code passed on the development cluster but failed in the test pipeline. Also, changing the precision from 0.001 to 1.0 allows the test to pass, although I don't see any differences between the actual and expected output.

actual_df = ...create the dataframe with my component...

expected_data = [ 
        ('POINT (2.5 1.5)', 1.0, 1.0, 0.7071067811865476, 2.0, 2.0, False),
        ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
        ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False),
        ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)
     ]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")

actual_df.show()
expected_df.show()

assert_approx_df_equality(actual_df, expected_df, 0.001, ignore_nullable=True)

The output of the show commands:

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

The exception shows the last three rows are different, though I can't spot the differences.

DataFramesNotEqualError                   Traceback (most recent call last)
<command-340851985589312> in <module>
     50 expected_df.show()
     51 
---> 52 assert_approx_df_equality(actual_df, expected_df, 0.001, ignore_nullable=True)

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_approx_df_equality(df1, df2, precision, ignore_nullable)
     38 def assert_approx_df_equality(df1, df2, precision, ignore_nullable=False):
     39     assert_schema_equality(df1.schema, df2.schema, ignore_nullable)
---> 40     assert_generic_rows_equality(df1, df2, are_rows_approx_equal, [precision])
     41 
     42 

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_generic_rows_equality(df1, df2, row_equality_fun, row_equality_fun_args)
     62             t.add_row([r1, r2])
     63     if allRowsEqual == False:
---> 64         raise DataFramesNotEqualError("\n" + t.get_string())
     65 
     66 

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|   Row(wkt='POINT (2.5 1.5)', point_id=1.0, poly_id=2.0, distance=0.7071067811865476, X=3.0, Y=2.0, isOnRight=False)    |   Row(wkt='POINT (2.5 1.5)', point_id=1.0, poly_id=1.0, distance=0.7071067811865476, X=2.0, Y=2.0, isOnRight=False)    |
| Row(wkt='POINT (2.55 2.25)', point_id=2.0, poly_id=2.0, distance=0.14142135623730964, X=2.65, Y=2.35, isOnRight=False) | Row(wkt='POINT (2.55 2.25)', point_id=2.0, poly_id=2.0, distance=0.14142135623730964, X=2.65, Y=2.35, isOnRight=False) |
|          Row(wkt='POINT (4.75 2.5)', point_id=3.0, poly_id=3.0, distance=0.5, X=5.25, Y=2.5, isOnRight=False)          |          Row(wkt='POINT (4.75 2.5)', point_id=3.0, poly_id=3.0, distance=0.5, X=5.25, Y=2.5, isOnRight=False)          |
|           Row(wkt='POINT EMPTY', point_id=4.0, poly_id=None, distance=-999.0, X=nan, Y=nan, isOnRight=False)           |           Row(wkt='POINT EMPTY', point_id=4.0, poly_id=None, distance=-999.0, X=nan, Y=nan, isOnRight=False)           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

And the two schemas compared:

root
 |-- wkt: string (nullable = true)
 |-- point_id: double (nullable = true)
 |-- poly_id: double (nullable = true)
 |-- distance: double (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- isOnRight: boolean (nullable = true)

root
 |-- wkt: string (nullable = true)
 |-- point_id: double (nullable = true)
 |-- poly_id: double (nullable = true)
 |-- distance: double (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)
 |-- isOnRight: boolean (nullable = true)

Answer 1 · 2021-12-17T18:40:04.000Z

This issue is probably related to the NaN values. I see now that assert_df_equality has an allow_nan_equality=True option. Is there an allow_nan_equality option for assert_approx_df_equality?
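
For context on why NaN values are a suspect here: under IEEE 754 a NaN never compares equal to anything, including another NaN, so a plain `==` on the 'POINT EMPTY' row reports a mismatch even when both rows print identically. The sketch below is plain Python, not chispa's implementation; `values_equal` and `rows_equal` are hypothetical helpers that only mimic what an allow_nan_equality-style flag is meant to do.

```python
import math


def values_equal(a, b, allow_nan_equality=False):
    """Compare two cell values, optionally treating NaN as equal to NaN."""
    if allow_nan_equality and isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def rows_equal(r1, r2, allow_nan_equality=False):
    """Compare two rows cell by cell (hypothetical helper, not chispa's code)."""
    return len(r1) == len(r2) and all(
        values_equal(a, b, allow_nan_equality) for a, b in zip(r1, r2)
    )


# The 'POINT EMPTY' row from expected_data, built twice.
row_a = ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)
row_b = ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)

strict_result = rows_equal(row_a, row_b)                            # NaN != NaN
lenient_result = rows_equal(row_a, row_b, allow_nan_equality=True)  # NaN treated as equal
print(strict_result, lenient_result)
```

The strict comparison fails only because of the two NaN cells; every other cell in the row matches.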

Answer 2 · 2021-12-17T18:45:57.000Z

Duplicate of #28 and #29.

Answer 3 · 2021-12-17T19:25:07.000Z

I am curious why this same assert passes in another environment. Also, why does the test pass when the precision is increased from 0.001 to 1.0?
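
One possible explanation for the precision behavior, based on the first row of the exception table in the question: df1 has poly_id=2.0 and X=3.0 where df2 has poly_id=1.0 and X=2.0, so each differing cell is off by exactly 1.0. The sketch below assumes an approximate comparison of the form abs(a - b) <= precision; whether chispa's check is inclusive in exactly this way is an assumption, and `approx_equal` and `row_approx_equal` are hypothetical names.

```python
def approx_equal(a, b, precision):
    """Assumed tolerance rule: values match if they differ by at most `precision`."""
    return abs(a - b) <= precision


def row_approx_equal(r1, r2, precision):
    """Compare the numeric cells of two rows under the assumed tolerance rule."""
    return all(approx_equal(r1[key], r2[key], precision) for key in r1)


# Numeric cells of the first row as reported in the exception table.
actual_row = {"point_id": 1.0, "poly_id": 2.0, "distance": 0.7071067811865476, "X": 3.0, "Y": 2.0}
expected_row = {"point_id": 1.0, "poly_id": 1.0, "distance": 0.7071067811865476, "X": 2.0, "Y": 2.0}

tight = row_approx_equal(actual_row, expected_row, 0.001)  # differences of 1.0 exceed 0.001
loose = row_approx_equal(actual_row, expected_row, 1.0)    # differences of 1.0 fit within 1.0
print(tight, loose)
```

Under that assumption, a precision of 1.0 is wide enough to hide a genuinely different poly_id, which is why loosening the tolerance makes the test pass without the rows actually matching.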

Answer 4 · 2021-12-22T20:51:20.000Z

I see this error intermittently. For example, I can run a notebook cell and it passes, then re-run it and it fails. I switched from assert_approx_df_equality to assert_df_equality to take advantage of allow_nan_equality=True, ignore_row_order=True, and ignore_column_order=True. I now notice that the error message shows dataframes that differ from the output of the two show commands I run before the assert.

point_data = [ 
        ('POINT (2.5 1.5)', 1.0),
        ('POINT (2.55 2.25)', 2.0),
        ('POINT (4.75 2.5)', 3.0),
        ('POINT EMPTY', 4.0),
        (None, 5.0)
     ]

point_df = (spark.createDataFrame(point_data, ["wkt", "point_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "point_id")
    .withMeta("POINT", 4326))

poly_data = [ 
        ('POLYGON ((0.5 0.5, 3.5 3.5, 1.75 2.75, 0.5 0.5))', 1.0),
        ('POLYGON ((1.5 3.5, 4.0 1.0, 3.0 3.0, 1.5 3.5))', 2.0),
        ('POLYGON ((5.25 0.5, 5.25 4.5, 5.26 4.5, 5.26 0.5, 5.25 0.5))', 3.0),
        ('POLYGON EMPTY', 4.0),
        (None, 5.0)
     ]

poly_df = (spark.createDataFrame(poly_data, ["wkt", "poly_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "poly_id")
    .withMeta("POLYGON", 4326))

actual_df = (...create my dataframe...)

expected_data = [ 
        ('POINT (2.5 1.5)', 1.0, 1.0, 0.7071067811865476, 2.0, 2.0, False),
        ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
        ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False),
        ('POINT EMPTY', 4.0, None, -999.0, float('nan'), float('nan'), False)
     ]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")


actual_df.show()
expected_df.show()
#TODO: why does this fail intermittently when dataframes appear equal?
assert_df_equality(actual_df, expected_df, ignore_nullable=True, allow_nan_equality=True, ignore_row_order=True, ignore_column_order=True)

The result of the show commands:

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

+-----------------+--------+-------+-------------------+----+----+---------+
|              wkt|point_id|poly_id|           distance|   X|   Y|isOnRight|
+-----------------+--------+-------+-------------------+----+----+---------+
|  POINT (2.5 1.5)|     1.0|    1.0| 0.7071067811865476| 2.0| 2.0|    false|
|POINT (2.55 2.25)|     2.0|    2.0|0.14142135623730964|2.65|2.35|    false|
| POINT (4.75 2.5)|     3.0|    3.0|                0.5|5.25| 2.5|    false|
|      POINT EMPTY|     4.0|   null|             -999.0| NaN| NaN|    false|
+-----------------+--------+-------+-------------------+----+----+---------+

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |   Row(X=2.0, Y=2.0, distance=0.7071067811865476, isOnRight=False, point_id=1.0, poly_id=1.0, wkt='POINT (2.5 1.5)')    |
|   Row(X=3.0, Y=2.0, distance=0.7071067811865476, isOnRight=False, point_id=1.0, poly_id=2.0, wkt='POINT (2.5 1.5)')    | Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |
|          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |
|           Row(X=nan, Y=nan, distance=-999.0, isOnRight=False, point_id=4.0, poly_id=None, wkt='POINT EMPTY')           |           Row(X=nan, Y=nan, distance=-999.0, isOnRight=False, point_id=4.0, poly_id=None, wkt='POINT EMPTY')           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

Answer 5 · 2021-12-22T21:24:35.000Z

I discovered that, with those inputs, the function I am testing is non-deterministic, so the results in actual_df were changing on each run. That explains why the output of .show() differs from the exception message. After correcting the inputs I still see an inexplicable DataFramesNotEqualError. I have simplified the example to rule out NaN comparison issues:


point_data = [ 
        ('POINT (2.5 1.75)', 1.0),
        ('POINT (2.55 2.25)', 2.0),
        ('POINT (4.75 2.5)', 3.0),
        (None, 5.0)
     ]

point_df = (spark.createDataFrame(point_data, ["wkt", "point_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "point_id")
    .withMeta("POINT", 4326))

poly_data = [ 
        ('POLYGON ((0.5 0.5, 3.5 3.5, 1.75 2.75, 0.5 0.5))', 1.0),
        ('POLYGON ((1.5 3.5, 4.0 1.0, 3.0 3.0, 1.5 3.5))', 2.0),
        ('POLYGON ((5.25 0.5, 5.25 4.5, 5.26 4.5, 5.26 0.5, 5.25 0.5))', 3.0),
        ('POLYGON EMPTY', 4.0),
        (None, 5.0)
     ]

poly_df = (spark.createDataFrame(poly_data, ["wkt", "poly_id"])
    .selectExpr("ST_FromText(wkt) SHAPE", "poly_id")
    .withMeta("POLYGON", 4326))

actual_df = (...call the function with inputs that produce deterministic results...)

expected_data = [ 
        ('POINT (2.5 1.75)', 1.0, 1.0, 0.5303300858899106, 2.125, 2.125, False),
        ('POINT (2.55 2.25)', 2.0, 2.0, 0.14142135623730964, 2.65, 2.35, False),
        ('POINT (4.75 2.5)', 3.0, 3.0, 0.5, 5.25, 2.5, False)
     ]
expected_df = (spark.createDataFrame(expected_data, ["wkt", "point_id", "poly_id", "distance", "X", "Y", "isOnRight"])).sort("point_id")

assert_df_equality(actual_df, expected_df, ignore_nullable=True, allow_nan_equality=False, ignore_row_order=True, ignore_column_order=True)

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_df_equality(df1, df2, ignore_nullable, transforms, allow_nan_equality, ignore_column_order, ignore_row_order)
     25         assert_generic_rows_equality(df1, df2, are_rows_equal_enhanced, [True])
     26     else:
---> 27         assert_basic_rows_equality(df1, df2)
     28 
     29 

/databricks/python/lib/python3.7/site-packages/chispa/dataframe_comparer.py in assert_basic_rows_equality(df1, df2)
     76             else:
     77                 t.add_row([r1, r2])
---> 78         raise DataFramesNotEqualError("\n" + t.get_string())

DataFramesNotEqualError: 
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
|                                                          df1                                                           |                                                          df2                                                           |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+
| Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') | Row(X=2.125, Y=2.125, distance=0.5303300858899106, isOnRight=False, point_id=1.0, poly_id=1.0, wkt='POINT (2.5 1.75)') |
| Row(X=2.875, Y=2.125, distance=0.5303300858899106, isOnRight=False, point_id=1.0, poly_id=2.0, wkt='POINT (2.5 1.75)') | Row(X=2.65, Y=2.35, distance=0.14142135623730964, isOnRight=False, point_id=2.0, poly_id=2.0, wkt='POINT (2.55 2.25)') |
|          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |          Row(X=5.25, Y=2.5, distance=0.5, isOnRight=False, point_id=3.0, poly_id=3.0, wkt='POINT (4.75 2.5)')          |
+------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+

Answer 6 · 2021-12-22T21:33:17.000Z

I discovered another non-deterministic result of the function I'm testing. I'm going to close this issue at this point and thank chispa for helping me catch this behavior.
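
For anyone landing here with the same symptom (show output that looks identical but an assert that reports different rows), a quick determinism check can help before suspecting the comparison library. The sketch below is plain Python; `is_deterministic` is a hypothetical helper, and `stable` and `flaky` are stand-ins for the function under test. It evaluates the rows several times and compares the runs, using repr so that NaN cells and row order do not cause false alarms.

```python
import itertools


def is_deterministic(produce_rows, runs=5):
    """Return False if repeated evaluations of `produce_rows` ever disagree."""
    baseline = sorted(repr(row) for row in produce_rows())
    for _ in range(runs - 1):
        if sorted(repr(row) for row in produce_rows()) != baseline:
            return False
    return True


def stable():
    """Stand-in for a function that always yields the same rows."""
    return [(1.0, 1.0), (2.0, 2.0)]


# Stand-in for the behavior in this thread: poly_id for point 1 alternates between runs.
_poly_ids = itertools.cycle([1.0, 2.0])


def flaky():
    return [(1.0, next(_poly_ids)), (2.0, 2.0)]


print(is_deterministic(stable), is_deterministic(flaky))
```

With Spark, the callable could be something like `lambda: actual_df.collect()`, assuming actual_df is the dataframe under test, since each action may re-evaluate the plan. A passing check does not prove determinism; it only fails to find a counterexample in the given number of runs.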