canimus/cuallee

Results are not shown in order when running 1000 rules

Closed this issue · 2 comments

It appears that results are returned out of order when running a check with 1000 rules.
The test scenario uses the NYC Taxi data set with 20M rows.

from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel

spark = SparkSession.builder.getOrCreate()

# Build one check with 1000 rules against the NYC Taxi data set
df = spark.read.parquet("temp/data/*.parquet")
c = Check(CheckLevel.Warning, "NYC")
for i in range(1000):
    c.is_greater_than("fare_amount", i)
c.validate(spark, df).show(n=1000, truncate=False)

# The displayed dataframe contains rows in the wrong order.
# e.g. at id 995 there is a clear discrepancy: it is certainly not true
# that 10% of the rows have `fare_amount > 995`
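A possible workaround until the fix lands (a sketch, assuming the validation output carries the per-rule `id` column present in cuallee's result dataframe) is to sort the results explicitly before displaying them:

# Sort the validation results by rule id before showing them,
# so the display order matches the order in which the rules were added
c.validate(spark, df).orderBy("id").show(n=1000, truncate=False)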

Fixed during the integration of split computation.

Done in the pre-release of the compute method split.