MrPowers/spark-daria

Should we add more validators?

MrPowers opened this issue · 12 comments

I'm watching this talk and they're discussing how they run validations that are specified in YAML files.

Would it be useful to add more validations to spark-daria? Should we have `DariaValidator.validatesLengthIsLessThan("my_cool_column", 5)` and `DariaValidator.validatesBetween("my_integers", 3, 10)` methods?
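A minimal sketch of what those could look like, assuming the new methods take the DataFrame explicitly (like the existing validators) and fail fast by throwing; the signatures and error type are assumptions, not a settled design:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length}

// Sketch only: count the offending rows and throw if any are found.
def validatesLengthIsLessThan(df: DataFrame, columnName: String, maxLength: Int): Unit = {
  val badCount = df.filter(length(col(columnName)) >= maxLength).count()
  if (badCount > 0)
    throw new IllegalArgumentException(
      s"$badCount rows have $columnName values with length >= $maxLength"
    )
}

def validatesBetween(df: DataFrame, columnName: String, lower: Int, upper: Int): Unit = {
  val badCount = df.filter(!col(columnName).between(lower, upper)).count()
  if (badCount > 0)
    throw new IllegalArgumentException(
      s"$badCount rows have $columnName values outside [$lower, $upper]"
    )
}
```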

BTW, all of the DataFrameValidator methods were copied over to the DariaValidator object due to a weird bug I ran into when using SBT shading with traits. Let me know if you're interested in understanding more about the SBT shading weirdness I uncovered.

We'll also need to think about what to do when we have data that doesn't satisfy the validations. Do we throw an error? Filter out the invalid rows? Thoughts, @gorros @afranzi @nvander1?

@MrPowers Usually I start with a minimal amount of validation and let Spark fail; then, after getting familiar with the possible impurities in the data, I start to add validations and fixes. I usually filter out invalid rows if it is time-series data, but if the rows need to be processed, I put them aside.

An approach I've thought about is saving row-level metadata recording all the transformations/validations applied to each row, along with the status of each validation:

| name     | age | country | validations_performed                                                             |
|----------|-----|---------|-----------------------------------------------------------------------------------|
| Alice    | 23  | US      | [{"name": "positiveAge", "valid": true}, {"name": "countryCode", "valid": true}]  |
| Bob      | -2  | US      | [{"name": "positiveAge", "valid": false}, {"name": "countryCode", "valid": true}] |
| Carol    | 24  | uS.     | [{"name": "positiveAge", "valid": true}, {"name": "countryCode", "valid": false}] |
| Danielle | 30  | US      | [{"name": "positiveAge", "valid": true}, {"name": "countryCode", "valid": true}]  |

Then you can filter based on the metadata, for example:
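With the array-of-structs layout above, Spark's higher-order SQL functions (available from 2.4) can express the filter; `annotatedDF` and the exact column layout are assumptions, not an existing spark-daria API:

```scala
import org.apache.spark.sql.functions.expr

// Rows where every validation passed.
val passedDF = annotatedDF.filter(!expr("exists(validations_performed, v -> not v.valid)"))

// Rows where at least one validation failed.
val failedDF = annotatedDF.filter(expr("exists(validations_performed, v -> not v.valid)"))
```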

Another option would be to filter the data into two sets after each validation:

```scala
val inputDF = ...
val (passedStage1DF, failedStage1DF) = validate1(inputDF)
val (passedStage2DF, failedStage2DF) = validate2(passedStage1DF)
```

You could generalize this to obtain a Seq of (passedStageDF, failedStageDF) pairs.
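A sketch of that generalization; `splitOnValidation`, the stage names, and the predicate columns are all hypothetical:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// Split a DataFrame into (passed, failed) on a boolean predicate column.
def splitOnValidation(df: DataFrame, predicate: Column): (DataFrame, DataFrame) =
  (df.filter(predicate), df.filter(!predicate))

// Run the stages in order, feeding each stage the rows that passed the
// previous one and collecting the failed rows per stage.
def runStages(
  df: DataFrame,
  stages: Seq[(String, Column)]
): (DataFrame, Seq[(String, DataFrame)]) =
  stages.foldLeft((df, Seq.empty[(String, DataFrame)])) {
    case ((passed, failed), (stageName, predicate)) =>
      val (p, f) = splitOnValidation(passed, predicate)
      (p, failed :+ (stageName -> f))
  }
```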

I think throwing an error would be the wrong call if only a small percentage of the rows failed the validation. You could maybe have a threshold of rows that need to pass before throwing an error. But saving both the failed and passed results still seems like a good idea.

@nvander1 - I like the idea of adding a validations_performed column with all the validations that were performed. Should it look something like this?

```scala
df.transform(
  DariaValidators.withValidationsPerformed(
    List(
      positiveAge("age"),
      validCountryCode("country")
    )
  )
)
```
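One way the pieces above could fit together is for each validator to return a (name, valid) struct column that `withValidationsPerformed` collects into an array; everything in this sketch is a hypothetical implementation, not existing spark-daria code:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Hypothetical validators: map a column name to a (name, valid) struct column.
def positiveAge(columnName: String): Column =
  struct(lit("positiveAge").as("name"), (col(columnName) > 0).as("valid"))

def validCountryCode(columnName: String): Column =
  struct(lit("countryCode").as("name"), col(columnName).rlike("^[A-Z]{2}$").as("valid"))

object DariaValidators {
  // Collect every validator's result into one array-of-structs metadata column.
  def withValidationsPerformed(validations: List[Column])(df: DataFrame): DataFrame =
    df.withColumn("validations_performed", array(validations: _*))
}
```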

Here is the Target data-validator project BTW.

@MrPowers

For the above example, I'm assuming signatures like this:

```scala
def positiveAge(column: Column): Column = ???
def validCountryCode(column: Column): Column = ???
```

Where the output columns are boolean columns.

Is that what you had in mind?
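For what it's worth, implementations under those signatures might look like the following; the country-code regex is purely illustrative:

```scala
import org.apache.spark.sql.Column

// Boolean-valued columns, one per validation.
def positiveAge(column: Column): Column = column > 0
def validCountryCode(column: Column): Column = column.rlike("^[A-Z]{2}$")
```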

Also, I think there is some overlap with the check feature from spark-constraints, although the mechanism for reporting validation/constraint violations would be different. The semantics would differ a little too: here each row is "annotated" with the status of its validations, whereas spark-constraints currently only supports listing the rows that violate a check constraint, not tagging the rows themselves.

I think they are different enough semantically to use both in different domains however. Kind of like how Spark itself offers similar APIs via RDD, DataFrame, and Dataset.

What do you think?

Also, if we go down the rabbit hole of row-level metadata, why stop at validations? I could also see a use for tracking where data came from when records are merged from multiple data sources (e.g. Project Gutenberg, Google Books, PubMed, ...):

```
{
  "validationsPerformed": [ { "name": ..., "status": ... }, ... ],
  "column_sources": [ { "title": "project gutenberg", "year_published": "google books", ... } ]
}
```

@nvander1 - I agree with "I think they are different enough semantically to use both in different domains however".

I think we have enough of a vision to start writing some code here. I think you're more qualified than I am for this work, so let me know if you're up for writing the code.

If not, no big deal, I'll take a stab at it and send you the PR for your review / comments. Let me know how you'd like to proceed!

FYI, https://github.com/awslabs/deequ also looks interesting

@manuzhang - deequ looks really interesting, thanks for pointing that out.

@nvander1 - definitely take a look at deequ if you haven't already: https://github.com/awslabs/deequ

Ooh deequ sounds awesome.

Closing due to inactivity.

@MrPowers you may want to try https://github.com/actions/stale to automate this.