MrPowers/mack

Brainstorm data quality features

Opened this issue · 6 comments

Constraints are great data quality features that allow users to define filters / rules that identify invalid records.
But they only allow failing on invalid records, and I think we could do better.

Some ideas:

  • Ability to automatically drop invalid rows
  • Ability to automatically mark rows as invalid in target table
    - Ability to write invalid rows to "Quarantine" table
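
To make the three ideas above concrete, here is a minimal sketch in plain Python. It is not mack's API: the function names (`drop_invalid`, `mark_invalid`, `quarantine`), the `is_valid` flag column, and the list-of-dicts row model are all assumptions chosen for illustration. A real implementation would presumably operate on Spark DataFrames and Delta tables.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Rule = Callable[[Row], bool]  # returns True when the row is valid


def drop_invalid(rows: List[Row], rule: Rule) -> List[Row]:
    """Idea 1: keep only the rows that satisfy the rule."""
    return [r for r in rows if rule(r)]


def mark_invalid(rows: List[Row], rule: Rule, flag: str = "is_valid") -> List[Row]:
    """Idea 2: keep every row, but add a boolean flag column (hypothetical name)."""
    return [{**r, flag: rule(r)} for r in rows]


def quarantine(rows: List[Row], rule: Rule) -> Tuple[List[Row], List[Row]]:
    """Idea 3: split rows into (valid, invalid) so the invalid ones
    can be written to a separate "Quarantine" table."""
    valid: List[Row] = []
    invalid: List[Row] = []
    for r in rows:
        (valid if rule(r) else invalid).append(r)
    return valid, invalid


# Hypothetical sample data and rule
rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}]
age_ok = lambda r: r["age"] >= 0

clean = drop_invalid(rows, age_ok)
flagged = mark_invalid(rows, age_ok)
target_rows, quarantine_rows = quarantine(rows, age_ok)
```

All three take the same rule, so the only difference between them is what happens to the rows that fail it.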

WDYT @MrPowers

@robertkossendey - these sound like good suggestions. I'm guessing some external libs would help for this type of functionality (Great Expectations or PyDeequ perhaps), but don't want to add any dependencies to this lib. Let's keep this open as a "meta-issue". When you have ideas for individual functions, feel free to open up a separate issue and we can chat in detail before you put in the work. Thanks!

@MrPowers I wouldn't like to use any other framework tbh. If you're okay with it, I would create a PoC PR that lets you specify a condition; if any record doesn't fulfill that condition, the write fails.
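
A rough sketch of what such a conditional write could look like, in plain Python. The names `write_if_valid` and `InvalidRecordsError` are hypothetical and are not taken from the PoC, and the `writer` callable stands in for whatever actually appends to the target table.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class InvalidRecordsError(ValueError):
    """Raised when at least one row violates the condition (hypothetical name)."""


def write_if_valid(
    rows: List[Row],
    condition: Callable[[Row], bool],
    writer: Callable[[List[Row]], None],
) -> None:
    """Check every row first; call the writer only if all rows pass."""
    invalid = [r for r in rows if not condition(r)]
    if invalid:
        # Fail before anything is written, so the target is left untouched
        raise InvalidRecordsError(
            f"{len(invalid)} row(s) violate the condition; nothing was written"
        )
    writer(rows)


# Usage with an in-memory list standing in for the target table
target: List[Row] = []
non_negative_age = lambda r: r["age"] >= 0

write_if_valid([{"id": 1, "age": 34}], non_negative_age, target.extend)

try:
    write_if_valid([{"id": 2, "age": -5}], non_negative_age, target.extend)
    failed = False
except InvalidRecordsError:
    failed = True
```

Validating the whole batch before calling the writer keeps the write all-or-nothing, which matches the "fail on invalid records" behaviour described above.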

@robertkossendey - yep, PoC PR sounds like a great next step!

@robertkossendey @MrPowers

Hey guys, I actually built a library to mock DLT behavior outside of Databricks: dlt-with-debug

I think I can pull out the expectation mock APIs and add them here in mack.

@souvik-databricks very cool! Maybe you can open up a PR and we can collaborate on that then :)

I will raise the PR on this @robertkossendey