Incorporate referential integrity and data synchronization checks into Deequ's VerificationSuite
rdsharma26 opened this issue · 6 comments
The following two utilities should be part of Deequ's verification suite.
@rdsharma26 could you add more details on the enhancement we are looking at here? I could take a stab at the implementation. Thanks.
Hi,
Here's some context for this issue. Deequ lets you run checks on your data by constructing a VerificationSuite object. You can build a VerificationSuite by calling onData(), which then exposes the addCheck() method. You can call that repeatedly to add multiple checks to your suite, for example:
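import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// `data` is the Spark DataFrame under test
val verificationResult = VerificationSuite()
  .onData(data)
  .addCheck(Check(CheckLevel.Error, "must have 5 rows").hasSize(_ == 5))
  .addCheck(Check(CheckLevel.Error, "must have no nulls").isComplete("id"))
  .run()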
All checks within the same VerificationSuite are processed before Spark is called, and Deequ comes up with a plan to calculate all the necessary statistics without making unnecessary passes over the data.
However, for the comparison operations @rdsharma26 mentions above, there is no Check object and therefore they cannot be added to a VerificationSuite. This means two things:
- There is no way for Deequ to optimize the execution of these comparisons
- The code to run any of these checks looks incongruent with any other Deequ syntax, as you need to directly invoke methods rather than constructing a cohesive suite of tests.
The ask here is to merge the two operations with the standard Deequ APIs, so a user can create a verification suite that contains a mix of cross-dataset and in-dataset tests. This will probably require a bit of refactoring in the VerificationRunBuilder, because unlike any of the Checks we have today, the cross-dataset checks require a reference dataset in addition to the primary dataset (passed using onData(), which returns a VerificationRunBuilder).
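To make the shape of the ask concrete, here is a toy sketch in plain Scala, with no Spark or Deequ dependency. Everything in it is hypothetical and invented for illustration (SketchRunBuilder, SingleDatasetCheck, CrossDatasetCheck, matchRatio); none of these names are Deequ classes, and the real refactoring may look quite different. It only shows the idea of one builder accepting both in-dataset checks and checks that carry their own reference dataset.

```scala
// Toy model only: none of these types exist in Deequ.
object MixedSuiteSketch {
  type Row = Map[String, Any]
  type Dataset = Seq[Row]

  // A check runs either on the primary dataset alone,
  // or on the primary dataset together with a reference dataset.
  sealed trait SketchCheck { def description: String }

  final case class SingleDatasetCheck(
      description: String,
      predicate: Dataset => Boolean) extends SketchCheck

  final case class CrossDatasetCheck(
      description: String,
      reference: Dataset,
      predicate: (Dataset, Dataset) => Boolean) extends SketchCheck

  // Immutable builder: each addCheck returns a new builder, so calls chain.
  final case class SketchRunBuilder(
      primary: Dataset,
      checks: Vector[SketchCheck] = Vector.empty) {

    def addCheck(check: SketchCheck): SketchRunBuilder =
      copy(checks = checks :+ check)

    // Evaluates every check and returns description -> passed.
    def run(): Map[String, Boolean] =
      checks.map {
        case SingleDatasetCheck(d, p)     => d -> p(primary)
        case CrossDatasetCheck(d, ref, p) => d -> p(primary, ref)
      }.toMap
  }

  // Fraction of primary rows whose `column` value also appears
  // in `refColumn` of the reference dataset (1.0 for an empty primary).
  def matchRatio(primary: Dataset, column: String,
                 reference: Dataset, refColumn: String): Double = {
    val allowed = reference.flatMap(_.get(refColumn)).toSet
    if (primary.isEmpty) 1.0
    else primary.count(row => row.get(column).exists(allowed.contains)).toDouble / primary.size
  }

  def main(args: Array[String]): Unit = {
    val orders: Dataset = Seq(
      Map("id" -> 1, "customer" -> "a"),
      Map("id" -> 2, "customer" -> "b"))
    val customers: Dataset = Seq(
      Map("name" -> "a"), Map("name" -> "b"), Map("name" -> "c"))

    val results = SketchRunBuilder(orders)
      .addCheck(SingleDatasetCheck("must have 2 rows", _.size == 2))
      .addCheck(CrossDatasetCheck("every customer must exist", customers,
        (p, r) => matchRatio(p, "customer", r, "name") == 1.0))
      .run()

    results.foreach { case (description, passed) => println(s"$description: $passed") }
  }
}
```

In this sketch the reference dataset travels with the check itself, so onData() would not need a second argument. Whether that fits Deequ's planning of Spark passes is an open design question that this toy model does not address.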
Let me know if that's not clear or you have any follow-up questions.
@mentekid thanks, I can take a stab at this, will circle back once PR is ready.
Hello @VenkataKarthikP and @rdsharma26, is there any update on the implementation of the ReferentialIntegrity check as well?
@chaurasiya I have plans to do it, will open a PR.
Hello @VenkataKarthikP and @rdsharma26, is it still planned to include the Referential Integrity check?