Check isContainedIn does not recognize string in quotes as allowed value
markushc opened this issue · 2 comments
markushc commented
Steps to reproduce:
Run unit test shown below.
import org.scalatest.flatspec.AnyFlatSpec
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.Check
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.checks.CheckStatus

private case class SomeData(data: String)

class IsContainedInQuotesTest extends AnyFlatSpec {

  it should "accept string in quotes as allowed value" in {
    val someData = Seq("'a'")
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.createDataFrame(someData.map(SomeData))

    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "myCheck")
          .isContainedIn("data", someData.toArray)
      )
      .run()

    assert(verificationResult.status == CheckStatus.Success)
  }
}
Expected outcome: Unit test passes, because 'a' is in the list of allowed values and 'a' is the value of the column being checked.
Actual outcome: Unit test fails. It seems this could be related to the allowed values containing quotes.
Deequ version: 2.0.3-spark-3.3
Java version: 11
mentekid commented
Thanks for reporting this. We will look into it.
marcantony commented
To add some context to this, it looks like the isContainedIn check tries to escape single quotes in the allowed values list by replacing ' with ''. Although this works in standard SQL, it seems like a \ needs to be used to escape quotes in Spark SQL: https://spark.apache.org/docs/latest/sql-ref-literals.html#parameters.
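For illustration, here is a minimal sketch (not Deequ's actual code; the object name and the SELECT statements are made up for this example) showing how the two escaping styles behave when evaluated through Spark SQL, using the quoted value 'a' from the report:

import org.apache.spark.sql.SparkSession

object QuoteEscapingSketch extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Standard-SQL style escaping (doubling the single quote), which the check appears to generate.
  // Spark seems to treat the doubled quotes as adjacent string literals and concatenate them,
  // so the literal is read as a rather than 'a'.
  spark.sql("SELECT '''a''' AS value").show()

  // Backslash escaping, as documented for Spark SQL string literals; this yields 'a'.
  spark.sql("SELECT '\\'a\\'' AS value").show()

  spark.stop()
}

A hypothetical fix on the Deequ side might escape with a backslash instead, e.g. value.replace("\\", "\\\\").replace("'", "\\'"), but that is only a sketch of the idea, not the library's code.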