awslabs/deequ

Check isContainedIn does not recognize string in quotes as allowed value

markushc opened this issue · 2 comments

Steps to reproduce:

Run the unit test shown below.

import org.scalatest.flatspec.AnyFlatSpec
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.Check
import com.amazon.deequ.checks.CheckLevel
import com.amazon.deequ.checks.CheckStatus

private case class SomeData(data: String)

class IsContainedInQuotesTest extends AnyFlatSpec {
  it should "accept string in quotes as allowed value" in {
    val someData = Seq("'a'")
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val df = spark.createDataFrame(someData.map(SomeData))
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "myCheck")
          .isContainedIn("data", someData.toArray)
      )
      .run()
    assert(verificationResult.status == CheckStatus.Success)
  }
}

Expected outcome: Unit test passes, because 'a' is in the list of allowed values and 'a' is the value of the column being checked.

Actual outcome: Unit test fails. This seems to be related to the allowed value containing single quotes.

Deequ version: 2.0.3-spark-3.3

Java version: 11

Thanks for reporting this. We will look into it.

To add some context to this, it looks like the isContainedIn check is trying to escape single quotes in the allowed values list by replacing ' with '':

.map { _.replaceAll("'", "''") }

Although doubling the quote works in standard SQL, Spark SQL string literals seem to require a backslash escape (\') instead: https://spark.apache.org/docs/latest/sql-ref-literals.html#parameters.
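To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed) that builds the quoted value list both ways. The object and method names are hypothetical and not part of Deequ's API. The `mkString` wrapping is an assumption about how the allowed values end up in the `IN (...)` predicate. The backslash variant also assumes Spark's default parser settings, since `spark.sql.parser.escapedStringLiterals=true` would change how backslashes are treated.

```scala
object QuoteEscapingSketch {
  // Mirrors the escaping quoted above: each ' becomes '' (standard SQL style).
  def doubledQuotes(values: Seq[String]): String =
    values
      .map(_.replace("'", "''"))
      .mkString("'", "','", "'")

  // Hypothetical alternative: backslash-escape for Spark SQL literals.
  // Backslashes are escaped first so existing ones are not misread as escapes.
  def backslashEscaped(values: Seq[String]): String =
    values
      .map(_.replace("\\", "\\\\").replace("'", "\\'"))
      .mkString("'", "','", "'")

  def main(args: Array[String]): Unit = {
    val allowed = Seq("'a'")
    println(doubledQuotes(allowed))    // prints '''a'''
    println(backslashEscaped(allowed)) // prints '\'a\''
  }
}
```

The sketch uses `replace` rather than `replaceAll` because the latter treats its arguments as a regex and a replacement pattern, which makes backslash handling error-prone. Whether the backslash form actually fixes the failing test would still need to be confirmed against a running Spark session.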