awslabs/deequ

Is Redshift supported as a data source?

jbleduigou opened this issue · 0 comments

Hello,

I have been testing Deequ.
So far I have had mixed results when using Redshift as a data source.

I am using the community spark-redshift library (spark-redshift_2.12-6.1.0-spark_3.4) to load a DataFrame from Redshift.
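
For context, here is a minimal sketch of how such a DataFrame might be loaded. The format name matches the package in the stack trace; the JDBC URL, the S3 temp directory, the credential option, and the table name `event` are placeholders and assumptions, not the exact values used here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("deequ-redshift").getOrCreate()

// Placeholder connection details: replace host, credentials, and bucket.
val df: DataFrame = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/dev?user=<user>&password=<password>")
  .option("dbtable", "event") // assumed table from the AWS sample database
  .option("tempdir", "s3a://<bucket>/tmp/") // staging area for UNLOAD results
  .option("forward_spark_s3_credentials", "true")
  .load()
```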

One example is the uniqueness check on a column, which fails with the following error:

java.sql.SQLException: Exception thrown in awaitResult: 
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:172) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:145) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.UnloadDataToS3(RedshiftRelation.scala:328) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.$anonfun$buildScanFromSQL$1(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at scala.Option.orElse(Option.scala:447) ~[scala-library-2.12.17.jar:?]
        at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.buildScanFromSQL(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:53) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:49) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: cannot cast type boolean to double precision
        at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2613) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2281) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[redshift-jdbc42-2.1.0.23.jar:?]
        at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1878) ~[redshift-jdbc42-2.1.0.23.jar:?]
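
The `Caused by` line, together with the `pushdown.RedshiftScanExec` and `UnloadDataToS3` frames, suggests that a boolean-to-double cast was pushed down into the query sent to Redshift. As a sketch only: assuming Deequ's uniqueness analyzer computes something of the following shape (this is my reading, not confirmed from the Deequ source here), the same aggregation written directly in Spark might reproduce the error without Deequ.

```scala
import org.apache.spark.sql.functions.{col, lit, sum}
import org.apache.spark.sql.types.DoubleType

// Assumes `df` is the DataFrame loaded from Redshift.
// Count occurrences per value, then take the share of values that occur exactly once.
val counts = df.groupBy("eventid").count()
val uniqueness = counts.select(
  // The boolean comparison is cast to double before summing;
  // this is the cast Redshift appears to reject when pushed down.
  sum(col("count").equalTo(lit(1)).cast(DoubleType)) / lit(df.count())
)

uniqueness.show()
```

If this fails in the same way, the problem would lie in how the connector translates the expression rather than in Deequ itself. If the connector version in use exposes an `autopushdown` read option (an assumption worth checking against its README), setting it to `false` might be one way to confirm this.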

The data itself is the Sample Database provided by AWS.
The verification code is as follows:

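    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel}

    // df is the DataFrame loaded from Redshift
    val verificationResult = VerificationSuite()
      .onData(df)
      .addCheck(
        Check(CheckLevel.Error, "Data Quality Checks")
          .isUnique("eventid")
      )
      .run()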

Is using Redshift as a data source supported by Deequ?