Is Redshift supported as a data source?
jbleduigou opened this issue · 0 comments
Hello,
I have been testing Deequ.
So far I have had mixed results when using Redshift as a data source.
I am using the spark-redshift community library (6.1.0 for Spark 3.4) to load a DataFrame from Redshift.
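For context, a load through spark-redshift typically looks something like the sketch below. This is only an illustration: the JDBC URL, S3 temp directory, and IAM role are placeholders, and the `event` table name is an assumption based on the `eventid` column, not a value taken from this report.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("deequ-redshift-test") // hypothetical application name
  .getOrCreate()

// All connection values below are placeholders (assumptions), not values from this report.
val df: DataFrame = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "event") // assumed: the sample table that holds eventid
  .option("tempdir", "s3a://<bucket>/<prefix>/") // staging area used by UNLOAD
  .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<role>")
  .load()
```

The connector unloads query results to the S3 temp directory and reads them back from there, which matches the `UnloadDataToS3` frame in the stack trace.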
One example is the uniqueness check on a column, which fails with the following error:
java.sql.SQLException: Exception thrown in awaitResult:
at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:172) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at io.github.spark_redshift_community.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:145) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.UnloadDataToS3(RedshiftRelation.scala:328) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.$anonfun$buildScanFromSQL$1(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at scala.Option.orElse(Option.scala:447) ~[scala-library-2.12.17.jar:?]
at io.github.spark_redshift_community.spark.redshift.RedshiftRelation.buildScanFromSQL(RedshiftRelation.scala:271) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:53) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at io.github.spark_redshift_community.spark.redshift.pushdown.RedshiftScanExec$$anon$1.call(RedshiftScanExec.scala:49) ~[spark-redshift_2.12-6.1.0-spark_3.4.jar:6.1.0-spark_3.4]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: com.amazon.redshift.util.RedshiftException: ERROR: cannot cast type boolean to double precision
at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2613) ~[redshift-jdbc42-2.1.0.23.jar:?]
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2281) ~[redshift-jdbc42-2.1.0.23.jar:?]
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1886) ~[redshift-jdbc42-2.1.0.23.jar:?]
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1878) ~[redshift-jdbc42-2.1.0.23.jar:?]
The data is the sample database provided by AWS.
The verification code is as follows:
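import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// df is the DataFrame loaded from Redshift
val verificationResult = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "Data Quality Checks")
      .isUnique("eventid") // this check triggers the RedshiftException above
  )
  .run()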
Does Deequ support Redshift as a data source?