awslabs/deequ

Add support for Spark 3.2

alexott opened this issue · 13 comments

As part of "[SPARK-35558] Optimizes for multi-quantile retrieval", Spark 3.2 changed the signature of the ApproximatePercentile.getPercentiles function, which broke Deequ:

NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest.getPercentiles([D)[D
	at com.amazon.deequ.analyzers.ApproxQuantile.fromAggregationResult(ApproxQuantile.scala:84)
	at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult(Analyzer.scala:192)
	at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult$(Analyzer.scala:185)
	at com.amazon.deequ.analyzers.ApproxQuantile.metricFromAggregationResult(ApproxQuantile.scala:50)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.successOrFailureMetricFrom(AnalysisRunner.scala:362)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$5(AnalysisRunner.scala:330)
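
For anyone wondering why this surfaces as a runtime NoSuchMethodError rather than a compile error: the JVM resolves a method by its full descriptor, which includes the return type, and the `([D)[D` in the message is the descriptor for "takes double[], returns double[]". The sketch below uses two hypothetical stand-in classes (DigestOld and DigestNew are not Spark classes) and assumes the new version returns a Seq[Double] instead of an Array[Double]; the exact new signature should be checked against the Spark 3.2 source.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
object DescriptorDemo {
  // Shape that matches the descriptor in the error message: ([D)[D
  class DigestOld {
    def getPercentiles(p: Array[Double]): Array[Double] = p.sorted
  }

  // Assumed new shape: same name and parameter, different return type.
  class DigestNew {
    def getPercentiles(p: Array[Double]): Seq[Double] = p.sorted.toSeq
  }

  // Look the method up by name and parameter type, then report its return type.
  def returnTypeOf(cls: Class[_]): String =
    cls.getMethod("getPercentiles", classOf[Array[Double]]).getReturnType.getName

  val oldReturn: String = returnTypeOf(classOf[DigestOld]) // "[D", i.e. double[]
  val newReturn: String = returnTypeOf(classOf[DigestNew]) // a Seq type, not "[D"
}
```

Bytecode compiled against the old shape keeps asking for a method that returns double[], so it fails at link time against the new class even though the source would recompile with little or no change. That is why a Deequ jar built against an earlier Spark has to be rebuilt (and adapted) for 3.2.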

Thanks for sharing the issue. My understanding is that Spark 3.2 is not yet released. We'll add support for Spark 3.2 when it is released.

As an update, Spark 3.2 has been released - are there any plans to support this?

Yes. We're working on the release.

Hi TammoR, do you have a timeline for when Spark 3.2 support will be released? If there is a specific branch with Spark 3.2 support, please share the link.

Hi sdandey. We're working on it on this branch: https://github.com/awslabs/deequ/tree/tammruka/2.0.0-spark-3.2.0
We do have limited bandwidth at the moment. If you're able to contribute to this branch towards supporting Spark 3.2, you would be most welcome to.

Hi @TammoR

Correlation analyzer failing in tammruka/2.0.0-spark-3.2.0 #399

Is there any update on this? I just checked out the branch and hit this issue:

[ERROR] ## Exception when compiling 107 sources to /home/joesan/Projects/Private/scala-projects/deequ/target/classes
java.io.IOException: Cannot run program "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/javac" (in directory "/home/joesan/Projects/Private/scala-projects/deequ"): error=2, No such file or directory

Managed to fix a few of the errors. The following error remains after updating pom.xml to Scala version 2.13 and Spark version 3.2:

[INFO] compiling 106 Scala sources and 1 Java source to /home/joesan/Projects/Private/scala-projects/deequ/target/classes ...
[ERROR] /home/joesan/Projects/Private/scala-projects/deequ/src/main/scala/com/amazon/deequ/analyzers/QuantileNonSample.scala:55: missing argument list for method to in trait IterableOnceOps
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `to _` or `to(_)` instead of `to`.
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.448 s
[INFO] Finished at: 2021-12-19T21:47:10+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.5.6:compile (scala-compile-first) on project deequ: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:4.5.6:compile failed: Compilation failed: InterfaceCompileFailed -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
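
A note on the `to` error above: it is typical of code written for the Scala 2.12 collections being compiled on 2.13, where the conversion method takes the target collection's companion as a value argument instead of a type argument. The snippet below is a minimal sketch of that change on made-up data. It is not the actual code at QuantileNonSample.scala:55, which may need a different fix, and it compiles on Scala 2.13 only (or on 2.12 with the scala-collection-compat library).

```scala
import scala.collection.mutable.ArrayBuffer

object ToConversionDemo {
  // Made-up sample data, for illustration only.
  val xs: Seq[Double] = Seq(3.0, 1.0, 2.0)

  // Scala 2.12 style, which no longer compiles on 2.13:
  //   val buf = xs.to[ArrayBuffer]

  // Scala 2.13 style: pass the companion object as an argument.
  val buf: ArrayBuffer[Double] = xs.to(ArrayBuffer)
}
```

If the project has to build for both Scala versions, a version-neutral alternative is to construct the target collection directly, for example `ArrayBuffer(xs: _*)`.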

tanvn commented

Hi @TammoR and @joesan
JFYI, Spark 3.2.1 has been released.

If we use Spark 3.2.1 and skip the tests, we can compile and produce an artifact.

First, change the Spark version from 3.2.0 to 3.2.1 on the line below:
https://github.com/awslabs/deequ/blob/tammruka/2.0.0-spark-3.2.0/pom.xml#L21

After that, the scalastyle check makes the build fail:

error file=/Users/JP28431/deequ/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulApproxQuantile.scala message=File line length exceeds 100 characters line=34
error file=/Users/JP28431/deequ/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulApproxQuantile.scala message=File line length exceeds 100 characters line=119

To fix (or bypass) this, you can format the code with scalafmt or set the following scalastyle plugin settings to false:
https://github.com/awslabs/deequ/blob/tammruka/2.0.0-spark-3.2.0/pom.xml#L215-L216

Then build with mvn clean install -DskipTests, and the build finishes successfully.
I had to skip the tests because some of them are failing. I think fixing those tests would resolve this issue.

tanvn commented

Hi @TammoR
I just created a PR for this issue:
#416
Could you please take a look?
It seems I do not have permission to set Reviewers and Assignees, so I would appreciate it if you could take care of that too 🙇

tanvn commented

Hi @TammoR
Thank you for merging the PR!
May I ask if there are any remaining blockers for Spark 3.2?
I would be very grateful if you could share the current status of this issue 🙇

Hi @tanvn
Thanks for the great work on this issue! We merged the code with master.

A new Deequ version 2.0.1-spark-3.2 is now available.

Hi @lange-labs @TammoR
It looks like this ticket is still open, and we're now on Spark 3.3.0. Is that an oversight? Is it OK to open a compatibility request for 3.3.0?