awslabs/deequ

Malformed JARs & non-standard naming

MrPowers opened this issue · 16 comments

Thanks for building this project. Very valuable for the community.

Some Scala conventions aren't being properly followed, so the Spark 3 JAR files are unusable for many users (they should work for people attaching the JAR directly to a cluster, but they won't resolve as a libraryDependency).

spark-daria_2.12-0.38.2.jar shows the standard JAR naming convention: artifact name, Scala binary version, then project version. The standard name for the Deequ JAR would be deequ_2.12-1.1.0.jar (not deequ-1.1.0_spark-3.0-scala-2.12.jar).

The best way to release Spark JARs is to cross compile with different Scala versions for certain Spark versions & periodically drop Scala versions as Spark stops supporting them (see the build.sbt sketch after this list), specifically:

  • Spark 2.3 JAR file with Scala 2.11
  • Spark 2.4 JAR files cross compiled with both Scala 2.11 and Scala 2.12
  • Spark 3 JAR files with Scala 2.12
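
Here's a minimal build.sbt sketch of that cross-compilation setup for the Spark 2.4 row; the version numbers and setting values are illustrative, not taken from Deequ's actual build:

```scala
// Hypothetical build.sbt: one release, cross-compiled against
// both Scala versions that Spark 2.4 supports.
name := "deequ"
version := "1.1.0"
scalaVersion := "2.12.10"
crossScalaVersions := Seq("2.11.12", "2.12.10")
// Spark is "provided": the cluster supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
```

Running sbt +publishLocal then produces both deequ_2.11-1.1.0.jar and deequ_2.12-1.1.0.jar.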

The Spark 3 JAR files currently depend on Scala 2.11 libs, so dependency resolution will be impossible with SBT.

[Screenshot: dependency listing for the Spark 3 JAR, showing Scala 2.11 libraries]

It should be relatively easy to switch this project to standard publishing and get it working for the community. We can just copy the best practices from projects like Delta Lake. Here's what the Maven artifacts should look like:

[Screenshot: Delta Lake's Maven artifact listing]

Thanks again for building this excellent project.

Hi. Thanks for raising this point, we are happy to implement the changes needed to make the project usable again!

Can you please clarify how we could distinguish between the different Spark versions? In your example, only the project version (e.g. 1.1.0) and the Scala version (e.g. 2.11) are specified. There are multiple Spark versions that depend on e.g. Scala 2.11; how could we deal with this ambiguity?
Thanks!

@twollnik - Thanks for the help and good question!

Let's imagine the Scala / Spark dependency combinations of some Deequ users:

  • Scala 2.11 / Spark 2.4.5
  • Scala 2.12 / Spark 2.4.6
  • Scala 2.11 / Spark 2.4.5 & Scala 2.12 / Spark 2.4.5 (user that's cross compiling)
  • Scala 2.12 / Spark 3.0.1

There should be a single Deequ SBT import statement that'll work for all these users. Here's the spark-daria import statement that works for all these users:

libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.39.0"

spark-daria follows the Scala JAR file naming conventions, and the SBT %% operator automatically picks the JAR file that was compiled against the project's Scala version.
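
A quick illustration of what %% does (the second line is the manual equivalent, assuming a project on Scala 2.12):

```scala
// These two declarations are equivalent when scalaVersion := "2.12.10":
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.39.0"
libraryDependencies += "com.github.mrpowers" % "spark-daria_2.12" % "0.39.0"
```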

spark-daria v0.39.0 was the last version that supported Scala 2.11. After Spark 3 had been out for a few months, I bumped the spark-daria Spark dependency to 3.0.1 and started publishing only Scala 2.12 JAR files.

Once a Spark version that supports Scala 2.13 is released, I'll start cross compiling spark-daria again (with Scala 2.12 and Scala 2.13 JAR files).

I make conscious decisions in the spark-daria code to write Spark code that works with multiple Spark versions. For example, the regexp_extract_all SQL function will be added in Spark 3.2, and I won't use it in spark-daria because that'd be a breaking change for certain library users. Even once spark-daria builds against Spark 3.2, I'll still want to publish new JAR files that work for Spark 3.0.1 users.

Now that spark-daria has completely dropped support for Spark 2, I can finally start using Spark features that were added in Spark 2.4 (e.g. array_join). I wouldn't have used this function while Spark 2 was still supported because that would have been breaking for Spark 2.3 users.

Bumping the project major version is the best way to indicate when a project will no longer support a Spark major version. spark-daria 0.x supports Spark 2 and spark-daria 1.x supports Spark 3. When Spark 4 comes out, I'll bump to spark-daria 2.x.

Here are my recommendations:

  • Make a Deequ 1.2 release that uses Spark 2.4.5 and is cross compiled with Scala 2.11 & Scala 2.12 (this should follow standard Scala JAR file naming conventions)
  • Then bump Deequ's Spark version to 3.0.1, drop Scala 2.11, and make a Deequ 1.3 release (still following Scala JAR file naming conventions)
  • Once Spark starts supporting Scala 2.13, start cross compiling again
  • Make the Deequ 2.x release when Spark 4 is released

Let me know if there is anything else I can do to help. It took me years to figure out how to properly maintain Spark libraries, and I've also gone down the path of trying to build a 3-dimensional Scala / Spark / project release process (it didn't go well). I know how hard this is. @alexott - please let us know if you have anything else to add / if you agree.

Thanks so much for the in-depth explanation! I will follow up with the team and get back to you in case of further questions :)

completely agree @MrPowers

Hi @MrPowers,

thanks again for your input on this. We don't want to drop Spark 2.4 support any time soon, because we have existing jobs that use Spark 2.4 and can't be migrated to Spark 3. We would like to have the option to use new Deequ features in those existing jobs as they become available.

We would suggest the following:

  • We keep the Spark version in the Deequ version name, e.g. "1.2.0-spark-3.0" or "1.2.0-spark-2.4"
  • We change the JAR naming to the following: "deequ_2.12-1.2.0-spark-3.0" or "deequ_2.12-1.2.0-spark-2.4". This should be sbt-compatible (sketched below), e.g. libraryDependencies += "com.amazon.deequ" %% "deequ" % "1.2.0-spark-3.0"
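
A sketch of how that scheme would resolve, assuming a Spark 3 project on Scala 2.12 (the coordinates are illustrative):

```scala
scalaVersion := "2.12.10"
// %% appends the Scala binary version to the artifact name, so this
// resolves to com.amazon.deequ:deequ_2.12:1.2.0-spark-3.0
libraryDependencies += "com.amazon.deequ" %% "deequ" % "1.2.0-spark-3.0"
```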

Does this work for you and what are your thoughts on this approach?

@twollnik - your suggested approach should work 😄

Let me know when you publish those updated JARs and I can do some testing on my end to make sure everything is working.

Thanks for following up and maintaining this project!!

Thanks so much! ETA is end of this week or early next week.

Hi @MrPowers (cc @tdhd),

I performed a couple of releases:

| deequ version   | artifact name                  |
|-----------------|--------------------------------|
| 1.2.1-spark-3.0 | deequ_2.12-1.2.1-spark-3.0.jar |
| 1.2.1-spark-2.4 | deequ_2.12-1.2.1-spark-2.4.jar |
| 1.2.1-spark-2.3 | deequ_2.11-1.2.1-spark-2.3.jar |
| 1.2.1-spark-2.2 | deequ_2.11-1.2.1-spark-2.2.jar |

One issue that I came across is that Maven does not support publishing multiple artifacts under the same version name. So the idea that we had before does not seem to work as desired: we would need two artifacts for the version "1.2.1-spark-2.4".

Anyway, please let me know if you can import the artifacts that I did publish as expected, and then we'll think about how to support multiple Scala versions for the same Spark version (maybe this isn't even necessary, as Scala 2.11 support has been officially deprecated since Spark 2.4.1 and the current Deequ release uses Spark 2.4.7). Please note that it can take a couple of days for the artifacts to become available on Maven Central (link to maven central repository). Thanks again for being so active on this issue!
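
To make the clash concrete, here are the two imports a cross-compiled Spark 2.4 release would call for under the same version string (coordinates illustrative):

```scala
// The cross-compiled 2.4 release would need both of these artifacts,
// which our current release process can't publish under one version:
libraryDependencies += "com.amazon.deequ" % "deequ_2.11" % "1.2.1-spark-2.4"
libraryDependencies += "com.amazon.deequ" % "deequ_2.12" % "1.2.1-spark-2.4"
```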

Edit: I will submit a PR with README changes once we have established that the releases can be imported in the desired way.

I can see the JAR file here.

[Screenshot: the deequ 1.2.1-spark-3.0 artifact listed on Maven Central]

Created an example app and am trying to access the lib like this:

libraryDependencies += "com.amazon.deequ" % "deequ" % "1.2.1-spark-3.0"

It's not quite working, but I'm pretty tired, so I might be messing something up. I wanted to drop you a link to the example repo so you can easily clone it and experiment on your end ;)

Hi Matthew,

Thanks so much for setting up the example project and for testing the import! I'm getting errors for the spark-daria import, the deequ import seems to work fine. Can you confirm this finding?

Best, Tom

@twollnik - I'm not getting any errors with the spark-daria import, and sbt test works when I comment out the deequ import and the VerificationChecksSpec file. Here's the error I'm getting when the Deequ dependency is included:

~/D/c/m/deequ-example ❯❯❯ sbt test
[info] welcome to sbt 1.4.9 (AdoptOpenJDK Java 1.8.0_272)
[info] loading global plugins from /Users/powers/.sbt/1.0/plugins
[info] loading settings for project deequ-example-build from plugins.sbt ...
[info] loading project definition from /Users/powers/Documents/code/my_apps/deequ-example/project
[info] loading settings for project deequ-example from build.sbt ...
[info] set current project to deequ-example (in build file:/Users/powers/Documents/code/my_apps/deequ-example/)
[info] Updating 
[info] Resolved  dependencies
[warn] 
[warn] 	Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-sql_:
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.apache.spark/spark-sql_/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/apache/spark/spark-sql_//spark-sql_-.pom
[error] Error downloading org.apache.spark:spark-core_:
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.apache.spark/spark-core_/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/apache/spark/spark-core_//spark-core_-.pom
[error] Error downloading org.scalanlp:breeze_:0.13.2
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.scalanlp/breeze_/0.13.2/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/scalanlp/breeze_/0.13.2/breeze_-0.13.2.pom
[error] 	at lmcoursier.CoursierDependencyResolution.unresolvedWarningOrThrow(CoursierDependencyResolution.scala:258)
[error] 	at lmcoursier.CoursierDependencyResolution.$anonfun$update$38(CoursierDependencyResolution.scala:227)
[error] 	at scala.util.Either$LeftProjection.map(Either.scala:573)
[error] 	at lmcoursier.CoursierDependencyResolution.update(CoursierDependencyResolution.scala:227)
[error] 	at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:60)
[error] 	at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:53)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:103)
[error] 	at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:73)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$20(LibraryManagement.scala:116)
[error] 	at scala.util.control.Exception$Catch.apply(Exception.scala:228)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:116)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:97)
[error] 	at sbt.util.Tracked$.$anonfun$inputChangedW$1(Tracked.scala:219)
[error] 	at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:130)
[error] 	at sbt.Classpaths$.$anonfun$updateTask0$5(Defaults.scala:3525)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error] 	at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error] 	at sbt.Execute.work(Execute.scala:291)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:748)
[error] (update) sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-sql_:
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.apache.spark/spark-sql_/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/apache/spark/spark-sql_//spark-sql_-.pom
[error] Error downloading org.apache.spark:spark-core_:
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.apache.spark/spark-core_/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/apache/spark/spark-core_//spark-core_-.pom
[error] Error downloading org.scalanlp:breeze_:0.13.2
[error]   Not found
[error]   Not found
[error]   not found: /Users/powers/.ivy2/localorg.scalanlp/breeze_/0.13.2/ivys/ivy.xml
[error]   not found: https://repo1.maven.org/maven2/org/scalanlp/breeze_/0.13.2/breeze_-0.13.2.pom
[error] Total time: 1 s, completed Apr 26, 2021 7:53:39 AM

I will try to get someone else to try on their machine and see what they get 🙃

No need, I think I found the reason. Thanks for the quick follow-up.

Hi @MrPowers,
I updated the pom.xml and performed releases for Spark 2.2.x to 3.0.x. The new releases are called 1.2.2-spark-... I will check whether the releases are importable on Monday. I also opened a PR with README changes and the changes to the pom.xml: #361. If you have the time, it would be great if you could take a look.

@twollnik - Deequ version 1.2.2-spark-3.0 is working on my end! 🎊 🥳

I can add the dependency to the build.sbt file and run sbt test in the deequ-example project.
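
For reference, the import in the example project is just the earlier one-liner with the new version string (assuming, as in the attempt above, that the artifact is published without a Scala suffix in its artifactId):

```scala
// build.sbt sketch; assumes com.amazon.deequ:deequ:1.2.2-spark-3.0
libraryDependencies += "com.amazon.deequ" % "deequ" % "1.2.2-spark-3.0"
```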

Now I'm ready to set out for my original objective - writing a blog post on how to use Deequ 😉

nfx commented

Can we also make sure that releases are published on GitHub as well? That makes it easier to subscribe to notifications and read changelogs.

@MrPowers This is great, thanks so much for testing the release and also thanks for all your input and support along the way! I am looking forward to reading your blog post :)

@nfx absolutely! I will create the releases when merging PR #361.