awslabs/deequ

[FEATURE] Cross-building via Mill

nightscape opened this issue · 5 comments

Is your feature request related to a problem? Please describe.
Currently, Deequ only gets published for a small subset of the possible Scala version × Spark version combinations.
People are running into issues, or are trying to get PRs merged that change these versions (which of course would hurt people using different version combinations).

Describe the solution you'd like
I'm the maintainer of spark-excel, a Spark library for reading and writing Excel files.
I'm using Mill to build the library against an extensive set of combinations of Spark and Scala versions.
The corresponding build definition currently has ~160 lines of code, which is less than half of Deequ's pom.xml.

Li Haoyi, the inventor of Mill, has published a very interesting blog post on why he developed Mill.
From a user perspective, Mill is nice because

  • it has a very simple and familiar mental model (just traits, objects and defs with a special T class doing all the magic)
  • it allows writing custom tasks with a minimal amount of overhead
  • it compiles code faster than most other solutions
  • it has cross-building built in, not only for Scala versions, but along any dimension you want
  • it is actively maintained
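
To make the cross-building point concrete, here is a minimal sketch of a two-axis cross build, assuming Mill 0.11-style syntax (`Cross.Module2`, `crossValue`, `crossValue2`). The module name `deequ`, the version lists, and the single Spark dependency are illustrative assumptions, not Deequ's actual build or support matrix.

```scala
// build.sc: hypothetical sketch, assuming Mill 0.11-style cross modules
import mill._, scalalib._

// Illustrative versions only; not Deequ's actual support matrix.
val scalaVersions = Seq("2.12.18", "2.13.12")
val sparkVersions = Seq("3.3.2", "3.4.1")

// Every (Scala, Spark) pair becomes one instance of the cross module.
val crossMatrix = for {
  scalaV <- scalaVersions
  sparkV <- sparkVersions
} yield (scalaV, sparkV)

object deequ extends Cross[DeequModule](crossMatrix)

trait DeequModule extends Cross.Module2[String, String] with ScalaModule {
  // First axis: Scala version. Second axis: Spark version.
  def scalaVersion = crossValue
  def sparkVersion: String = crossValue2

  // Spark is expected to be provided at runtime, so it is compile-only here.
  def compileIvyDeps = Agg(ivy"org.apache.spark::spark-sql:$sparkVersion")
}
```

Under these assumptions, a single combination would be addressed roughly as `mill 'deequ[2.13.12,3.4.1].compile'`. Incompatible pairs could be dropped by adding an `if` guard to the `for` comprehension.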

Would it be conceivable to give Mill a try for cross-building Deequ against a wider set of Scala and Spark versions?
If so, what constraints would have to be met?
I would be willing to create a PR for this if there's a realistic chance to get it merged.

Describe alternatives you've considered
The net.alchim31.maven scala-maven-plugin does not support cross-building directly. There are alternatives and extensions though, e.g.

hygt commented

I have an internal fork at work that is essentially doing this for a small matrix of Spark and Scala versions. It would be great if this was done upstream.

I don't have a strong opinion on Mill vs sbt, but I also feel Maven is clearly inadequate here (and in the Spark ecosystem at large).

Both SBT and Mill work much better for Scala projects than Maven does.
SBT is ok for simple things, or if you invest a lot of time into understanding its underlying model.
Mill just works out of the box, and even though I have only spent about ¼ of the time with it compared to SBT, I already consider myself more proficient and productive with it than I ever was with SBT.

hygt commented

I use both, but sbt builds are still better supported by IntelliJ IDEA, and the toolchain is easier to bootstrap in a corporate environment with proxies and Maven mirrors. This point is important to people who aren't that deeply invested in the Scala ecosystem, and this is why Maven is so popular around Spark despite being objectively the wrong tool for the job here.

Deequ is a fairly simple project to build (my current build.sbt is 35 LoC). Even if you add support for cross Scala/Spark versions, the Mill build won't be significantly simpler.
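
For comparison, a cross-version sbt build can stay similarly small. The following is a hypothetical sketch, not the build.sbt mentioned above: the `spark.version` system property, the version numbers, and the version-suffix scheme are all assumptions made for illustration.

```scala
// build.sbt: hypothetical sketch, not the 35-line build.sbt mentioned above

// The Spark version is picked per invocation via a JVM system property,
// e.g. `sbt -Dspark.version=3.4.1 +test`; the default is an arbitrary choice.
val sparkVersion = sys.props.getOrElse("spark.version", "3.3.2")

// "3.4.1" -> "3.4": used to keep artifacts for different Spark lines apart.
val sparkBinaryVersion = sparkVersion.split('.').take(2).mkString(".")

lazy val root = (project in file("."))
  .settings(
    name := "deequ",
    scalaVersion := "2.12.18",
    // Prefixing a task with `+` runs it once per Scala version listed here.
    crossScalaVersions := Seq("2.12.18", "2.13.12"),
    // Illustrative versioning scheme only.
    version := s"0.1.0-SNAPSHOT-spark-$sparkBinaryVersion",
    // Spark is supplied by the cluster at runtime, hence the Provided scope.
    libraryDependencies +=
      "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
  )
```

Here the Scala axis is handled by sbt's built-in `crossScalaVersions`, while the Spark axis is driven from outside, for example by a CI matrix that sets `-Dspark.version` for each run.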

@hygt SBT is definitely fine as well and a big step forward from Maven!
Would you mind opening a PR with your SBT implementation?

@nightscape @hygt Thanks for the helpful information on Mill and SBT. We did some work last year to move our build to SBT, but we did not create a PR for it.

If possible, as a first step, can we have the SBT implementation side by side with the Maven implementation? That way, we can get SBT builds integrated into the project quickly and we can fall back to the Maven build for deploying.