MrPowers/spark-daria

Prepping for Spark 3

Closed this issue · 12 comments

Databricks Runtime 7 is using Spark 3... the writing is on the wall... Spark 3 will be released soon!

The goal of this post is to outline how spark-daria should proceed when Spark 3 is released. I'd like to hear thoughts from the other project maintainers.

Looks like Spark only made one 1.x release after Spark 2.0.0 was released:

[screenshot: Spark release history, showing only one 1.x release after 2.0.0]

Other Spark libraries like spark-testing-base did a good job supporting Spark 1.x long after Spark 2.0 was released.

We'll be able to delete a bunch of spark-daria collection functions when Spark 3 is released.
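For example (to illustrate, not an exhaustive list), Spark 3 adds higher-order functions like exists and transform directly to org.apache.spark.sql.functions, covering ground the daria collection helpers used to fill. A rough sketch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(Seq(1, -2, 3), Seq(4, 5)).toDF("nums")

// Spark 3 ships these higher-order functions natively, so custom
// collection helpers for them can be deleted:
df.withColumn("has_negative", exists(col("nums"), _ < 0))
  .withColumn("doubled", transform(col("nums"), _ * 2))
  .show()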

It's unlikely my brain will be able to handle maintaining Spark 2 legacy stuff and new Spark 3 stuff. I am leaning towards a spark-daria feature freeze and diverting all dev resources to shoring up the final Spark 2 release.

Once Spark 3 is released, I'd like to delete functions that are no longer needed and rethink functions that have been added along the way. This library was started in February 2017 when I knew nothing about Spark or Scala. There's bound to be some cruft that should be deleted.

I will get a Spark 3 branch going now to see what the upgrade will be like. Once the Spark 3 branch is merged in, I will try to get this library ready for a 1.0 release and a stable API.

Definitely interested in thoughts / comments / suggestions from the other maintainers / interested community members.

Agreed. We can make sure we don't migrate the less-used parts, and refactor others as we go.

For example, I was wondering why you use a local Spark session here instead of accepting it as a dependency.
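Something along these lines (a made-up example; withGreeting isn't an actual daria function):

import org.apache.spark.sql.{DataFrame, SparkSession}

// building a session inside the helper hardcodes a local master:
def withGreetingLocal(): DataFrame = {
  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._
  Seq("hello").toDF("greeting")
}

// accepting the session as an implicit dependency lets callers supply
// whatever session their environment already provides:
def withGreeting()(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  Seq("hello").toDF("greeting")
}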

I'd suggest we adopt the same maintenance policy as Spark: support 2.4.x and 3.x. It will take a while for people to upgrade to Spark 3, and this library will stay useful in the meantime (arguably more useful, since we can equip them with features from Spark 3). That said, it sounds good to get rid of legacy stuff older than those versions.

@manuzhang - using the Spark maintenance policy sounds like a good approach. Thanks!

Do you know where Spark lists the versions that are still being maintained? Node has a nice release schedule (well, nicely published, not nice that it changes so much). I did some Googling and couldn't find anything like this for Spark.

Thanks again for the great ideas.

Is a version of this available for Spark 3.0.0? Am eager to try it out.

@MrPowers is there any ETA on a 3.0 compatible version?

@dbeavon @darrenhaken - thanks for reaching out. The spark-daria JAR files that are compiled with Scala 2.12 can be used with Spark 3, here's an example.

scalaVersion := "2.12.10"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.0.0" % "provided"

libraryDependencies += "mrpowers" % "spark-daria" % "0.37.1-s_2.12" % "test"
libraryDependencies += "MrPowers" % "spark-fast-tests" % "0.21.1-s_2.12" % "test"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"

Let me know if the Scala 2.12 JAR works for you!

@MrPowers I will test this on Monday, thanks for getting back to me with a solution. Fingers crossed

Sorry for the basic question. What does the % "test" part do?

I'm trying to use this from within jupyter lab. I specify the dependency with the interp.load.ivy command using the following identifiers:
org="mrpowers" name="spark-daria" rev="0.37.0-s_2.12"

... and it is trying to download from the following repository:
https://mvnrepository.com/artifact/mrpowers/spark-daria/0.37.0-s_2.12

... but I get an error. The error is a bit obscure.
java.lang.Exception: Failed to resolve ivy dependencies:Error downloading mrpowers:spark-daria:0.37.0-s_2.12
not found: C:\Users\dbeavon\.ivy2\local\mrpowers\spark-daria\0.37.0-s_2.12\ivys\ivy.xml

For now I think I'll just try to download the jar itself and explicitly load it from a local directory. That may work better than relying on ivy/maven. I'll let you know if it doesn't work.
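If it's a repository problem, maybe something like this would work in the almond kernel (just a guess that the -s_2.12 artifacts live on the Spark Packages repo rather than Maven Central):

import coursierapi.{Dependency, MavenRepository}

// guess: register the Spark Packages repo before resolving, since the
// -s_2.12 builds don't appear to be on Maven Central
interp.repositories() ++= Seq(
  MavenRepository.of("https://dl.bintray.com/spark-packages/maven")
)
interp.load.ivy(Dependency.of("mrpowers", "spark-daria", "0.37.0-s_2.12"))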

Just a quick follow-up. I got the jar to work by just downloading it manually.

Given that jupyter lab wasn't able to download that dependency, I also tested from intellij/sbt, and it gives a similar error message:

[error] (update) sbt.librarymanagement.ResolveException: Error downloading mrpowers:spark-daria:0.37.1-s_2.12
[error] Not found
[error] Not found
[error] not found: C:\Users\dbeavon\.ivy2\local\mrpowers\spark-daria\0.37.1-s_2.12\ivys\ivy.xml
[error] not found: https://repo1.maven.org/maven2/mrpowers/spark-daria/0.37.1-s_2.12/spark-daria-0.37.1-s_2.12.pom
[error] (ssExtractDependencies) sbt.librarymanagement.ResolveException: Error downloading mrpowers:spark-daria:0.37.1-s_2.12
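Looking at that log, sbt only tried the local ivy cache and Maven Central. If the -s_2.12 builds are hosted on the Spark Packages repository (just a guess on my part), adding that resolver might fix the lookup:

// guess: point sbt at the Spark Packages repo that hosts the -s_2.12 builds
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"

libraryDependencies += "mrpowers" % "spark-daria" % "0.37.1-s_2.12" % "test"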

Created a Spark 3 branch: #135

Going to make a major 1.0 release once the project is officially switched over to Spark 3.

The project will stop adding new features for Spark 2, but the old JAR files will remain available for anyone who's still on Spark 2.

Let me know if you have any feedback / comments on the plan!