dfdx/Spark.jl

Roadmap


dfdx commented

This is a meta-issue to track the progress of the package's development.

API:

  • Basic JuliaRDD
  • Communication between Spark and Julia worker
  • Typed RDD
  • Core RDD methods (e.g. map_partitions_with_index, collect, etc.; see the usage sketch after this list)
  • Custom data formats
  • parallelize
  • Repartitioning functions
  • Sampling functions
  • Broadcasting variables
  • Named functions
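
For illustration, here is a minimal sketch of how several of these pieces (parallelize, map_partitions_with_index, collect) might be combined. The setup calls and exact signatures are assumptions based on the Python-like API the package mirrors, so treat this as a sketch rather than the confirmed interface:

    using Spark
    Spark.init()                       # start the JVM bridge (assumed entry point)
    sc = SparkContext(master="local")  # local master, convenient for experimentation

    # distribute a local collection as an RDD
    rdd = parallelize(sc, 1:1000)

    # transform each partition, with access to its partition index
    shifted = map_partitions_with_index(rdd, (idx, it) -> (x + idx for x in it))

    # bring the results back to the driver
    result = collect(shifted)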

Masters (see the master-URL sketch after this list):

  • local
  • Standalone
  • YARN / Client
  • YARN / Cluster
  • Mesos / Client
  • Mesos / Cluster
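
These modes correspond to standard Spark master URLs. A hedged sketch of selecting them (the master keyword argument is an assumption carried over from the RDD example above; the URL formats themselves are standard Spark 1.x conventions):

    # local mode
    sc = SparkContext(master="local")

    # standalone cluster (7077 is Spark's default standalone port)
    sc = SparkContext(master="spark://master-host:7077")

    # YARN, client mode (cluster mode is typically selected via spark-submit)
    sc = SparkContext(master="yarn-client")

    # Mesos (5050 is the default Mesos master port)
    sc = SparkContext(master="mesos://master-host:5050")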

Stability:

  • Tests (requires parallelize)

Just wanted to tell you guys to keep up the good work! I look forward to seeing this project completed, and I'll be happy to promote it for you once it is ready for a larger audience.

dfdx commented

I don't really think there's a notion of "complete" for this project. Apache Spark evolves all the time, and so does Spark.jl, but it's unlikely that Spark.jl will ever cover all the features of the Java/Scala version (note that the much more developed SparkR covers perhaps only half of the Scala API, and even PySpark doesn't reach 100% coverage).

The approach we follow is to add features that are most often asked for. If you have something in mind, please don't hesitate to open an issue.

dfdx commented

Closing this, as the roadmap is terribly outdated.

Has this roadmap been updated?

dfdx commented

Not quite. This roadmap was created for Spark 1.x and the RDD interface, while the current approach is to wrap the DataFrame API. Is there a specific requirement you'd like to know about?
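
For anyone finding this thread now, a minimal sketch of the DataFrame-style usage, assuming the PySpark-like builder interface of recent Spark.jl versions (check the current README for the exact calls; Spark.init() may not be required in all versions):

    using Spark
    Spark.init()  # initialize the JVM side; possibly unnecessary in newer versions

    # PySpark-style session builder
    spark = SparkSession.builder.appName("example").master("local").getOrCreate()

    # read a JSON file into a DataFrame and inspect it
    df = spark.read.json("people.json")
    df.show()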