rajasekarv/vega

Tracking issue: Implementation of missing core RDD ops


By core RDD ops we mean those that originate in the original Apache Spark from SparkContext and/or the base RDD class and friends:
SC:

  • range
  • filter
  • randomSplit
  • sortBy
  • groupBy
  • keyBy
  • zipPartitions
  • intersection
  • pipe
  • zip
  • substract
  • treeAggregate
  • treeReduce
  • countApprox
  • countByValue
  • countByValueApprox
  • min and max
  • top
  • takeOrdered
  • isEmpty

Non-goals for this tracking issue are any I/O-related ops, as we are tracking those elsewhere and doing things a little differently:

  • textFile
  • wholeTextFiles
  • binaryFiles / binaryRecords
  • Hadoop* family of methods

Intersection completed in #66

range done in #82

@iduartgomez - Isn't substract a misspelling of subtract ?

What would the subtract operation entail? Can someone give an example?

Doc: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#subtract(org.apache.spark.rdd.RDD)
Example:

  • I have a list of customers that I want to advertise to
  • I have a list of angry customers who have said "DON'T TALK TO ME!"

    email_list_rdd = customers_rdd.subtract(angry_rdd)
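The semantics can be sketched in plain Rust over vectors: subtract keeps every element of the first collection that does not appear in the second. This is only an illustration of the operation's behavior, not the vega API; the `subtract` function and names below are hypothetical.

```rust
use std::collections::HashSet;

// Sketch of RDD::subtract semantics on plain slices: keep every element
// of `this` that does not appear in `other`. Illustrative only, not the
// actual vega implementation (which would operate per-partition).
fn subtract<T: Eq + std::hash::Hash + Clone>(this: &[T], other: &[T]) -> Vec<T> {
    let exclude: HashSet<&T> = other.iter().collect();
    this.iter()
        .filter(|x| !exclude.contains(*x))
        .cloned()
        .collect()
}

fn main() {
    let customers = vec!["alice", "bob", "carol", "dave"];
    let angry = vec!["bob", "dave"];
    let email_list = subtract(&customers, &angry);
    println!("{:?}", email_list); // ["alice", "carol"]
}
```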