Tracking issue: Implementation of missing core RDD ops
Opened this issue · 6 comments
iduartgomez commented
By core RDD ops we mean those that originate in the original Apache Spark in SparkContext and/or the base RDD class and friends:
SC:
- range
- filter
- randomSplit
- sortBy
- groupBy
- keyBy
- zipPartitions
- intersection
- pipe
- zip
- subtract
- treeAggregate
- treeReduce
- countApprox
- countByValue
- countByValueApprox
- min and max
- top
- takeOrdered
- isEmpty
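To illustrate one of the less obvious ops above: treeAggregate differs from plain aggregate by merging per-partition results in a multi-level tree instead of pulling them all to the driver at once. A minimal plain-Python sketch of those semantics (the `tree_aggregate` helper is hypothetical, not this project's API):

```python
from functools import reduce

# Sketch of treeAggregate semantics: fold each partition locally,
# then combine the partial results pairwise in rounds, like a
# reduction tree, instead of merging them all in one step.
def tree_aggregate(partitions, zero, seq_op, comb_op):
    # Stage 1: fold each partition with the sequential op.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Stage 2: combine partials in pairwise rounds (a binary tree).
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            pair = partials[i:i + 2]
            merged.append(pair[0] if len(pair) == 1 else comb_op(pair[0], pair[1]))
        partials = merged
    return partials[0] if partials else zero

parts = [[1, 2, 3], [4, 5], [6]]  # three "partitions"
total = tree_aggregate(parts, 0, lambda acc, x: acc + x, lambda a, b: a + b)
# total == 21
```

treeReduce is the same idea without a zero value; the tree shape matters in a distributed setting because it spreads the merge work across executors rather than funneling everything through the driver.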
Non-goals for this tracking issue are any I/O-related ops, as we are tracking those elsewhere and doing things a little differently:
- textFile
- wholeTextFiles
- binary files | binary records
- Hadoop* family of methods
iduartgomez commented
Intersection completed in #66
iduartgomez commented
range done in #82
GavrielPlotke commented
@iduartgomez - Isn't substract a misspelling of subtract?
rajasekarv commented
fixed @GavrielPlotke
ajprabhu09 commented
What would the subtract operation entail? Can someone give an example?
GavrielPlotke commented
Doc: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#subtract(org.apache.spark.rdd.RDD)
Example:
- I have a list of customers that I want to advertise to
- I have a list of angry customers who have said "DON'T TALK TO ME!"
email_list_rdd = customers_rdd.subtract(angry_rdd)
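The semantics of that example can be sketched in plain Python (the names `customers` and `angry` are illustrative, and this is not this project's API): subtract keeps the elements of the first dataset that have no match in the second, preserving duplicates in the first.

```python
# Sketch of RDD.subtract semantics as a plain list operation:
# return elements of `left` that do not appear in `right`.
def subtract(left, right):
    exclude = set(right)
    return [x for x in left if x not in exclude]

customers = ["ana@x.com", "bo@y.com", "cy@z.com", "bo@y.com"]
angry = ["bo@y.com"]

email_list = subtract(customers, angry)
# email_list == ["ana@x.com", "cy@z.com"]
```

In Spark the same thing happens per key across partitions (via a cogroup-style shuffle), but the observable result is this set-difference-like filtering of the left dataset.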