Lab41/hermes

`randomSplit()` produces incorrect results for the Git Dataset on Spark 1.4

Closed this issue · 1 comments

agude commented

The user vectors for the Git dataset are being "mangled" by this line:

```python
train_ratings, test_ratings = user_info.randomSplit([0.9, 0.1], 41)
```

If you recombine the test and train vectors, the new combined RDD has the right number of rows, but the number of distinct users and items changes. It looks like some duplicate rows are introduced and some rows are dropped.
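For reference, here is a minimal sketch of the sanity check we are running. The names and row layout are assumptions for illustration: `user_info` is taken to be an RDD of `(user, item, rating)` tuples loaded from a hypothetical `git_ratings.csv`.

```python
from pyspark import SparkContext

sc = SparkContext(appName="randomSplit-check")

# Assumed layout: one (user, item, rating) tuple per row.
user_info = sc.textFile("git_ratings.csv").map(
    lambda line: tuple(line.split(","))
)

train_ratings, test_ratings = user_info.randomSplit([0.9, 0.1], 41)
recombined = train_ratings.union(test_ratings)

# Total row counts match, but the distinct user/item counts do not.
print("original rows:   ", user_info.count())
print("recombined rows: ", recombined.count())
print("original users:  ", user_info.map(lambda r: r[0]).distinct().count())
print("recombined users:", recombined.map(lambda r: r[0]).distinct().count())
print("original items:  ", user_info.map(lambda r: r[1]).distinct().count())
print("recombined items:", recombined.map(lambda r: r[1]).distinct().count())
```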

This problem does not seem to affect any other dataset (we tested them all), and it is not present when run locally on my laptop with Spark 1.6. It only shows up when running on the cluster with Spark 1.4.

We have temporarily worked around this problem by saving the test and training vectors on my laptop and moving them to the cluster.
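Roughly, the workaround looks like the sketch below. The paths are placeholders, not the ones we actually use:

```python
# On the laptop (Spark 1.6), where randomSplit() behaves correctly:
train_ratings, test_ratings = user_info.randomSplit([0.9, 0.1], 41)
train_ratings.saveAsPickleFile("git_train_ratings")
test_ratings.saveAsPickleFile("git_test_ratings")

# After copying the saved directories to the cluster (Spark 1.4),
# load the pre-split vectors instead of calling randomSplit() there:
train_ratings = sc.pickleFile("git_train_ratings")
test_ratings = sc.pickleFile("git_test_ratings")
```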

The bug also persists on Spark 1.5.2.