sryza/aas

monotonically_increasing_id is not consecutive

henrykey opened this issue · 5 comments

Hi,
the doc ids generated with the following method are not consecutive:
val docIdsDF = docTermFreqs.withColumn("id", monotonically_increasing_id)
so when a doc id is looked up with a score id that was generated a different way, the ids don't match and an error is raised:
val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId()
topDocs += docWeights.top(numDocs).map { case (score, id) => (docIds(id), score) }
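
For context, monotonically_increasing_id only guarantees unique, monotonically increasing values: it packs the partition id into the upper 31 bits and the within-partition row number into the lower 33 bits, so the ids jump at partition boundaries. A minimal sketch of the effect (assuming a SparkSession named spark; the names here are illustrative, not from the chapter):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Six rows spread over three partitions.
val df = spark.range(0, 6).repartition(3)
df.withColumn("docId", monotonically_increasing_id).show()
// Typical output: docId values such as 0, 1, 8589934592, 8589934593, 17179869184, ...
// (8589934592 = 1 << 33, the first row of partition 1), so the ids are unique
// but not 0, 1, 2, 3, ... and will not line up with ids produced by
// zipWithUniqueId on a separate RDD.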

I'm referring to your updated version of chapter 6 (LSA).

How can I work around this? Should I generate the doc ids with zipWithUniqueId in the same way?

regards,
Henry

@sryza does zipWithIndex work here?

Yes, zipWithIndex works. This is what I'm using now (with zipWithUniqueId):
val docIds = docTermFreqs.rdd.map(t => t.getAsString)
  .zipWithUniqueId().map(f => (f._2, f._1)).collect().toMap
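
As an aside (not from the chapter code): either zip method works here, as long as the same one is applied to both RDDs so the generated ids agree. A small sketch of the difference, assuming a SparkContext named sc:

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// zipWithIndex assigns consecutive ids 0..n-1, but runs an extra Spark job
// to count the elements in each partition first.
rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2), (d,3))

// zipWithUniqueId assigns k, n+k, 2n+k, ... within partition k (n = number of
// partitions): unique and deterministic, but not consecutive; no extra job.
rdd.zipWithUniqueId().collect() // Array((a,0), (b,2), (c,1), (d,3))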

sryza commented

@henrykey thanks for catching this.

I needed to change the code as follows, since getAsString is not available on Row in Spark 2.0.2, which is what I tested against:

val docIds = docTermFreqs.rdd.map(t => t.getString(0)).zipWithUniqueId().map(f => (f._2, f._1)).collect().toMap
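
Putting the pieces from this thread together (u, i, numDocs, and docTermFreqs are assumed from the surrounding chapter code), the lookup works because both sides now key their rows with zipWithUniqueId, which yields matching ids as long as docTermFreqs and u.rows keep the same row order and partitioning:

// Map from generated id to document title, built with zipWithUniqueId.
val docIds: Map[Long, String] = docTermFreqs.rdd
  .map(t => t.getString(0))
  .zipWithUniqueId()
  .map { case (title, id) => (id, title) }
  .collect()
  .toMap

// Row weights of U for term column i, keyed with the same id scheme,
// so docIds(id) resolves for every (score, id) pair.
val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId()
val topDocs = docWeights.top(numDocs).map { case (score, id) => (docIds(id), score) }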

sryza commented

This should be fixed by #104. Thanks for reporting!