monotonically_increasing_id is not consecutive
henrykey opened this issue · 5 comments
Hi,
The doc ids generated with the following method are not consecutive:
val docIdsDF = docTermFreqs.withColumn("id", monotonically_increasing_id)
As a result, when a doc id is looked up using a score id that was generated a different way, the two do not match and an error is raised:
val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId
topDocs += docWeights.top(numDocs).map{case (score, id) => (docIds(id), score)}
I am referring to your updated version of Chapter 6 (LSA).
How can I solve this issue? Should I generate the doc ids the same way, with zipWithUniqueId?
Regards,
Henry
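For anyone hitting the same mismatch, here is a rough illustration of why the two id schemes disagree. As documented for Spark, monotonically_increasing_id puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits, while zipWithUniqueId gives the k-th item of partition p (out of n partitions) the id k * n + p. The sketch below only simulates those two formulas in plain Scala, without Spark. The names IdSchemes, monotonicId and uniqueId are made up for illustration, and the bit layout describes the current implementation rather than a guaranteed contract.

```scala
// Plain-Scala simulation (no Spark required) of how the two schemes number rows.
object IdSchemes {
  // monotonically_increasing_id: partition index in the upper bits,
  // record number within the partition in the lower 33 bits.
  def monotonicId(partition: Int, pos: Long): Long =
    (partition.toLong << 33) + pos

  // RDD.zipWithUniqueId: the k-th item of partition p (out of n partitions)
  // receives the id k * n + p.
  def uniqueId(partition: Int, pos: Long, numPartitions: Int): Long =
    pos * numPartitions + partition

  def main(args: Array[String]): Unit = {
    val numPartitions = 2 // assumed partition count, for illustration only
    for (p <- 0 until numPartitions; k <- 0L until 2L) {
      val m = monotonicId(p, k)
      val u = uniqueId(p, k, numPartitions)
      println(s"partition=$p pos=$k monotonic=$m unique=$u")
    }
  }
}
```

With two partitions, the first row of partition 1 gets 8589934592 from the first formula but 1 from the second, so a map keyed by one scheme cannot be indexed with ids from the other.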
Yes, it works with zipWithUniqueId:
val docIds=docTermFreqs.rdd.map(t =>t.getAsString)
.zipWithUniqueId().map(f=>(f._2,f._1)).collect().toMap
I had to change the code as follows, because getAsString is not available on Row in Spark 2.0.2, the version I tested in my environment:
val docIds = docTermFreqs.rdd.map(t => t.getString(0)).zipWithUniqueId().map(f=>(f._2, f._1)).collect().toMap
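To check the fix end to end, a small local run along these lines may help. It is only a sketch under assumptions: the three-row DataFrame is a hypothetical stand-in for docTermFreqs (title in column 0, a single score in column 1), the score column stands in for one column of u.rows, and DocIdLookup and topDoc are invented names. The point it tries to show is that the titles and the scores are both numbered by zipWithUniqueId over the same partitioning, so the lookup succeeds.

```scala
import org.apache.spark.sql.SparkSession

object DocIdLookup {
  // Returns the (title, score) pair with the highest score.
  def topDoc(spark: SparkSession): (String, Double) = {
    import spark.implicits._

    // Hypothetical stand-in for docTermFreqs, spread over two partitions.
    val docs = spark.sparkContext
      .parallelize(Seq(("Doc A", 0.1), ("Doc B", 0.9), ("Doc C", 0.4)), 2)
      .toDF("title", "score")
    val rows = docs.rdd

    // id -> title, numbered with zipWithUniqueId as in the comment above.
    val docIds = rows.map(_.getString(0)).zipWithUniqueId().map(_.swap).collect().toMap

    // (score, id), numbered the same way over the same partitions.
    val docWeights = rows.map(_.getDouble(1)).zipWithUniqueId()

    // Highest score first; tuples are ordered by score, then by id.
    docWeights.top(1).map { case (score, id) => (docIds(id), score) }.head
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("doc-id-lookup").getOrCreate()
    try println(topDoc(spark)) finally spark.stop()
  }
}
```

If docIds were instead built from the monotonically_increasing_id column, docIds(id) could throw a NoSuchElementException for rows outside the first partition, which matches the error described at the top of this issue.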