sryza/aas

monotonically_increasing_id is not consecutive

henrykey opened this issue · 5 comments

Hi,
the doc ids generated with the following method are not consecutive:
val docIdsDF = docTermFreqs.withColumn("id", monotonically_increasing_id)
so when a doc id is looked up with a score id that was generated a different way, the ids don't match and an error is raised:
val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId()
topDocs += docWeights.top(numDocs).map { case (score, id) => (docIds(id), score) }
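
For context, monotonically_increasing_id only guarantees unique, monotonically increasing values: it packs the partition id into the upper 31 bits and the within-partition row number into the lower 33 bits, so the ids jump at partition boundaries. A minimal sketch of the effect (assuming a SparkSession named spark; the names here are illustrative, not from the chapter):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Six rows spread over three partitions.
val df = spark.range(0, 6).repartition(3)
df.withColumn("docId", monotonically_increasing_id).show()
// Typical output: docId values such as 0, 1, 8589934592, 8589934593, 17179869184, ...
// (8589934592 = 1 << 33, the first row of partition 1), so the ids are unique
// but not 0, 1, 2, 3, ... and will not line up with ids produced by
// zipWithUniqueId on a separate RDD.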

I'm referring to your updated version of chapter 6 (LSA).

How can I work around this? Should I generate the doc ids with zipWithUniqueId in the same way?

regards,
Henry

@sryza does zipWithIndex work here?

Yes, zipWithIndex works. This is what I'm using now (with zipWithUniqueId):
val docIds = docTermFreqs.rdd.map(t => t.getAsString)
  .zipWithUniqueId().map(f => (f._2, f._1)).collect().toMap
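
As an aside (not from the chapter code): either zip method works here, as long as the same one is applied to both RDDs so the generated ids agree. A small sketch of the difference, assuming a SparkContext named sc:

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// zipWithIndex assigns consecutive ids 0..n-1, but runs an extra Spark job
// to count the elements in each partition first.
rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2), (d,3))

// zipWithUniqueId assigns k, n+k, 2n+k, ... within partition k (n = number of
// partitions): unique and deterministic, but not consecutive; no extra job.
rdd.zipWithUniqueId().collect() // Array((a,0), (b,2), (c,1), (d,3))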

sryza commented

@henrykey thanks for catching this.

I needed to change the code as follows, since getAsString is not available on Row in Spark 2.0.2, which is what I tested against:

val docIds = docTermFreqs.rdd.map(t => t.getString(0)).zipWithUniqueId().map(f => (f._2, f._1)).collect().toMap
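
Putting the pieces from this thread together (u, i, numDocs, and docTermFreqs are assumed from the surrounding chapter code), the lookup works because both sides now key their rows with zipWithUniqueId, which yields matching ids as long as docTermFreqs and u.rows keep the same row order and partitioning:

// Map from generated id to document title, built with zipWithUniqueId.
val docIds: Map[Long, String] = docTermFreqs.rdd
  .map(t => t.getString(0))
  .zipWithUniqueId()
  .map { case (title, id) => (id, title) }
  .collect()
  .toMap

// Row weights of U for term column i, keyed with the same id scheme,
// so docIds(id) resolves for every (score, id) pair.
val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId()
val topDocs = docWeights.top(numDocs).map { case (score, id) => (docIds(id), score) }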

sryza commented

This should be fixed by #104. Thanks for reporting!