Question about the code in spark.py
Thanks for the great code. I have some trouble understanding the following piece:
```python
a = edges
while True:
    b = a.flatMap(large_star_map).groupByKey().flatMap(large_star_reduce).distinct().cache()
    a = b.map(small_star_map).groupByKey().flatMap(small_star_reduce).distinct().cache()
    changes = a.subtract(b).union(b.subtract(a)).collect()
    if len(changes) == 0:
        break
```
Is this code used to cluster all the docs by the generated edges (which indicate duplication) and then remove the duplicated docs? If so, is the variable `a` the set of texts after dedup?
This particular piece is indeed clustering edges into connected components, as described in the paper "Connected Components in MapReduce and Beyond": https://dl.acm.org/doi/10.1145/2670979.2670997.
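For reference, the four helper functions implement the alternating large-star / small-star operations from that paper. Here is a minimal sketch of what they might look like, paraphrased from the algorithm itself rather than copied from spark.py, so names and details may differ from the actual implementation:

```python
def large_star_map(edge):
    # Emit the edge in both directions so every node sees its full neighborhood.
    return [(edge[0], edge[1]), (edge[1], edge[0])]

def large_star_reduce(group):
    # Connect every neighbor larger than x to the smallest node seen so far.
    x, neighbors = group
    nodes = [x] + list(neighbors)
    minimum = min(nodes)
    return [(n, minimum) for n in nodes if n > x]

def small_star_map(edge):
    # Orient each edge so the larger endpoint becomes the key.
    x, y = edge
    return (x, y) if y <= x else (y, x)

def small_star_reduce(group):
    # Connect x and all of its (smaller) neighbors to the minimum among them.
    x, neighbors = group
    nodes = [x] + list(neighbors)
    minimum = min(nodes)
    return [(n, minimum) for n in nodes if n != minimum]
```

Alternating the two operations converges to a state where every node points directly to the smallest id in its connected component, which is why the loop stops once an iteration produces no changes.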
To give an example, the input (edges) before the while loop:

| x | y |
|---|---|
| 2 | 1 |
| 2 | 3 |
The output after the while loop is something like this:

| id | component |
|----|-----------|
| 2  | 1         |
| 3  | 1         |
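For the curious, here is a small self-contained demo of the loop on exactly that input, assuming a local pyspark installation and the sketched helper functions above:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cc-demo")
a = sc.parallelize([(2, 1), (2, 3)])  # the example edges

while True:
    b = a.flatMap(large_star_map).groupByKey().flatMap(large_star_reduce).distinct().cache()
    a = b.map(small_star_map).groupByKey().flatMap(small_star_reduce).distinct().cache()
    changes = a.subtract(b).union(b.subtract(a)).collect()
    if len(changes) == 0:
        break

print(sorted(a.collect()))  # [(2, 1), (3, 1)]
sc.stop()
```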
But the duplicates are not removed here; they are removed in text-dedup/text_dedup/spark.py, line 270 (commit 14ea1ed).
Any node that does have a component id assigned, in this case 2 and 3, will be removed. Node 1 is kept, along with any node that is not included in the edges in the first place.
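In other words, the removal step boils down to something like the following sketch (`results` and `docs` are hypothetical names for illustration, not the actual variables in spark.py):

```python
# results: the (id, component) pairs produced by the while loop, e.g. [(2, 1), (3, 1)]
# docs:    the original documents as (id, text) pairs
duplicate_ids = {node for node, _ in results.collect()}  # every id assigned a component: {2, 3}
deduped = docs.filter(lambda pair: pair[0] not in duplicate_ids)  # keeps 1 and any doc with no edges
```

In the real pipeline the id set would typically be broadcast to the workers rather than captured in a closure, but the effect is the same.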
If interested, there is a presentation from one of the authors.
@ChenghaoMou @KeremTurgutlu thanks for the explanations, that's helpful!