Question about code of spark.py

thanks for the great code, I have some understanding problems about the following code:

a = edges
    while True:
        b = a.flatMap(large_star_map).groupByKey().flatMap(large_star_reduce).distinct().cache()
        a = b.map(small_star_map).groupByKey().flatMap(small_star_reduce).distinct().cache()
        changes = a.subtract(b).union(b.subtract(a)).collect()
        if len(changes) == 0:
            break

this code is used to clustering the total docs by the generated edges(which means duplication) and remove the duplicated docs? If so, variable a is the texts after dedup?

This particular piece is indeed clustering edges into connected components. This is described in paper https://dl.acm.org/doi/10.1145/2670979.2670997.

To give an example, the input (edges) before the while loop:

x	y
2	1
2	3

The output after the while loop is something like this:

id	component
2	1
3	1

But the duplicates are not removed here, they are removed at line

text-dedup/text_dedup/spark.py

Line 270 in 14ea1ed

df = df.filter(F.col("component").isNull()).drop("__id__", "component").cache()

Any node that does have a component id assigned, in this case 2 and 3 will be removed. 1 is kept, along with any node that is not included in the edges in the first place.

If interested, there is a presentation from one of the authors.

@ChenghaoMou @KeremTurgutlu thanks for your presentations, that's helpful