Using the Min-Hash algorithm to compare different docs’ similarities.
We skip the step of splitting words.
It’s a simple and crude code implementation in Python in O(N^3) complexity.
You may find many redundant data structures (forgive it, it’s just derived from a tiny homework), but the whole process follows the origin theory clearly.