This is a project that Analyze the Large Dataset "RedPajama". Our purpose is to get the duplicated documents pair in the project and then remove most of them to get a deduplicated dataset.
We use three thechniques to analyze it.
- Overlapjoin
- OvplJoinRS
- Setjoin
OverlapJoin for two sets (R and S). In our implementation, we treat the bottom-k of corpus as set R and the unqiue set of the corpus as set S. So that we can find all the document pairs <a,b> such that there exists a segment in document b whose jaccard similarity with document a exceeds threshold theta