RedPajama_Analysis

This is a project that Analyze the Large Dataset "RedPajama". Our purpose is to get the duplicated documents pair in the project and then remove most of them to get a deduplicated dataset.

Analysis Tools

We use three thechniques to analyze it.

Overlapjoin
OvplJoinRS
Setjoin

OvplJoinRS

OverlapJoin for two sets (R and S). In our implementation, we treat the bottom-k of corpus as set R and the unqiue set of the corpus as set S. So that we can find all the document pairs <a,b> such that there exists a segment in document b whose jaccard similarity with document a exceeds threshold theta

rjzhb/RedPajama_Analysis

RedPajama_Analysis

Analysis Tools

OvplJoinRS