Cluster text data using a combination of min_hash clustering and incremental clustering. Applying min_hash clustering allows near-duplicate text to be identified efficiently.
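To illustrate the general idea, here is a minimal sketch of MinHash signatures combined with a simple incremental (single-pass) clustering loop. It is not this project's implementation: the function names, the word-shingle size `k`, the number of hash functions, and the similarity threshold are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams in text (k is an assumed setting)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=64, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def incremental_cluster(docs, threshold=0.5):
    """Assign each document to the first similar cluster, else open a new one.

    Returns one label per document, in input order.
    """
    representatives = []  # signature of the first document seen in each cluster
    labels = []
    for doc in docs:
        sig = minhash_signature(doc)
        for label, rep in enumerate(representatives):
            if estimated_similarity(sig, rep) >= threshold:
                labels.append(label)
                break
        else:
            representatives.append(sig)
            labels.append(len(representatives) - 1)
    return labels
```

Comparing each new document only against one representative per cluster keeps the pass cheap, at the cost of making the result depend on input order.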
- Each line in the input file is treated as one document to be clustered.
- Format: A &#& B &#& Text &#& D
- The input data format and delimiter can be changed as needed.
- The output follows the same order as the input data, with each document paired with its cluster label.
- The elements of each cluster can be saved as well.
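As a rough sketch of how the line format above might be handled, the following helpers pull the text field out of a `&#&`-delimited line and group input lines by cluster label. The function names, the assumption that the text sits in the third field, and the grouping structure are hypothetical and not taken from this project's code.

```python
from collections import defaultdict


def extract_text(line, delimiter="&#&", text_index=2):
    """Split one input line and return the text field (third field by assumption)."""
    fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
    if len(fields) <= text_index:
        raise ValueError(f"expected at least {text_index + 1} fields, got {len(fields)}")
    return fields[text_index]


def group_by_cluster(lines, labels):
    """Collect the original lines under their cluster label, keeping input order."""
    clusters = defaultdict(list)
    for line, label in zip(lines, labels):
        clusters[label].append(line.rstrip("\n"))
    return dict(clusters)
```

For example, `extract_text("A &#& B &#& Text &#& D")` returns `"Text"`; passing a different `delimiter` or `text_index` adapts the helper to another input layout.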