Shirlly/Min_hash_Incremental_Clustering

Java

Min_hash_Incremental_Clustering

Clusters text data by combining min-hash clustering with incremental clustering. Min-hash clustering makes it possible to identify near-duplicate texts efficiently.
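As a rough illustration of the min-hash idea (not the code in this repository), the sketch below builds a signature for each document from word shingles and estimates Jaccard similarity as the fraction of matching signature positions. The class name `MinHashSketch`, the shingle size, the number of hash functions, and the affine hash family are all assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of this repository.
class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a; // multipliers of the affine hash functions
    private final long[] b; // offsets of the affine hash functions

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Splits text into overlapping k-word shingles (assumed tokenization: whitespace).
    static Set<String> shingles(String text, int k) {
        String normalized = text.toLowerCase().trim();
        String[] tokens = normalized.split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        if (out.isEmpty()) {
            out.add(normalized); // document shorter than k words
        }
        return out;
    }

    // For each hash function, keeps the minimum hash value over all shingles.
    long[] signature(Set<String> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String s : shingles) {
            long x = Math.floorMod((long) s.hashCode(), PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where the two signatures agree; estimates Jaccard similarity.
    static double similarity(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }
}
```

Two documents whose estimated similarity exceeds a chosen threshold can then be treated as near duplicates. More hash functions give a more stable estimate at the cost of longer signatures.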

Input data:

Each line in the input file is treated as one document to be clustered.
Format: A &#& B &#& Text &#& D
The input data format and delimiter can be changed as needed.
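A minimal sketch of reading one line in this format, assuming the delimiter is the literal string `&#&` and the text to cluster is the third field. The class and method names here are hypothetical and do not come from this repository.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of this repository.
class InputLineParser {
    // Assumed delimiter; change it here if the input format differs.
    static final String DELIMITER = "&#&";

    // Splits a line on the literal delimiter and trims surrounding whitespace.
    static String[] splitFields(String line) {
        // Pattern.quote makes split treat the delimiter as plain text, not a regex.
        String[] parts = line.split(Pattern.quote(DELIMITER), -1);
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim();
        }
        return parts;
    }

    // Returns the third field, which holds the document text in the format above.
    static String textField(String line) {
        String[] fields = splitFields(line);
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 fields: " + line);
        }
        return fields[2];
    }
}
```

For example, `InputLineParser.textField("A &#& B &#& some tweet text &#& D")` returns `"some tweet text"`.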

Output data:

Documents are written in the same order as the input data, each with its corresponding cluster label.
The elements of each cluster can also be saved.
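To show how labels can line up with the input order, here is a simplified sketch of incremental (leader-style) clustering. It is an assumption about the general approach, not this repository's implementation: it uses exact Jaccard similarity on word sets as a stand-in for the min-hash estimate, and the class name, threshold, and tokenization are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of this repository.
class IncrementalClusteringSketch {
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    static double jaccard(Set<String> x, Set<String> y) {
        Set<String> inter = new HashSet<>(x);
        inter.retainAll(y);
        int union = x.size() + y.size() - inter.size();
        return union == 0 ? 1.0 : (double) inter.size() / union;
    }

    // Returns one cluster label per document, in the same order as the input.
    static List<Integer> cluster(List<String> docs, double threshold) {
        List<Set<String>> leaders = new ArrayList<>(); // first document of each cluster
        List<Integer> labels = new ArrayList<>();
        for (String doc : docs) {
            Set<String> t = tokens(doc);
            int best = -1;
            double bestSim = threshold;
            for (int c = 0; c < leaders.size(); c++) {
                double sim = jaccard(t, leaders.get(c));
                if (sim >= bestSim) {
                    best = c;
                    bestSim = sim;
                }
            }
            if (best < 0) { // no cluster is similar enough: start a new one
                leaders.add(t);
                best = leaders.size() - 1;
            }
            labels.add(best);
        }
        return labels;
    }

    // Groups document indices by cluster label, for saving cluster elements.
    static Map<Integer, List<Integer>> members(List<Integer> labels) {
        Map<Integer, List<Integer>> out = new TreeMap<>();
        for (int i = 0; i < labels.size(); i++) {
            out.computeIfAbsent(labels.get(i), k -> new ArrayList<>()).add(i);
        }
        return out;
    }
}
```

Because each document is assigned as soon as it is read, the label list has the same length and order as the input, and `members` recovers the elements of each cluster from it.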