Picks documents from Redis and creates clusters using Apache Mahout clustering tools.
Place the documents in a local Redis instance:
documents.count should contain the total number of documents. The application fetches documents numbered from 1 to count.
document.1 .. documents.N should contain the documents.
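The key layout above can be sketched as a small helper that builds the entries to store. This is only an illustration under stated assumptions: it assumes the per-document keys are documents.1 through documents.N holding plain string values (the README writes the first key as document.1, so check the exact prefix against the application code), and RedisLayoutSketch and buildEntries are hypothetical names that are not part of this project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project: builds the key/value pairs
// that would need to be stored in Redis before running the clustering.
final class RedisLayoutSketch {

    // Assumption: document keys use the "documents." prefix and are 1-based.
    static Map<String, String> buildEntries(List<String> documents) {
        Map<String, String> entries = new LinkedHashMap<>();
        // documents.count holds the total number of documents.
        entries.put("documents.count", String.valueOf(documents.size()));
        for (int i = 0; i < documents.size(); i++) {
            entries.put("documents." + (i + 1), documents.get(i));
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> entries =
                buildEntries(Arrays.asList("first document text", "second document text"));
        // Print the pairs; each one could then be written with a Redis SET.
        entries.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Each printed pair corresponds to one Redis SET, issued with whichever Redis client you prefer.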
mvn clean compile exec:java -Dexec.mainClass=in.vivekjain.document.clustering.Main
If you have thousands of documents, you might need to provide additional memory (for example, export MAVEN_OPTS="-Xmx10240m").
The final output will list all documents, each prefixed by the following information:
cluster-id weight distance document-name
You can tweak the Result.toString method to get the desired output.
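If you post-process the output, a line in the format above could be parsed roughly as follows. This is a sketch under assumptions: it assumes the four fields are separated by whitespace, that weight and distance are decimal numbers, and that the document name is whatever remains on the line. ResultLineSketch is a hypothetical class, not the project's Result, and the sample line in main is made up for illustration rather than real program output.

```java
// Hypothetical parser for one output line of the form:
//   cluster-id weight distance document-name
final class ResultLineSketch {
    final String clusterId;
    final double weight;
    final double distance;
    final String documentName;

    ResultLineSketch(String clusterId, double weight, double distance, String documentName) {
        this.clusterId = clusterId;
        this.weight = weight;
        this.distance = distance;
        this.documentName = documentName;
    }

    // Assumption: fields are whitespace-separated; the limit of 4 keeps any
    // spaces inside the document name intact.
    static ResultLineSketch parse(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 4) {
            throw new IllegalArgumentException("Expected 4 fields but got: " + line);
        }
        return new ResultLineSketch(
                parts[0],
                Double.parseDouble(parts[1]),
                Double.parseDouble(parts[2]),
                parts[3]);
    }

    public static void main(String[] args) {
        // Illustrative line only, not captured from the application.
        ResultLineSketch r = parse("3 1.0 0.42 documents.17");
        System.out.println(r.clusterId + " -> " + r.documentName);
    }
}
```

If you change Result.toString, adjust the parsing to match the new field order or separator.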
- If you have too many documents to cluster, you might face an OutOfMemory error even after providing additional memory. You will need to either reduce the number of documents in Redis or set the DOCUMENTS_TO_CLUSTER environment variable to limit the number of documents picked up. For example, export DOCUMENTS_TO_CLUSTER="1000" limits clustering to 1000 documents.
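One way such a limit could be applied is sketched below. This is not the application's actual code: DocumentLimitSketch and effectiveCount are hypothetical names, and the sketch assumes that an unset or empty DOCUMENTS_TO_CLUSTER means "no limit" and that the value is a non-negative integer.

```java
// Hypothetical sketch of capping the number of documents to cluster.
final class DocumentLimitSketch {

    // Returns how many documents to fetch, given the total stored in Redis
    // (documents.count) and the raw value of DOCUMENTS_TO_CLUSTER.
    static long effectiveCount(long totalInRedis, String limitValue) {
        // Assumption: a missing or blank variable means no limit.
        if (limitValue == null || limitValue.trim().isEmpty()) {
            return totalInRedis;
        }
        long limit = Long.parseLong(limitValue.trim());
        if (limit < 0) {
            throw new IllegalArgumentException("Limit must be non-negative: " + limit);
        }
        // Never ask for more documents than Redis actually holds.
        return Math.min(totalInRedis, limit);
    }

    public static void main(String[] args) {
        String raw = System.getenv("DOCUMENTS_TO_CLUSTER");
        // 5000 is an arbitrary stand-in for the value of documents.count.
        System.out.println("Documents to cluster: " + effectiveCount(5000, raw));
    }
}
```

Taking the minimum of the two values keeps the fetch loop within the range 1 to count even when the limit exceeds the number of stored documents.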