HierarchicalClustering: A repository from l0he1g

Hierarchical Clustering for Documents Data

There is a modified hierarchial clustering method for large scale and
sparse data, such as documents corpus, which only need one scan of
origin data. The key points different from tranditional hierarchial
clustering are
1. A dimAarray structure is utilized to speed up similar clusters
search
2. A diameter parameter 'threshold' is used to control aggregation: during the
doc travesal process, current doc is combined into the closest
clusters if their distance is less than 'threshold', else it forms
a new clusters.

Distance between a doc and a cluster is computed by the Euclidean
distance between the doc and the avarage of documents in the cluster.

Input(setting in config.in)
1. doc.txt each line is a doc in format:
id:v,id:v,...

Output
1. result_pt in config.in
format:(output:centroid docs)
w1:weight,w2:weight,... doc1, doc2, ...

Data Structure
1.Word A word unit, with member variables:
long id;
double v;
2.Doc A document unit, with member variables:
long id;
vector<Word> m_words; //weight of words are normalized in doc
3.Cluster A cluster unit with member variables:
long m_id;
//Used to compute distance between two clusters, may change to be mean or others.
list<Word> m_centroid;
vector<const Doc*> m_docs;
4.DimArray Index clusters by words, used to speed up the similiar clusters search process.
vector< list<Cluster*> > dim;
This data structure is descripted in "Context-Aware
Query Suggestion by Mining Click-Through and
Session Data", Huanhuan Cao et.al. KDD 2008
5.Clustering Main work class

NOTICE:
The conifig.in should be UNIX format, dos2unix command will help to
convert a dos format to UNIX one. Otherwise, the parameters may will
not resolved (string.substr() will return wrong reuslt).

聚类流程
1. 逐个文档，从前往后，依次生成clusters
1) 第一个文档，对应第一个cluster
2) 从第二个以后的文档，先往前找最近的cluster(通过dimarray缩小检索范
围)，如果相似度大于阈值，则合并；反之，生成一个独立的cluster

改进策略

l0he1g/HierarchicalClustering