/HierarchicalClustering

Hierarchical Clustering for short texts in web-scale

Hierarchical Clustering for Documents Data

There is a modified hierarchial clustering method for large scale and
sparse data, such as documents corpus, which only need one scan of
origin data. The key points different from tranditional hierarchial
clustering are
	1. A dimAarray structure is utilized to speed up similar clusters
	search
    2. A diameter parameter 'threshold' is used to control aggregation: during the
	doc travesal process, current doc is combined into the closest
	clusters if their distance is less than 'threshold', else it forms
	a new clusters.

Distance between a doc and a cluster is computed by the Euclidean
distance between the doc and the avarage of documents in the cluster.

Input(setting in config.in)
    1. doc.txt    each line is a doc in format:
	   id:v,id:v,...

Output
    1. result_pt in config.in
	   format:(output:centroid    docs)
	          w1:weight,w2:weight,...   doc1, doc2, ...

Data Structure
	 1.Word  A word unit, with member variables:
	 		  long id;
			  double v;
	 2.Doc   A document unit, with member variables:
	 		  long id;
			  vector<Word> m_words;  //weight of words are normalized in doc
	 3.Cluster  A cluster unit with member variables:
	 	     	 long m_id;
				 //Used to compute distance between two clusters, may change to be mean or others.
				 list<Word> m_centroid;   
				 vector<const Doc*> m_docs;
	 4.DimArray  Index clusters by words, used to speed up the similiar clusters search process.
	  			 vector< list<Cluster*> > dim;
				 This data structure is descripted in "Context-Aware
				 	 Query Suggestion by Mining Click-Through and
					 Session Data", Huanhuan Cao et.al. KDD 2008
   	 5.Clustering  Main work class
	 
NOTICE:
     The conifig.in should be UNIX format, dos2unix command will help to
     convert a dos format to UNIX one. Otherwise, the parameters may will
     not resolved (string.substr() will return wrong reuslt).

 聚类流程
1. 逐个文档,从前往后,依次生成clusters
   1) 第一个文档,对应第一个cluster
   2) 从第二个以后的文档,先往前找最近的cluster(通过dimarray缩小检索范
   围),如果相似度大于阈值,则合并;反之,生成一个独立的cluster

改进策略