Clustering of ordered sequences, such as sentences, where the order of the elements matters. The core is based on Dynamic Time Warping (DTW), which measures the difference between individual sequences, and agglomerative clustering, which builds the clusters from the bottom up.
Let's take a sample set of sequences and cluster them:
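To illustrate the DTW part, here is a minimal sketch of the classic dynamic-programming formulation with a binary (0/1) distance between elements. The names `binary_distance` and `dtw_distance` are hypothetical and are not taken from this project. The sketch returns the raw accumulated cost; this project may normalize or compute the distance differently.

```python
def binary_distance(a, b):
    """Assumed element distance: 0 if the two elements are equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def dtw_distance(seq_a, seq_b, dist=binary_distance):
    """Accumulated cost of the cheapest DTW alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # repeat item of seq_b
                                 cost[i][j - 1],      # repeat item of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

For example, `dtw_distance("hi how you are".split(), "hi where you are".split())` gives `1.0` in this sketch, because only one word differs.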
Hi there, how are you?
hi how you are
i like to sing
I am going to sing
hi where you are
hi are you there...
do you sing???
With a binary distance between blocks and the number of clusters set to 3, we get the following results:
sentence | cluster members (0-based indices) | internal distance |
---|---|---|
hi how you are | [0, 1, 4, 5] | 0.095313 |
i like to sing | [2, 3] | 0.100000 |
do you sing | [6] | 0.000000 |
The agglomerative clustering has two stop criteria: the number of clusters and the maximal internal distance within a cluster. The "pivot", the most representative sample of a cluster, is the member with the minimal distance to all other members of its own cluster.
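The two stop criteria and the pivot selection can be sketched as follows. This is an illustration under stated assumptions, not this project's implementation: the function names are hypothetical, the input is a precomputed symmetric distance matrix, clusters are merged by average linkage, and the "internal distance" is taken to be the mean distance from the pivot to the other members. How the project actually combines the two criteria may differ.

```python
def pivot_and_internal_distance(cluster, dist):
    """Return (pivot index, mean distance from the pivot to the other members)."""
    best, best_total = cluster[0], float("inf")
    for i in cluster:
        total = sum(dist[i][j] for j in cluster)
        if total < best_total:
            best, best_total = i, total
    others = len(cluster) - 1
    return best, (best_total / others if others else 0.0)


def agglomerative(dist, n_clusters=1, max_internal=None):
    """Merge bottom-up until a stop criterion is met (assumed semantics)."""
    clusters = [[i] for i in range(len(dist))]  # start with singletons
    while len(clusters) > n_clusters:  # stop criterion 1: cluster count
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage between the two candidate clusters (assumption)
                pairs = len(clusters[a]) * len(clusters[b])
                link = sum(dist[i][j]
                           for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        # stop criterion 2: the closest merge would exceed the allowed
        # internal distance
        if (max_internal is not None
                and pivot_and_internal_distance(merged, dist)[1] > max_internal):
            break
        clusters[a] = merged
        del clusters[b]
    return clusters
```

The matrix `dist` can be filled with any pairwise sequence distance, for instance a DTW distance between tokenized sentences.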