/pySeqClust

clustering of time sequences

Primary language: Python. License: MIT.

Sequence Clustering


Clustering of ordered sequences, such as sentences, where the order of the elements matters. The core combines Dynamic Time Warping (DTW), which measures the difference between individual sequences, with agglomerative clustering, which builds the clusters bottom-up.
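To make the distance step concrete, here is a minimal sketch of DTW over token sequences with a binary element distance. It is an illustration under stated assumptions, not the package's own code: the names `binary_distance` and `dtw_distance` are hypothetical, and the result is the raw accumulated cost without any length normalization the package may apply.

```python
def binary_distance(a, b):
    """Hypothetical element distance: 0 for equal items, 1 otherwise."""
    return 0.0 if a == b else 1.0


def dtw_distance(seq_a, seq_b, elem_dist=binary_distance):
    """Classic dynamic-programming DTW; returns the accumulated cost
    of the cheapest monotonic alignment between the two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = elem_dist(seq_a[i - 1], seq_b[j - 1])
            # extend the cheapest of: step in seq_a, step in seq_b, step in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


same = dtw_distance("hi how you are".split(), "hi how you are".split())
near = dtw_distance("hi how you are".split(), "hi where you are".split())
print(same, near)
```

Identical sentences give a cost of 0, and the two sentences that differ in a single word give a cost of 1, because only one aligned pair of words mismatches.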

Let's take a sample set of sequences and cluster them:

Hi there, how are you?
hi how you are
i like to sing
I am going to sing
hi where you are
hi are you there...
do you sing???

With a binary distance between blocks and the number of clusters set to 3, we get the following results:

sentence         clusters       internal dist.
hi how you are   [0, 1, 4, 5]   0.095313
i like to sing   [2, 3]         0.100000
do you sing      [6]            0.000000

The agglomerative clustering has two stop criteria: the number of clusters and the maximal internal distance inside a cluster. The "pivot", the most representative sample of a cluster, is the one with the minimal distance to all other samples in its own cluster.
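The two stop criteria and the pivot selection can be sketched as follows. This is an illustrative sketch rather than the package's implementation: the function names and parameters (`agglomerate`, `nb_clusters`, `max_internal_dist`, `pivot`) are hypothetical, and both the definition of internal distance (mean pairwise distance) and the merge rule (merge the pair whose union has the smallest internal distance) are assumptions. The input is a precomputed symmetric matrix of pairwise distances.

```python
def internal_distance(cluster, dist):
    """Assumed definition: mean pairwise distance (0.0 for a singleton)."""
    if len(cluster) < 2:
        return 0.0
    pairs = [(i, j) for k, i in enumerate(cluster) for j in cluster[k + 1:]]
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def agglomerate(dist, nb_clusters=1, max_internal_dist=None):
    """Bottom-up clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]  # start from singletons
    # Stop criterion 1: the requested number of clusters is reached.
    while len(clusters) > max(nb_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = internal_distance(clusters[a] + clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Stop criterion 2: the cheapest merge would exceed the limit.
        if max_internal_dist is not None and d > max_internal_dist:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def pivot(cluster, dist):
    """Sample with the minimal summed distance to the rest of its cluster."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))


dist = [[0, 1, 9, 9],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [9, 9, 2, 0]]
print(agglomerate(dist, nb_clusters=2))
print(agglomerate(dist, nb_clusters=1, max_internal_dist=1.5))
```

With `nb_clusters=2` the sketch stops once two clusters remain. With `max_internal_dist=1.5` it stops earlier, leaving samples 2 and 3 unmerged because joining them would give an internal distance of 2.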