tslearn-team/tslearn

N-dimensional features issue in the method

zandarina1 opened this issue · 2 comments

Hello all,

I want to use two dimensions, i.e. two time series per participant. I transformed the data into the shape expected by the library:

(6431, 5, 2)

However, when I plot the result, both signals end up together in a single plot, so I am not sure whether the features are considered separately. What I want is, for example, that participant 1 with series A increasing and series B decreasing falls into cluster 1. Instead, the result looks the same as if the data were one-dimensional: plotting X_train[y_pred == yi,:,0] or X_train[y_pred == yi,:,1] separately does not make sense either, and the cluster centers are the same for both series/dimensions. How can I plot the clusters when I have two dimensions, and make the clustering differentiate between dimensions? It would also be great to have an example with multiple dimensions in addition to the nice examples in the tutorial. Thanks!

for yi in range(N_CLUSTERS):
    plt.subplot(2, 3, 1 + yi)
    for xx in X_train[y_pred == yi,:,1]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi,:,1].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85,'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("DTW $k$-means")

Hello @zandarina1,
I think your problem comes from a misuse of numpy.ravel, which flattens NumPy arrays:
https://numpy.org/doc/stable/reference/generated/numpy.ravel.html#numpy.ravel
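To see why this merges your two signals into a single curve, here is a toy illustration (a made-up array, not your data):

import numpy as np

# A single multivariate series: 3 time steps, 2 dimensions
xx = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
print(xx.ravel())  # [0 1 2 3 4 5] -> the two dimensions are interleaved
                   # into one 1-D curve, so plotting it mixes both signals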

Taking inspiration from:
https://tslearn.readthedocs.io/en/stable/auto_examples/clustering/plot_kmeans.html#sphx-glr-auto-examples-clustering-plot-kmeans-py
I have written the following code:

import matplotlib.pyplot as plt
import numpy as np

from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance, \
    TimeSeriesResampler

seed = 0
np.random.seed(seed)
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
print(X_train.shape)  # (100, 275, 1)
# Build a bivariate dataset: the second dimension is the negated series
X_train = np.concatenate([X_train, -X_train], axis=2)
print(X_train.shape)  # (100, 275, 2)
X_train = X_train[y_train < 4]  # Keep first 3 classes
np.random.shuffle(X_train)
# Keep only 50 time series
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train[:50])
# Make time series shorter
X_train = TimeSeriesResampler(sz=40).fit_transform(X_train)
sz = X_train.shape[1]
print(sz)  # 40

# Soft-DTW k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(n_clusters=3,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(X_train)
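
# Sanity check (added for illustration): the fitted centers are multivariate
# too, with shape (n_clusters, sz, d), i.e. one center curve per dimension.
print(sdtw_km.cluster_centers_.shape)  # (3, 40, 2)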

# One row per dimension (di), one column per cluster (yi)
for yi in range(3):
    for di in range(2):
        plt.subplot(2, 3, 1 + yi + 3 * di)
        for xx in X_train[y_pred == yi]:
            plt.plot(xx[:, di], "k-", alpha=.2)
        plt.plot(sdtw_km.cluster_centers_[yi, :, di], "r-")
        plt.xlim(0, sz)
        plt.ylim(-4, 4)
        plt.text(0.05, 0.85, f"Cluster {yi + 1}, dim {di + 1}",
                 transform=plt.gca().transAxes)
        if yi == 1 and di == 0:
            plt.title("Soft-DTW $k$-means")

plt.tight_layout()
plt.show()
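
As a side note, and as far as I understand tslearn's behaviour (worth double-checking against the docs), DTW and soft-DTW treat multivariate series jointly: time steps are aligned using the distance over all dimensions at once, so the clustering does take both of your signals into account rather than handling them independently. A minimal check that the metric accepts multivariate input:

from tslearn.metrics import dtw

# dtw accepts series of shape (sz, d); the alignment is computed jointly
# over both dimensions, not per dimension
print(dtw(X_train[0], X_train[1]))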

Does it correspond to what you would like to do?