tslearn-team/tslearn

i don't think dtw_path_from_metric uses distance metrics correctly to calculate the path or similarity

Waponiwoo opened this issue · 2 comments

Describe the bug
by using dist_mat = pairwise_distances(s1, s2, metric=metric, **kwds) to pre calculate a distance matrix value by value, any metric that needs the full vector for distance calculation comes out wrong. for example Euclidean: will return a distance matrix of sqrt(s1[i]-s2[j])**2, or essentially |s1[i]-s2[j]|. later when used to calculate the path in njit_accumulated_matrix_from_dist_matrix we are taking the minimum of the sum of absolute values instead of the sums of the square distance at each step.
in the normal dtw function it works fine because the distance for each step is _local_squared_dist(s1[i], s2[j]) giving the square if each difference for the correct proportional weighting to find the path. we are only taking the sqrt at the end for the correct similarity score in njit_dtw.

To Reproduce
run the same s1, s2 with Euclidean through dtw and dtw_path_from_metric. the similarity score will be different. dtw will return the actual Euclidean distance, the from_metric will return the sum of absolute value point distances. i think the path can be different with some s1, s2s.

Expected behavior
using the metric to get the real cost each step. i can't think of a great way to do it, but one way would be to call the pairwise_distances(s1_partial, s2_partial, metric=metric, **kwds) for all three options of cum_sum[i + 1, j + 1] for each step and take the min. for many metrics that would require backtracking the best path to each of the three options however.

Environment (please complete the following information):

  • OS: [e.g. iOS]
  • tslearn version [e.g. 0.2.2]

Additional context
Add any other context about the problem here.

Hi,

The metric provided to tslearn.metrics.dtw_path_from_metric is used to compare pairs of elements of both time series not to compute overall distance between both (partial) time series. The score returned by the function is the sum of the metric along the path, not a comparison of both time series using the metric. Cf the doc of the function:

Similarity is computed as the cumulative cost along the aligned time series.

To obtain the same path as tslearn.metrics.dtw you can simply use metric = "sqeuclidean", both paths will be the same and the returned score will be the square of the one obtained with regular dtw (sum of squared L2 norm along the path).

For example:

import numpy as np
from tslearn.metrics import dtw_path, dtw_path_from_metric
np.random.seed(0)
s1, s2 = np.random.random(1000), np.random.random(1000)
path_dtw, score_dtw = dtw_path(s1, s2)
path_metric, score_metric = dtw_path_from_metric(s1, s2, metric="sqeuclidean")
print(path_metric == path_dtw)
print(np.sqrt(score_metric) == score_dtw)

Yields:

True
True

This is mentioned in the doc of the function, in the Note section:

By using a squared euclidean distance metric as shown above, the output path is the same as the one obtained by using dtw_path but the similarity score is the sum of squared distances instead of the euclidean distance.

I hope this helps :)

@Waponiwoo let us know if @rfayat 's answer is not sufficient (it seems to me that it is, but...)