baraline/convst

Shapelet and TS extraction

nirojasva opened this issue · 1 comments

Hello,

I'm trying to use the RDST implementation for a TSC task, and I'm interested in the interpretability of the method. Could you confirm the correct way to extract the shapelets of the model and the time series they were generated from?

Thanks in advance,

Hi, thank you for your interest in the method !

If you have an R_DST_Ridge object fitted on your problem, you can use the RDST_Ridge_interpreter class and its visualize_best_shapelets_one_class method to visualize the "best" shapelets for each class. Note that this only works for univariate data for now.
The notion of "best" is determined by the coefficients of the Ridge classifier: if the coefficients linked to the features of a shapelet are high, the shapelet that generated those features is important for discriminating the class.
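To make the notion of "best" concrete, here is a minimal sketch on made-up numbers, not convst code. It assumes, as the "divide by 3" step later in this reply does, that each shapelet yields 3 consecutive features, so feature j belongs to shapelet j // 3. The helper `rank_shapelets` and the array `coefs_one_class` are hypothetical names used only for illustration.

```python
import numpy as np

def rank_shapelets(coefs_one_class, n_features_per_shapelet=3):
    """Rank shapelets by the largest coefficient among their features.

    coefs_one_class: 1D array of length n_shapelets * n_features_per_shapelet,
    laid out so that feature j belongs to shapelet j // n_features_per_shapelet
    (an assumption taken from this thread).
    """
    coefs_one_class = np.asarray(coefs_one_class, dtype=float)
    # One row per shapelet, one column per feature of that shapelet
    per_shapelet = coefs_one_class.reshape(-1, n_features_per_shapelet)
    scores = per_shapelet.max(axis=1)
    # Highest score first; a stable sort keeps ties in index order
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Made-up coefficients for 3 shapelets (9 features)
coefs_one_class = np.array([0.1, -0.2, 0.0,   # shapelet 0
                            0.9, 0.3, -0.5,   # shapelet 1
                            0.2, 0.4, 0.1])   # shapelet 2
order, scores = rank_shapelets(coefs_one_class)
print(order)  # shapelet ids, most important first
```

Scoring a shapelet by its largest coefficient is one possible choice; summing or averaging the three coefficients would give a different ranking.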

Concerning the time series that generate a shapelet: unfortunately, I do not store the ID of the time series from which each shapelet is extracted.
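Because the origin is not stored, one possible workaround is to search the training set for the series and position where a shapelet matches best. The sketch below is not part of convst: `best_match` is a hypothetical helper, it uses a plain (non-normalized) Euclidean distance over dilated subsequences, and it assumes univariate series stored as a 2D array of shape (n_series, n_timestamps). For a shapelet that was sampled without normalization, the series it came from should give a distance close to 0, although other series may tie.

```python
import numpy as np

def best_match(X, shapelet, dilation=1):
    """Return (series_index, start_position, distance) of the closest match.

    X: array of shape (n_series, n_timestamps), univariate series.
    shapelet: 1D array of shapelet values.
    dilation: gap between consecutive shapelet points in the series.
    """
    X = np.asarray(X, dtype=float)
    shapelet = np.asarray(shapelet, dtype=float)
    # Number of timestamps covered by the dilated shapelet
    span = (len(shapelet) - 1) * dilation + 1
    n_starts = X.shape[1] - span + 1
    if n_starts < 1:
        raise ValueError("shapelet span exceeds series length")
    offsets = np.arange(len(shapelet)) * dilation
    best = (-1, -1, np.inf)
    for i, series in enumerate(X):
        for start in range(n_starts):
            dist = np.linalg.norm(series[start + offsets] - shapelet)
            if dist < best[2]:
                best = (i, start, dist)
    return best

# Toy data: the shapelet is cut from series 1, start 2, with dilation 2
X = np.array([[0., 1., 0., 1., 0., 1., 0., 1.],
              [5., 4., 3., 9., 1., 7., 2., 8.]])
shapelet = X[1, 2:7:2]  # values at positions 2, 4, 6
print(best_match(X, shapelet, dilation=2))
```

The double loop is kept simple for readability; it will be slow on large datasets.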

Now, if you want to extract the shapelets and do some analysis yourself, you can take inspiration from the visualize_best_shapelets_one_class method. I have not tested the following, so it might throw an error, but it shows the idea:

import numpy as np
from convst.classifiers import R_DST_Ridge

your_classifier = R_DST_Ridge().fit(X, y)
...
# Extract the coefficients linked to the 3 x n_shapelets features extracted
coefs = your_classifier.classifier['ridgeclassifiercv'].coef_
n_classes = coefs.shape[0]

# Trick for the binary classification case: the ridge classifier only stores coefs for class 1.
if n_classes == 1:
    coefs = np.append(-coefs, coefs, axis=0)

"""
All the information about shapelets is stored in `your_classifier.transformer.shapelets_`,
which is a list storing the following items:
For univariate time series :
        values (-> your_classifier.transformer.shapelets_[0])
        lengths (-> your_classifier.transformer.shapelets_[1])
        dilations (-> your_classifier.transformer.shapelets_[2])
        threshold (-> your_classifier.transformer.shapelets_[3])
        normalize (-> your_classifier.transformer.shapelets_[4])

For multivariate time series :
        values (-> your_classifier.transformer.shapelets_[0])
        lengths (-> your_classifier.transformer.shapelets_[1])
        dilations (-> your_classifier.transformer.shapelets_[2])
        threshold (-> your_classifier.transformer.shapelets_[3])
        normalize (-> your_classifier.transformer.shapelets_[4])
        n_channels  (-> your_classifier.transformer.shapelets_[5])
        channel_mask  (-> your_classifier.transformer.shapelets_[6])
"""
# Define the number of shapelets you want to extract, and for which class
n_shp = 1
class_id = 0

# Initialize an array of size 3 * n_shapelets
coefs_ = np.zeros(your_classifier.transformer.shapelets_[1].shape[0] * 3)
# usefull_atts is a boolean mask of the attributes with non-zero std (i.e. those that are not constant during fit)
coefs_[your_classifier.classifier['c_standardscaler'].usefull_atts] = coefs[class_id]

# Sort indexes of the features from best(max) to worst(min) coef values
# divide by 3 to get the id of the shapelet that generated each feature
idx = (coefs_.argsort()//3)[::-1]
# Extract the ids of the best shapelets, skipping duplicates
shp_ids = []
i = 0
while len(shp_ids) < n_shp and i < idx.shape[0]:
    if idx[i] not in shp_ids:
        shp_ids.append(idx[i])
    i += 1
# Now you have the ids of the best shapelets in shp_ids
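Once `shp_ids` is known, the matching entries can be read from `your_classifier.transformer.shapelets_`. The sketch below uses stand-in arrays rather than a fitted model. It assumes that `values` is a 2D array padded to the longest shapelet, with the true length given by `lengths`. That layout is an assumption, so check the shapes on your own fitted model first. `get_shapelet` is a hypothetical helper, not a convst function.

```python
import numpy as np

def get_shapelet(shapelets, shp_id):
    """Return (values, dilation, time_offsets) for one univariate shapelet.

    shapelets: sequence laid out as described in the docstring above:
    (values, lengths, dilations, threshold, normalize).
    Assumes values has shape (n_shapelets, max_length), padded after
    the first lengths[shp_id] entries.
    """
    values, lengths, dilations = shapelets[0], shapelets[1], shapelets[2]
    length = int(lengths[shp_id])
    dilation = int(dilations[shp_id])
    vals = np.asarray(values[shp_id][:length])
    # With dilation d, consecutive shapelet points are d timestamps apart
    offsets = np.arange(length) * dilation
    return vals, dilation, offsets

# Stand-in for your_classifier.transformer.shapelets_ (made-up numbers)
fake_shapelets = (
    np.array([[1., 2., 3., 0., 0.], [4., 5., 6., 7., 8.]]),  # values (padded)
    np.array([3, 5]),          # lengths
    np.array([2, 1]),          # dilations
    np.array([0.5, 0.1]),      # threshold
    np.array([False, True]),   # normalize
)
vals, dilation, offsets = get_shapelet(fake_shapelets, 0)
print(vals, dilation, offsets)
```

The returned offsets can serve as the x-axis when plotting a shapelet on top of a time series, shifted by the position where it matches.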

I strongly recommend looking at the rdst_interpreter.py file to see how you can extract the information you need from the RDST transformer/classifier. Some things might not be trivial to extract because of the structure I had to adopt to comply with the input types numba expects.

Don't hesitate to ask more questions if needed.