Self-Supervised Representations of Geolocated Weather Time Series - an Evaluation and Analysis

Author: Arjun Ashok
Mentors: Jitendra Singh, Devyani Lambhate

This repository contains the code for multivariate time series representation learning based on two self-supervised learning methods, and the code for finetuning/adapting the representations for a downstream task.

The supported methods are:

TS2Vec
CoST

The supported downstream tasks are:

Multi-horizon forecasting, both univariate and multivariate
Regression
Classification
Anomaly Detection

Requirements

The recommended requirements for TS2Vec are specified as follows:

Python 3.8
torch==1.8.1
scipy==1.6.1
numpy==1.19.2
pandas==1.0.1
scikit_learn==0.24.2
statsmodels==0.12.2
Bottleneck==1.3.2

Install the dependencies with:

pip install -r requirements.txt

Data

Obtain the datasets and place them in the datasets/ folder as follows:

The 128 UCR datasets should be placed in datasets/UCR/ so that each data file can be located at datasets/UCR/<dataset_name>/<dataset_name>_*.csv.
The 30 UEA datasets should be placed in datasets/UEA/ so that each data file can be located at datasets/UEA/<dataset_name>/<dataset_name>_*.arff.
The 3 ETT datasets should be placed at datasets/ETTh1.csv, datasets/ETTh2.csv and datasets/ETTm1.csv.
The Electricity dataset should be preprocessed using datasets/preprocess_electricity.py and placed at datasets/electricity.csv.
The Yahoo dataset should be preprocessed using datasets/preprocess_yahoo.py and placed at datasets/yahoo.pkl.
The KPI dataset should be preprocessed using datasets/preprocess_kpi.py and placed at datasets/kpi.pkl.
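
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_datasets` is hypothetical, and it only checks that the top-level paths listed above exist, not that their contents are valid.

```python
from pathlib import Path

# Expected locations under the datasets/ folder, taken from the list above.
# UCR and UEA are directories; the remaining entries are single files.
EXPECTED_PATHS = [
    "UCR",
    "UEA",
    "ETTh1.csv",
    "ETTh2.csv",
    "ETTm1.csv",
    "electricity.csv",
    "yahoo.pkl",
    "kpi.pkl",
]


def missing_datasets(root="datasets"):
    """Return the expected dataset paths that do not exist under `root`.

    Hypothetical helper for illustration; it does not inspect file contents.
    """
    root = Path(root)
    return [name for name in EXPECTED_PATHS if not (root / name).exists()]


# Report anything that still needs to be downloaded or preprocessed.
for name in missing_datasets():
    print(f"missing: datasets/{name}")
```

Only the datasets needed for a given experiment have to be present, so a non-empty result is not necessarily an error.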

Usage

To train and evaluate TS2Vec on a dataset, run the following command:

python train.py <dataset_name> <run_name> --loader <loader> --batch-size <batch_size> --repr-dims <repr_dims> --gpu <gpu> --eval

The arguments are described below:

Parameter name	Description of parameter
dataset_name	The dataset name
run_name	The folder name used to save the model, output and evaluation metrics. Any name can be used
loader	The data loader used to load the experimental data. This can be set to `UCR`, `UEA`, `forecast_csv`, `forecast_csv_univar`, `anomaly`, or `anomaly_coldstart`
batch_size	The batch size (defaults to 8)
repr_dims	The representation dimensions (defaults to 320)
gpu	The index of the GPU used for training and inference (defaults to 0)
eval	Whether to perform evaluation after training
method	The method to use (either `cost` or `ts2vec`)

(For descriptions of more arguments, run python train.py -h.)
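
When launching several runs from Python, the command above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_train_command` is a hypothetical helper, it uses only the flags shown in the command above with the defaults listed in the table, and the dataset and run names are illustrative.

```python
import shlex


def build_train_command(dataset_name, run_name, loader,
                        batch_size=8, repr_dims=320, gpu=0, evaluate=True):
    """Build the argument list for train.py (hypothetical helper).

    Only the flags shown in the usage command are emitted; defaults
    mirror the values listed in the argument table.
    """
    cmd = [
        "python", "train.py", dataset_name, run_name,
        "--loader", loader,
        "--batch-size", str(batch_size),
        "--repr-dims", str(repr_dims),
        "--gpu", str(gpu),
    ]
    if evaluate:
        cmd.append("--eval")
    return cmd


# Illustrative values: the ECG200 dataset with the UCR loader.
cmd = build_train_command("ECG200", "demo_run", loader="UCR")
print(shlex.join(cmd))
# python train.py ECG200 demo_run --loader UCR --batch-size 8 --repr-dims 320 --gpu 0 --eval
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.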

After training and evaluation, the trained encoder, output and evaluation metrics can be found in training/DatasetName__RunName_Date_Time/.
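
Since each run gets its own folder, a small helper can locate the most recent one for a given dataset and run name. This is a sketch only: `latest_run_dir` is a hypothetical name, and because the exact date and time format is not specified here, it matches on the `DatasetName__RunName_` prefix and picks the newest folder by modification time.

```python
from pathlib import Path


def latest_run_dir(dataset_name, run_name, root="training"):
    """Return the most recently modified run folder, or None if there is none.

    Hypothetical helper: matches folders named <dataset>__<run>_<anything>
    under `root` without assuming a particular date/time format.
    """
    pattern = f"{dataset_name}__{run_name}_*"
    candidates = [p for p in Path(root).glob(pattern) if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Illustrative lookup; prints None if no matching run exists yet.
print(latest_run_dir("ECG200", "demo_run"))
```

Note that a run name which is a prefix of another (for example `demo` and `demo_v2`) would match both, so distinct run names are safer.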

Scripts: The scripts for reproducing the experiments are provided in the scripts/ folder.

Code Example

An example with TS2Vec as the method is given below:

from ts2vec import TS2Vec
import datautils

# Load the ECG200 dataset from UCR archive
train_data, train_labels, test_data, test_labels = datautils.load_UCR('ECG200')
# (Both train_data and test_data have a shape of n_instances x n_timestamps x n_features)

# Train a TS2Vec model
model = TS2Vec(
    input_dims=1,
    device=0,
    output_dims=320
)
loss_log = model.fit(
    train_data,
    verbose=True
)

# Compute timestamp-level representations for test set
test_repr = model.encode(test_data)  # n_instances x n_timestamps x output_dims

# Compute instance-level representations for test set
test_repr = model.encode(test_data, encoding_window='full_series')  # n_instances x output_dims

# Sliding inference for test set
test_repr = model.encode(
    test_data,
    casual=True,
    sliding_length=1,
    sliding_padding=50
)  # n_instances x n_timestamps x output_dims
# (The timestamp t's representation vector is computed using the observations located in [t-50, t])
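
To make the sliding-inference comment concrete, the sketch below lists which observations are visible at a given timestamp when sliding_padding=50. It is an illustration of the [t-50, t] window only, not the library's implementation: `causal_window` is a hypothetical helper, and clipping the start at index 0 simply reflects that no observations exist before the series begins (how the encoder pads those positions is not covered here).

```python
def causal_window(t, sliding_padding=50):
    """Inclusive index range [start, t] of observations visible at timestamp t.

    Hypothetical helper: the window looks back `sliding_padding` steps and
    never looks ahead of t; the start is clipped at the first observation.
    """
    start = max(0, t - sliding_padding)
    return start, t


# Early timestamps see a shorter history; later ones see the full 51 steps.
for t in (0, 10, 50, 120):
    start, end = causal_window(t)
    print(f"t={t}: observations {start}..{end} ({end - start + 1} steps)")
```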

Acknowledgements

This codebase uses parts of code from the following repositories:

TS2Vec - https://github.com/yuezhihan/ts2vec
CoST - https://github.com/salesforce/CoST