Self-Supervised Contrastive Learning for Medical Time Series: A Systematic Review

This repository accompanies the survey paper "Self-Supervised Contrastive Learning for Medical Time Series: A Systematic Review", published in Sensors in 2023.

Authors: Ziyu Liu (ziyu.liu2@student.rmit.edu.au), Azadeh Alavi (azadeh.alavi@rmit.edu.au), Minyi Li (liminyi0709@gmail.com) and Xiang Zhang (xiang.zhang@uncc.edu)

Paper link: https://doi.org/10.3390/s23094221

Summary:

We carefully reviewed 43 papers in the field of self-supervised contrastive learning for medical time series. Specifically, the paper outlines the pipeline of contrastive learning, including pre-training, fine-tuning, and testing. We provide a comprehensive summary of the various augmentations applied to medical time series data, the architectures of pre-training encoders, the types of fine-tuning classifiers and clustering methods, and the popular contrastive loss functions. Moreover, we present an overview of the different data types used in medical time series, highlight the medical applications of interest, and provide a comprehensive table of 51 public datasets that have been utilized in this field. We also discuss promising future directions, such as providing guidance for effective augmentation design, developing a unified framework for analyzing hierarchical time series, and investigating methods for processing multimodal data. Despite being in its early stages, self-supervised contrastive learning has shown great potential in overcoming the need for expert-created annotations in research on medical time series.
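
To make the pre-training step of the pipeline concrete, here is a minimal NumPy sketch of NT-Xent, one of the contrastive losses that appears in the reviewed papers. It is an illustration only, not code from the paper or this repo; the function name `nt_xent_loss`, the default `temperature`, and the toy embeddings are assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss for N positive pairs (z1[i], z2[i]); all other samples act as negatives."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2N, d) embeddings of both views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise -> cosine similarity
    sim = z @ z.T / temperature                         # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # a sample is never its own negative
    # Index of the positive partner for every row: i <-> i + N
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos_idx] - log_denom
    return float(-log_prob_pos.mean())

# Toy usage: four orthogonal embeddings, once correctly paired and once mis-paired.
views = np.eye(4)
loss_aligned = nt_xent_loss(views, views)
loss_shuffled = nt_xent_loss(views, np.roll(views, 1, axis=0))
```

In this toy case the correctly paired views give a lower loss than the mis-paired ones, which is the behaviour the encoder is trained to exploit during pre-training.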

This repo includes:

The implementation of time series augmentations (Timeseries_augmentations.ipynb). This notebook augments time series data at the sample level. We will release code for batch-level and dataset-level augmentation later.
An extended summary table of the 43 reviewed papers, including title, author/year, challenges, contributions, scenario/task/findings, datasets, preprocessing/perturbation, model, performance, and a link to the implementation code (if publicly released).
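
As a rough illustration of what sample-level augmentation means, the sketch below applies a few transformations named in the summary table (noise addition, scaling, negation, temporal inversion, permutation) to a single sample of shape (channels, time). It is not the code in Timeseries_augmentations.ipynb; all function names, parameter defaults, and the toy signal are assumptions.

```python
import numpy as np

def jitter(x, rng, sigma=0.05):
    """Noise addition: add Gaussian noise to every time step."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, rng, sigma=0.1):
    """Scaling: multiply each channel by a random factor drawn around 1."""
    factors = rng.normal(1.0, sigma, size=(x.shape[0], 1))
    return x * factors

def negate(x):
    """Negation: flip the sign of the signal."""
    return -x

def time_flip(x):
    """Temporal inversion: reverse the time axis."""
    return x[:, ::-1].copy()

def permute_segments(x, rng, n_segments=4):
    """Permutation: cut the time axis into segments and shuffle their order."""
    segments = np.array_split(x, n_segments, axis=1)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order], axis=1)

# Toy usage: build two augmented views of one 2-channel sample.
rng = np.random.default_rng(0)
sample = np.sin(np.linspace(0, 4 * np.pi, 200))[None, :].repeat(2, axis=0)
view_a = jitter(scale(sample, rng), rng)
view_b = permute_segments(time_flip(sample), rng)
```

Two such views of the same sample would typically form a positive pair for contrastive pre-training; which transformations are physiologically sensible depends on the signal type.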

Citation

If you find this paper useful for your research, please consider citing it:

  @Article{liuself2023survey,
    AUTHOR = {Liu, Ziyu and Alavi, Azadeh and Li, Minyi and Zhang, Xiang},
    TITLE = {Self-Supervised Contrastive Learning for Medical Time Series: A Systematic Review},
    JOURNAL = {Sensors},
    VOLUME = {23},
    YEAR = {2023},
    NUMBER = {9},
    DOI = {10.3390/s23094221}
    }

Extended summary table:

Title	Author (Year)	Challenge	Contribution	Scenario/task/findings	Datasets	Preprocessing/perturbation	Model	Performance	Code
First Steps Towards Self-Supervised Pretraining of the 12-Lead ECG	Gedon et al. (2022)	Discover a supervision signal from the data itself for self-supervised representation learning	1) Define a self-supervised learning task and pretraining procedure which can learn generalizable features of ECG data, 2) Develop and show that a ResNet based architecture can successfully be used in combination with our learning task.	ECG reconstruction and (anomaly) classification; Pretraining on the CODE training dataset; Use transfer learning with the ECG benchmarks: PTB-XL and CPSC dataset	CODE, CPSC 2018, PTB-XL		U-ResNet: ResNet + encoder-decoder + channel-wise dense layer + U-Net based skip-connections. Downstream task (classification): encoder (no bottleneck layer, no U-Net skip connections) + linear classifier	AUC: CPSC: +PT: 0.954; PTB-XL: +PT: 0.919	-
Self-supervised representation learning from 12-lead ECG data	Mehari et al. (2022)	Label scarcity in ECG data	1. Comprehensive assessment of self-supervised representation learning for 12-lead ECG data to foster measurable progress. 2. Compare instance-based self-supervised methods and contrastive forecasting methods. 3. Modify the CPC architecture and training procedure for performance improvements. 4. Compare downstream classifiers fine-tuned from self-supervised models against training from scratch.	Assessment of self-supervised representation learning from clinical 12-lead ECG data: -data efficiency (downstream performance vs. number of folds used in fine-tuning); -quantitative performance (macro AUC); -robustness (influence of physiological noise on downstream performance)	Pretraining: CinC2020, Chapman, Ribeiro; Evaluation: PTB-XL		Modified CPC (4FC+2LSTM+2FC); Compared with: Supervised (4FC+2LSTM+2FC), Supervised (xresnet1d50), SimCLR (RRC, TO) (xresnet1d50), SimCLR physio. (xresnet1d50), BYOL (RRC, TO) (xresnet1d50), BYOL physio. (xresnet1d50) *Physiological noise, (RRC, TO) are transformations	Macro AUC (on PTB-XL): Modified CPC: Linear: 0.9272; fine-tuned: 0.9418	Link
Semi-Supervised Contrastive Learning for Generalizable Motor Imagery EEG Classification	Han et al. (2021)	Label scarcity in EEG data	1. A semi-supervised framework with a combination of self-supervised contrastive learning and adversarial training. 2. Semi-supervised learning structure with contrastive learning for unlabelled data. 3. Adversarial training to disentangle the subject/session-specific information from the desired MI information in the latent representation.		BCIC IV 2a MI-EEG dataset from the MOABB library	Filtered between 4 Hz and 40 Hz and converted into microvolts. Used all 22 channels of the EEG and the entire 4 seconds of the trial windows. The EEG windows were then resampled from 250 Hz to 128 Hz, resulting in a length of 512 sample points for each window, and processed through channel-wise z-score normalisation.	Augmentation-based contrastive loss + task classification loss + domain discriminator loss; EEGNet, DeepConvNet as the encoder	Semi-Deep ConvNet: 10%: 67.6 20%: 74.3 50%: 77.4 100%: 79.4	-
Self-Supervised Representation Learning from Electroencephalography Signals	Banville et al. (2019)	Supervised models are limited by the cost - and sometimes the impracticality - of data collection and labeling	1. Propose self-supervised strategies to learn end-to-end features from unlabeled time series such as EEG. 2. Two temporal contrastive learning tasks referred to as “relative positioning” and “temporal shuffling”. 3. On a downstream sleep staging task, outperform traditional unsupervised and purely supervised approaches, specifically in low-data regimes.	Demonstrate that contrastive learning tasks based on predicting whether time windows are close in time can be used to learn EEG features that capture multiple components of the structure underlying the data (time windows close in time should share the same label)	Sleep EDF, MASS session3	Both: raw EEG -> 4th-order FIR lowpass filter (20-Hz cutoff frequency and Hamming window) MASS: downsampled to 128 Hz, extracted non-overlapping 30-s windows, windows were normalized (to focus on Fz, Cz and Oz channels.)	Pre-train: sample pairs of time windows (RP: x_t,x_t'; TS: triplets: x_t, x_t', x_t'') + feature extractor (CNN) + contrastive module that aggregates the feature representation of each window (element-wise absolute difference) Fine-tuning: feature extractor (CNN) + linear context discriminative model	Average per-class recall: RP: 76.66 TS: 75.9 EEG features: 79.43 Fully supervised: 72.51	-
Anomaly Detection on Electroencephalography with Self-supervised Learning	Xu et al. (2020)	1. Hand-crafted features could omit potentially discriminative features; 2. Labeling EEG signals of epileptic seizure states has become a bottleneck in applying deep learning; 3. Individual differences among patients with epilepsy, and certain abnormal brain activities shared with other brain diseases, make generalization difficult	1. A new self-supervised learning method based on only normal EEG data is proposed particularly for detection of any abnormal signal in EEG data. 2. A simple and effective method is proposed to generate the self-labeled data for self-supervised learning, in which different labels correspond to different scaling transformations on EEG data. 3. Performs significantly better than existing well-known anomaly detection approaches, and is robust to varying model structures and hyperparameter settings.	Higher-frequency signals in abnormal EEG data would probably mislead the classifier into predicting an incorrect scaling transformation.	UPenn and Mayo Clinic's seizure detection challenge dataset	Generation of self-labeled EEG data: Each sequence of EEG data matrix X_i -> K scaling transformations -> a longer sequence s_k *d (number of values in the original sequence) -> for each scale s_k, all the formed new sequences are collected to form a new scaled EEG data T_k(X_i)	CNN classifier for prediction of scaling transformations: Input: self-labeled dataset; Output: K values, each representing the probability of one scaling transformation. Cross entropy -> classifier output, ground truth scaling transformation (one-hot vector) Anomaly detection: Difference between predicted scaling and ground truth scaling indicates the degree of abnormality of new EEG	AUC: ResNet34: 0.941 Ablation study on kernel shapes (this paper proposed 33 compared to 13): Backbone: ResNet34: 0.943 VGG19: 0.960	-
Contrastive Representation Learning for Electroencephalogram Classification	Mohsenvand et al. (2020)	Hand-crafted features; deep learning in a supervised manner restricts the use of learned features to a specific task; labeling EEG is cumbersome and requires years of medical training and experimental design; labeled EEG data is limited and existing datasets are small; existing datasets use incompatible EEG setups (different number of channels, sampling rates, types of sensors, etc.) and are hard to fuse into a larger dataset for unsupervised learning.	1. Combine multiple EEG datasets, 2. Use the underlying physics of EEG signals to multiply the number of samples (quadratic increase), 3. Learn representations in a self-supervised manner via contrastive learning without requiring labels.	1. Emotion recognition. 2. Normal/abnormal classification. 3. Sleep-stage scoring.	1. SEED dataset (ER) 2. TUH dataset (NAC) 3. SleepEDF (SSS)	Channel recombination: By subtracting two channels, one obtains a new channel that represents the voltage difference between the two sensors, resulting in another physiologically valid channel. Preprocessing: resampled all datasets to 200 Hz and applied a fifth-order band-pass Butterworth filter (0.3-80 Hz). Removed the channels that involved voltages higher than 500 μV as they normally represent artifacts. To train the encoder, cut the channels into chunks of 20 seconds	Channel augmenter: for each channel, randomly applies two of the augmentations to form a positive pair. Channel encoder: recurrent encoder, convolutional encoder. Projector: downsampling and bidirectional LSTM units; each direction's output is concatenated and fed into dense layers with a ReLU activation in between. Contrastive loss: NT-Xent. Downstream tasks: Classifier: discard the projector and use a classifier almost identical to the projector	fine-tuned SeqCLR: 1. (C) 50%: 85.21 2. (C) 50%: 87.21 3. (R) 50%: 83.72	-
Forecasting adverse surgical events using self-supervised transfer learning for physiological signals	Chen et al. (2021)	1. Availability of training data: many institutions lack sufficient data or computational resources. 2. Patient privacy considerations mean that large public EHR datasets are unlikely, leaving many institutions with insufficient resources to train performant models on their own.	Improves predictive accuracy by leveraging deep learning to embed physiological signals. Using LSTMs, embeds physiological signals prior to forecasting adverse events with a downstream model. Shares models rather than data to address data insufficiency and improves over alternative methods. By transferring performant models, as has been done in medical images and clinical text, scientists can collaborate to improve the accuracy of predictive models without exposing patient data.	Utilize fifteen physiological signal variables and six static variable inputs to forecast six possible outcomes: hypoxemia, hypocapnia, hypotension, hypertension, phenylephrine administration, and epinephrine administration.	Two OR datasets (private), ICU dataset (MIMIC dataset)		LSTM for representation learning, followed by fully connected layer as downstream predictor. Use observations in previous 1 hour to predict next 5 mins.	-	Link
T-DPSOM: An Interpretable Clustering Method for Unsupervised Learning of Patient Health States	Manduchi et al. (2021)	Traditional clustering methods have poor performance on high-dimensional datasets -> dimensionality reduction and feature transformation to obtain a low-dimensional representation of the raw data (easier to cluster) -> cluster features lie in a latent space and cannot be easily visualized or interpreted, nor can the relationship between clusters be investigated. Self-Organizing Map is a clustering method that provides such an interpretable representation.	1. A deep clustering architecture combines a VAE with a novel SOM-based clustering objective. 2. An extension of this architecture to time series, improving clustering performance, enabling temporal forecasting. 3. Showing superior performance on static image data and medical time series (ICU). 4. Cluster patients into different sub-phenotypes and gain better understanding of disease patterns and individual patient health states.	A useful tool to understand and track patient health states in the ICU.	MNIST, Fashion-MNIST, eICU dataset	For eICU: use vital sign(d=14) and lab measurements(d=84) resampled to a 1-hour based grid using forward filling with population statistics from training set if no measurements were available prior to the time point. From ICU stays: 3 days<include< 30days, or has gap in continuous vital sign monitoring. Overall data dimension d=98. The last 72 hours of multivariate time series were used for the experiments. As labels, use a variant of the current dynamic APACHE.	A data point 𝑥𝑖 is mapped to a continuous embedding 𝑧𝑖 using a VAE. In T-DPSOM, the embeddings 𝑧𝑖,𝑡 for 𝑡 = 1,...,𝑇 are connected by an LSTM, which predicts the embedding 𝑧𝑡+1 of the next time step.	clustering NMI: 0.1115 +-0.0006	Link
CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients	Kiyasseh et al. (2021)		1. Propose a family of patient-specific contrastive learning methods that exploit both temporal and spatial information present in ECG signals. 2. Outperforms state-of-the-art methods, BYOL and SimCLR, when performing a linear evaluation of, and fine-tuning on, downstream tasks involving cardiac arrhythmia classification.	Downstream task: cardiac arrhythmia classification. Exploits human physiology: abrupt changes in cardiac function (on the order of seconds) are unlikely to occur, and multiple leads (collected at the same time) reflect the same underlying cardiac function.	PhysioNet 2020, Chapman, PhysioNet 2017, Cardiology	Gaussian, Flip, SpecAugment	Pre-train: Contrastive Multi-segment Coding; Contrastive Multi-lead Coding; Contrastive Multi-segment Multi-lead Coding Downstream task: 1) Linear evaluation of representations (pre-train, fine-tune same dataset); 2) Transfer capabilities of representations (pre-train, fine-tune different dataset)	AUC: 1) CMSC (Chapman): 0.896+-0.005 CMSC (PhysioNet2020): 0.715+-0.033 2) CMSC (Chapman+PhysioNet2020): 0.83+-0.002 CMSC (PhysioNet2020+Chapman): 0.932+-0.008 CMSMLC (PhysioNet2020+PhysioNet2017): 0.774+-0.012	Link
Segment Origin Prediction: A Self-supervised Learning Method for Electrocardiogram Arrhythmia Classification	Luo et al. (2021)	1. Lack of well-annotated labels, 2. Compared to random weight initialization, pre-trained model weights can help to alleviate overfitting	Develop a new augmentation: reorganization.	Single-lead ECG classification: heart arrhythmia detection	PhysioNet2017, CPSC2018	Discrete wavelet transform (DWT) for denoising	One framework with 6 different methods as encoder structure. Innovation: a new augmentation (reorganization). Take two ECG segments/peaks from a pool of segments: if the two segments are from the same recording, assign pseudo label 1; otherwise, assign pseudo label 0. A classifier for pseudo label prediction serves as the supervision signal for pretraining.	PhysioNet2017 for pre-training; CPSC2018 for fine-tuning/testing. F1 score: 0.875	-
Learning Unsupervised Representations for ICU Timeseries	Weatherhead et al. (2022)	1. Lack of labels in ICU time series 2. Alleviate the effect of severe data imbalance	1. Improved TNC model by using autocorrelation encoding-based neighborhood defining. 2. Overcame the negative sampling bias, i.e., the selected negative sample (far away from the target sample) could have the same label as the target sample	ICU scenarios: mortality, diagnostic groups, circulatory failure, cardiopulmonary arrest	HiRID dataset (public), High-frequency ICU (private)		Based on TNC: neighboring samples are regarded as positive, otherwise negative. The neighborhood is calculated/defined by autocorrelation encoding (based on Pearson correlation)	F1 score: 0.59 in HiRID mortality; 0.61 in diagnostic group, 0.56 in circulatory failure, 0.77 in cardiopulmonary arrest	-
CROCS: Clustering and Retrieval of Cardiac Signals Based on Patient Disease Class, Sex, and Age	Kiyasseh et al. (2021)	Given a large, unlabelled clinical database, 1. How do we extract attribute information from such unlabelled instances? 2. How do we reliably search for and retrieve relevant instances?	1. A supervised contrastive learning framework, attracts representations of cardiac signals associated with a unique set of patient attributes to embeddings, entitled clinical prototypes. 2. Outperforms DTC, in the clustering setting and retrieves relevant cardiac signals from a large database. At the same time, clinical prototypes adopt a semantically meaningful arrangement and thus confer a high degree of interpretability.	Clinical representation learning and clustering(setting 1), Clinical information retrieval(setting 2)	Chapman, PTB-XL	Chapman: cardiac arrhythmia labels-> group into 4 major classes PTB-XL: disease label -> group into 5 major classes. Each dataset contains patient sex and age information and is split, at the patient level, into training, validation, and test sets. Each time-series recording is split into non-overlapping segments of 2500 samples (≈ 5s in duration), as this is common for in-hospital recordings.	Supervised clustering. ResNet18	Clustering: 1. cardiac arrhythmia class attribute: CP CROCS( Chapman) acc: 90.3: CP CROCS (PTB-XL) acc: 76.0 2. Sex and age attributes: Chapman: CP CROCS(sex): 57.4; (age): 38.0 PTB-XL: CP CROCS(sex): 73.5; TP CROCS(age): 39.4 Retrieval: check paper	-
Self-Supervised Graph Neural Networks for Improved Electroencephalographic Seizure Analysis	Tang et al. (2022)	1. Representing non-Euclidean data structure in EEGs, 2. Accurately classifying rare seizure types, 3. Lacking a quantitative interpretability approach to measure model ability to localize seizures.	1. Representing the spatiotemporal dependencies in EEGs using a GNN and proposing two EEG graph structures that capture the electrode geometry or dynamic brain connectivity, 2. Proposing a self-supervised pre-training method that predicts preprocessed signals for the next time period to further improve model performance, particularly on rare seizure types, 3. Proposing a quantitative model interpretability approach to assess a model’s ability to localize seizures within EEGs.	Seizure detection and classification. Use self-supervised pre-training: predict future 12 seconds to learn task-agnostic representations and improve downstream task (detection and classification) performance	Temple University Hospital EEG Seizure Corpus (TUSZ), an in-house dataset	Transform raw EEG to the frequency domain, and obtain the log-amplitudes of the fast Fourier transform of raw EEG signals. Detection and self-supervised pre-training: use both seizure and non-seizure EEGs, obtain the 12-s (60-s) EEG clips using non-overlapping 12-s (60-s) sliding windows. Classification: use only seizure EEGs and obtain one 12-s (60-s) EEG clip from each seizure event (such that each EEG clip had exactly one seizure type), use a refined seizure classification scheme: four seizure classes in total.	Augmentation: a) randomly scaling, b) randomly reflecting the signals along the scalp midline. Distance graph: represents the natural geometry of EEG electrodes, compute edge weight by applying a thresholded Gaussian kernel to the pairwise Euclidean distance between electrodes. Correlation graph: captures dynamic brain connectivity, define the edge weight as the absolute value of the normalized cross-correlation between the preprocessed signals. Encoder: DCGRU (Diffusion Convolutional Gated Recurrent Units)	With pre-training: Seizure detection AUROC: Dist-DCRNN(12s): 0.866+-0.016 Dist-DCRNN(60s): 0.875+-0.016 Seizure classification weighted F1-score: 12s: Dist-DCRNN: 0.746+-0.024 60s: Corr-DCRNN: 0.749+-0.017 Dist-DCRNN: 0.749+-0.028	Link
Domain-guided Self-supervision of EEG Data Improves Downstream Classification Performance and Generalizability	Wagh et al. (2021)	Can we make encoders learn desirable physiological or pathological features through bespoke pretext tasks?	1. Propose SSL tasks for EEG based on the spatial similarity of brain activity, underlying behavioral states, and age-related differences; 2. Present evidence that an encoder pretrained using the proposed SSL tasks shows strong predictive performance on multiple downstream classifications; 3. Using two large EEG datasets, show that the encoder generalizes well to multiple EEG datasets during downstream evaluations.	Downstream tasks: EEG grade (normal, abnormal), eye state (eyes open, eyes closed), age (young, old), and gender (male, female) classification	TUH EEG Abnormal Corpus (TUAB), MPI LEMON	Pretext tasks: Hemispheric symmetry (HS): aug1-randomly flipping, aug2-add Gaussian noise; Behavioral state estimation (BSE): DBR-delta-beta power ratio (proxy measure of the subject's behavioral state); Age contrastive (AC): a triplet training tuple constructed from 3 EEG epochs: (X,X+,X_), similarity measured by Euclidean distance, triplet loss (same age group labeled similar).	Pre-training: represented the EEG epochs by 2D images (topographical map of the spectral power in a brain rhythm band) -> ResNet-18 backbone (feature extractor) -> three linear layers (projector) -> three SSL pretext task layers -> multi-task loss Fine-tuning: (ResNet-18 backbone -> linear layer) x 4 (four downstream tasks)	Binary classification (AUC): TUH: BSE only (EEG grade): 0.918(3e-4) LEMON: BSE-AC (Age): 0.987(1e-3) HS-BSE-AC (Gender): 0.803(8e-3)	Link
CLECG: A Novel Contrastive Learning Framework for Electrocardiogram Arrhythmia Classification	Chen et al. (2021)	Lack of annotations in ECG	Contrastive learning framework for ECG pre-training	Heart arrhythmia detection	PTB-XL for training, ICBEB2018 and PhysioNet 2017 for fine-tuning		Augmentation: Daubechies wavelet transform, random crop/drop. Encoder: xresnet101 backbone + MLP projection head	F1: 0.788 on PhysioNet2017; 0.942 on ICBEB2018	-
Self-Supervised Learning with Electrocardiogram Delineation for Arrhythmia Detection	Lee et al. (2021)	Lack of annotations in ECG	Propose a mixed schematic diagram by combining self-supervised representations and manually extracted features for ECG delineation	Heart arrhythmia detection	CPSC, PTB-XL, Shaoxing-Chapman		m-ResNet architecture	F1: With 10% labels, 69.18 for CPSC, 66.86 for PTB, 81.49 for Shaoxing	-
Towards Parkinson’s Disease Prognosis Using Self-Supervised Learning and Anomaly Detection	Jiang et al. (2021)	1. Not enough labels to detect PD 2. PD is a chronic disease that lasts a long time, so the positive samples can be very diverse because they are collected over a long period.	Frame PD detection as an anomaly detection task. Use contrastive learning to learn representations without supervision, then detect PD with an anomaly detection model.	PD detection	mPower data	Sensory signals are downsampled to 10% of the original sampling rate to reduce high-frequency noise	CPC for SSL pre-training, One-Class Deep SVDD for anomaly detection.	AUC: 67.3	-
Detection of maternal and fetal stress from the electrocardiogram with self-supervised representation learning	Sarkar et al. (2021)	DL's utility in non-invasive biometric monitoring during pregnancy not well studied	1. Validated the chronic stress exposure by psychological inventory, maternal hair cortisol and FSI (Fetal Stress Index). 2. Tested two variants of SSL architecture, one trained on the generic ECG features for emotional recognition obtained from public datasets and another transfer learned on private data. 3. Provides a novel source of physiological insights into complex multi‐modal relationships between different regulatory systems exposed to chronic stress.	Detection of maternal and fetal stress from abdominal ECG (the aECG was deconvoluted into fetal and maternal ECG: fECG, mECG)	AMIGOS, DREAMER, WESAD, SWELL, FELICITy (private)	Performed minimal pre-processing on the raw data. Resampled ECG signals to a sampling frequency of 256 Hz, segmentation into 10-s windows. To remove the noisy parts of aECG and mECG data, utilized the SQI values available with the segments; segments with SQI < 0.5 were discarded. This resulted in removing approximately 4.1% of total acquired data with a standard deviation.	Transformations: noise addition, scaling, negation, temporal inversion, permutation, time-warping 1. Signal transformation recognition network (pre-train) Transformed ECG -> three convolutional blocks, each consisting of two 1D convolution layers with ReLU and a max pooling layer -> global max pooling -> several fully connected layers 2. Affective recognition network (fine-tune) Raw ECG -> Frozen network -> flattening layer -> several FC layers -> classification task & regression tasks	Classification (Detection of stressed mothers): AUROC: FELICITy dataset: (mECG) 0.931 Public dataset (transfer learning: public+private dataset): (mECG) 0.982 Regression (Prediction of biomarkers): Public datasets: (mECG) Cortisol: 0.931; FSI: 0.946; PDQ: 0.961; PSS: 0.943.	Link
Self-supervised transfer learning of physiological representations from free-living wearable data	Spathis et al. (2021)	1. Label scarcity problem in wearable data; 2. Multimodal learning approaches rely on the modalities being used as parallel inputs, limiting the scope of the resulting representations.	1. The new pre-training task forecasts ECG-level quality HR in real-time by only utilizing activity signals, 2. Leverage the learned representations of this model to predict personalized health-related outcomes through transfer learning with linear classifiers.	Set HR responses as the supervisory signal for the activity data, predict personalized health-related outcomes	The Fenland study (not public, but can request)	Heart rate: noise removal. Accelerometer data: auto-calibrated to local gravity, non-wear time was inferred and participants with less than 72 hours of wear were removed. Magnitude of acceleration was calculated through the Euclidean Norm Minus One and the high-pass filtered vector magnitude. Both the accelerometry and ECG signals were summarized to a common time resolution of one observation per 15 seconds. Encoded the sensor timestamps using cyclical temporal features.	Input: X (sensors), M (metadata), y (target HR) Output: Ẽ (user-level embedding), ỹ (target variable) Network: pass X through CNN & GRU layers; pass M through ReLU layers; concatenate outputs in E; forecast & backpropagate with joint loss L; use trained network to extract embeddings E; aggregate E to the user-level Ẽ with average pooling; train a linear model to predict target variables ỹ; Downstream: traditional classifier	(A/R/T)= acceleration features/resting heart rate/temporal features outcome: sex AUC: step2heart (A/R/T): 93.4 outcome: height AUC: step2heart (A/T): 82.1	Link
Supervised and Self-Supervised Pretraining Based Covid-19 Detection Using Acoustic Breathing/Cough/Speech Signals	Chen et al. (2022)	The amount of COVID-19 audio data in each sub-task (breath/cough/speech) is still limited, and the traditional MFCC feature might not be sufficiently representative for classification tasks.	1. A supervised pre-training method: the model uses breath, cough and speech to train three different models and obtain an average model (used as an initialization model). 2. A self-supervised pre-training method: use the pre-trained wav2vec2.0 model to extract high-level features, which are input into the diagnosing model to replace the classic MFCC feature. 3. Ensemble the scores obtained by the two models	COVID-19 detection (binary classification)	DiCOVA-ICASSP 2022 challenge dataset	The amplitude of the raw waveform is normalized between -1 and 1, silent segments are cut off, sound data is downsampled to 16 kHz, forty-dimensional MFCC and delta-delta coefficients are extracted with a window of 25 msec audio samples and a hop of 10 msec. Use SpecAugment time-frequency mask to augment the data (due to small size of the training data)	Model: two bi-directional LSTM layers (encoder) + two linear transformations with a ReLU activation in between (classifier) Supervised pre-train: average model (average of three BiLSTM task models) as initialization of the classifier. Self-supervised pre-training: wav2vec2.0 model (raw waveform -> a CNN based encoder + a transformer encoder + a quantization module that discretizes the output of the feature encoder as targets in the contrastive objective.) Ensemble: train two models, ensemble the scores.	AUC: 88.44 on blind test in the fusion track	-
Contrastive Predictive Coding for Anomaly Detection of Fetal Health from the Cardiotocogram	de Vries et al. (2022)	Low availability of pathological data along with the high variability in pathologies and a scarcity of available labels	1. Extended the original CPC model by making stochastic, recurrent, and conditioned (upon uterine contractions) predictions, and a custom loss function. 2. Based on the detection of out-of-distribution behaviour and deviations from subject-specific behaviour, the proposed model is capable of achieving promising results for identification of suspicious and anomalous FHR events in the CTG.	Detection of fetal health from CTG * CTG provides a temporal recording of both the Fetal Heart Rate (FHR) and Uterine Contractions (UC)	Dutch STAN trial, a healthy dataset	Fetal heart rate signals and toco data were pre-processed to yield a constant sampling frequency of 4 Hz by means of linear interpolation and subsequently normalized using the mean and 98th percentile of the healthy dataset. Before normalization, the toco signal was filtered by a zero-phase, 4th-order Butterworth bandpass filter with cut-off frequencies at 0.001 and 0.1 Hz (to eliminate offset and high-frequency noise)	Conditional CPC (Contrastive Predictive Coding) GRU (encoder) + 3 layer MLP (predictor) Use three past windows to predict K=4 Negative pair: same signal at different time Training: only use the data of healthy children.	AUC: 0.96 (normal vs anomalous) average correlation coefficient of 0.8+-0.13 with respect to expert annotations	-
Self-Supervised Learning for Anomalous Channel Detection in EEG Graphs: Application to Seizure Analysis	Ho et al. (2022)	Lack of access to the labeled seizure data	1. A self-supervised method for identifying abnormal brain regions and EEG channels without access to the abnormal class data during the training phase. 2. Model brain regions and their connectivities using attributed graphs. 3. Employing contrastive and generative learning, propose an augmentation approach to create the positive and negative pairs to form contrastive and generative loss. 4. Define a channel-based anomaly score function (linear combination of the contrastive and reconstruction loss)	Seizure detection (no access to the seizure data is needed)	TUSZ	For a given EEG clip, build four types of EEG graphs: Dist-EEG-Graph: use Euclidean distance between electrodes, embed the structure of electrode locations in the graph's adjacency matrix. Rand-EEG-Graph: random connection of nodes (assume all electrodes are connected and equally contribute to brain activities, so every edge has a chance of being present in the graph) Corr-EEG-Graph: functional connectivity between electrodes (cross-correlation function, top-3 neighborhood nodes) DTF-EEG-Graph: directed transfer function graph, functional connectivity of the brain regions.	Positive and negative pair sampling: 2 positive & 1 negative sub-graphs for every node in every constructed EEG graph (positive: first select an electrode as target node, target node is anonymized in the positive subgraph (replace its feature vector with an all-zero vector); negative: first find the farthest electrode from the target node) Contrastive learning model: pairs -> GNN encoder -> all embeddings -> take avg over rows -> obtain similarity -> contrastive loss; Generative learning model: (GNN encoder) -> positive embeddings -> GNN decoder (reconstructing the target node anonymized in the positive subgraphs, using other node features and edges) -> reconstruction loss;	Specificity: EEG_t-CGS: 0.989 * EEG_t means that all four graph types are concatenated and fed to the system as the input representing the given EEG clip	Link
A Contrastive Predictive Coding-Based Classification Framework for Healthcare Sensor Data	Ren et al. (2022)	Annotating data consumes a large amount of manpower and resources	1. Design a contrastive predictive coding (CPC)-based pretext task for medical sensor data classification, redesigning the arrangement of positive and negative sample pairs. 2. Design a lightweight downstream classification model to further improve the classification accuracy.	1. Sleep stage classification 2. Arrhythmia classification	Sleep-EDF, MIT-BIH-SUP	A positive sample pair contains 8 different samples belonging to the same category; in a negative sample pair, the four left samples share one category and the four right samples share one category, but the left and right categories differ.	Pretext: predict future (GRU) CPC based model Encoder: four blocks, each block: a dense layer, a batch normalization layer, an activation layer, a dense layer. Classification: 2 Conv1D layers,	Sleep: macro avg ACC: 88.7% Arrhythmias: ACC: 97.3%	-
A Contrastive Learning Framework for ECG Anomaly Detection	Li et al. (2022)	1. Unbalanced data 2. Lack of robustness due to inconsistent ECG data representation	1. Effective sequence data augmentation methods are introduced to ECG anomaly detection, aiming to alleviate category imbalance. 2. A new contrastive learning framework that addresses the challenge of inconsistent data representation during model learning and improves robustness and accuracy.	ECG anomaly detection	MIT-BIH arrhythmia dataset, PTB	ECG signals were preprocessed and segmented, with each segment corresponding to one heartbeat. Augmentation: two methods: BiLSTM-CNN, TimeGAN (both used in this model)	Contrastive learning: Input -> BiLSTM & TimeGAN -> Encoder -> Transformer (based on attention mechanism with efficient parallel computing capabilities) -> Non-linear projection head -> Maximize similarity Detection: input -> 2 layers of (Conv + Batch Norm) -> max pool -> transformer	Arrhythmia: ACC: 96.3% PTB diagnostic ECG: ACC: 94.5%	-
Listen to your heart: A self-supervised approach for detecting murmur in heart-beat sounds for the Physionet 2022 challenge	Ballas et al. (2022)	Lack of labels in ML training	Propose two augmentation combinations to construct effective positive pairs	Murmur classification, and clinical outcome classification	PhysioNet 2016 and PhysioNet 2022 challenge datasets	5-second windows with 50% overlap Augmentation: View 1: 250 Hz high-pass filtering View 2: pollute with uniform noise and then upsample with 0.5 probability	CNN as encoder, 3-layer MLP as prediction head.	0.606 in F-score in murmur classification; 0.657 in outcome classification (F1)	-
Weak self-supervised learning for seizure forecasting: a feasibility study	Yang et al. (2022)	Reduce the burden of manual labeling	Perform a feasibility study on seizure prediction, which is identified as an ideal test case, as pre-ictal brainwaves are patient-specific, and tailoring models to individual patients is known to improve forecasting performance significantly.	Seizure detection and forecasting	TUH seizure, EPILEPSIAE dataset, RPAH dataset (private)	12 s window; ICA and STFT are applied to the EEG before pre-trained seizure detection. ICA is used for removing EOG artefacts. STFT is then applied to the clean EEG with a 250-sample window (1 s) and 50% overlap. DC component removed. Same preprocessing used on EEG for prediction.	Forecasting model: pre-trained with EPILEPSIAE, Detection model: pre-trained with TUH, Both models: 3 layers of ConvLSTM, 2 layers of FC (with sigmoid). All three tests, both pseudo-prospectively inference-only real-time tested on the RPAH dataset.	Average relative improvement in sensitivity by 14.3%, a reduction in false alarms by 19.6% in early seizure forecasting.	Link
Contrastive Heartbeats: Contrastive Learning for Self-Supervised ECG Representation and Phenotyping	Wei et al. (2022)	High cost of manual labels	Propose a contrastive learning approach that utilizes the periodic and meaningful patterns in ECG.	Cardiac arrhythmia classification	MIT-BIH, Chapman, private large-scale ECG dataset	Exclude samples with <48 bpm within the ten-second measurement; Positive pair: the anchor heartbeat with a positive heartbeat (sampled from the same ECG); Negative pair: the anchor heartbeat with a negative heartbeat (sampled from another ECG).	Heartbeats are extracted from the full-length ECG by the Hamilton R-peak segmentation algorithm; Backbone model: Causal CNN; Projector: additional fully connected layer (projects the features of the anchor); Loss: multi-similarity loss.	Linear evaluation on: MIT-BIH: ACC: 89.25; (AUROC= 0.9424) Chapman: AUROC: 0.920 Semi-supervised learning: (fine-tuned using partial labels) MIT-BIH: ACC: (50%) 0.9461	-
Practical cardiac events intelligent diagnostic algorithm for wearable 12-lead ECG via self-supervised learning on large-scale dataset	Yang et al. (2022)		1. Collected 658,486 ECGs, of which 164,538 were diagnosed and the remaining 493,948 were without diagnosis. 2. Train a Siamese network via contrastive learning and transfer the pretrained weights to downstream classification. 3. Designed four data augmentation operations for 1D digital multilead ECG signals.	Cardiac event diagnosis (55 cardiac events)	CPSC 2018, large-scale ECG dataset (cannot be open-sourced)	5th-order Butterworth high-pass filter with a lower cutoff frequency of 0.5 Hz. Data augmentation: 1. frequency dropout; 2. crop resize; 3. cycle mask: detect the position of the R peak and set the same position in each heartbeat to zero; 4. channel mask.	Momentum Contrast (MoCo): an encoder and a momentum encoder, and a projection head at the bottom of each encoder.	On CPSC 2018: F1 score: 0.839	Link
As easy as APC: overcoming missing data and class imbalance in time series with self-supervised learning	Wever et al. (2021)	High levels of missing data and strong class imbalance	Demonstrate how Autoregressive Predictive Coding (APC) can be leveraged to overcome both missing data and class imbalance simultaneously without strong assumptions.	Overcome high missingness and severe class imbalance	Synthetic dataset, Physionet challenge 2012, menstrual cycle tracking app Clue		Encoder: GRU-D (GRU Decay) APC MaskedMSE	Physionet2012 (binary): AUROC: (GRU-APC without class imbalance method): 86.0±0.5 Clue dataset (multi-class classification): weighted F1: (GRU-APC): 90.7±0.1	Link
DeepClean: Self-Supervised Artefact Rejection for Intensive Care Waveform Data Using Deep Generative Learning	Edinburgh et al. (2020)	Waveform physiological data in the ICU are susceptible to artefacts; removing artefacts reduces bias and uncertainty in clinical assessment and the false positive rate of ICU alarms.	1. A prototype self-supervised artefact detection system using a convolutional variational autoencoder deep neural network that avoids manual annotation, requiring only easily-obtained good data for training. 2. Can identify regions of artefact with high accuracy.	Artefact detection on ICU waveform physiological data	ABP waveform data from a single anonymised patient throughout a stay	Split the data into 100-second windows, normalising across the whole dataset; sample uniformly within the selected window to generate 10-second samples for the test set (which mainly contains samples marked as abnormal by an expert).	VAE with CNNs for both encoder and decoder.	Accuracy: (mean) VAE: 0.901 ROCAUC: 0.973	Link
SOM-CPC: Unsupervised Contrastive Learning with Self-Organizing Maps for Structured Representations of High-Rate Time Series	Huijben et al. (2022)	High-dimensional real-world data are difficult to interpret. Deep learning methods aim to identify this manifold, but do not promote structure or interpretability.	1. SOM-CPC, suitable for learning structured and interpretable 2D representations of high-rate time series by encoding subsequent data windows to a topologically ordered set of quantization vectors. 2. Requires far fewer auxiliary loss functions (and associated hyperparameter tuning)	Clustering	Synthetic dataset, subset 3 of MASS, subset of LibriSpeech dataset	For MASS: select three EEG channels (F4, C4, O2), two EOG channels, and one chin EMG derivation, downsampled to 128 Hz, with non-overlapping 30-second windows. Before downsampling, all derivations are filtered with a zero-phase 5th-order Butterworth band-pass filter and another zero-phase 5th-order Butterworth notch filter. Channels are normalized within-patient and per channel, yielding mean subtraction and normalization.	SOM-CPC Encoder: CNNs (details in appendix)	On sleep dataset: Purity: 0.79 NMI: 0.28 Cohen's kappa: 0.67 l_2 smooth: 1.22±0.21 TE: 0.042	-
Subject-aware contrastive learning for biosignals	Cheng et al. (2020)	Biosignal datasets have limited labels and subjects	1. Apply self-supervised learning to biosignals. 2. Develop data augmentation techniques for biosignals. 3. Integrate subject awareness into the self-supervised learning framework. 1) subject-specific distribution to compute contrastive loss 2) promoting subject invariance through adversarial training	EEG decoding, ECG anomaly detection	Physionet Motor Imagery, MIT-BIH arrhythmia	Raw EEG/ECG data for input. Data transformations: temporal cutout, temporal delays, noise, bandstop filtering, signal mixing, spatial rotation (exception), spatial shift (exception), sensor dropout, sensor cutout (exception)	Encoder and momentum encoder: 1D ResNet with ELU activation and batch normalization. Projection head and momentum projection head: 4-layer fully-connected network. Linear classification using logistic regression with weight decay.	EEG: ACC Intersubject: 81.6±0.8 (subject-specific, 2 class) Intrasubject: 79.6±2.3 (subject-invariant, 2 class) ECG: Overall: ACC: subject-specific: 93.2±1.6	-
Sense and learn: Self-supervision for omnipresent sensors	Saeed et al. (2021)	Non-generalizable representations; Lack of annotations	1. Propose 7 data augmentation schemes 2. Design a framework that uses all 7 schemes at the same time to learn generalizable representations	EEG, EOG, Heart rate, Skin conductance, accelerometer, gyroscope	HHAR, MobiAct, MotionSense, UCI HAR, HAPT, Sleep-EDF, MIT DriveDb, WiFi CSI	Blend detection, Fusion magnitude prediction, Feature prediction from masked window, Transformation recognition, Temporal shift prediction, Modality denoising, Odd segment recognition	CNN as backbone	Kappa scores. HHAR: 0.826, MobiAct: 0.89, MotionSense: 0.907, UCI HAR: 0.888; HAPT: 0.820; Sleep-EDF: 0.702; MIT DriveDb: 0.804; WiFi CSI: 0.798	-
Self-Supervised Learning From Multi-Sensor Data for Sleep Recognition	Zhao et al. (2020)	1. Most sleep recognition methods are limited to single-task recognition, which only involves single-modal sleep data. 2. Shortage and imbalance of sleep samples.	1. Study the problem of sleep recognition at three levels: sleep position recognition, sleep stage recognition, and insomnia detection. 2. A self-supervised sleep recognition model (SSRM) is proposed for multi-sensor sleep recognition.	Sleep position/sleep stage recognition, insomnia detection	Sleep Bioradiolocation dataset, Pressure Map dataset, PSG dataset	Normalize to [0, 1]. For pressure map: rotation and frequency-domain feature extraction to generate temporary labels. For PSG: preprocess and extract four-dimensional feature and count feature.		Prediction probability of CRF as the final accuracy. Bio-radar: 99.03 Pressure-e1: 99.55 Pressure-e2: 98.92 PSG-2class: 95.91 PSG-3class: 78.69 PSG-4class: 71.01	-
Contrastive Embedding Learning Method for Respiratory Sound Classification	Song et al. (2021)	1. Collection is difficult and manual annotation is expensive, so only limited samples are available. 2. Existing methods do not explicitly encourage intra-class compactness and inter-class separability between the learned embeddings.	Propose a contrastive embedding learning method that takes a contrastive tuple as input and learns the slight differences among similar samples, so that easily confused samples are more likely to be identified.	Respiratory sound classification	ICBHI 2017	Resample audio recordings to 16 kHz and segment them into respiratory cycles according to onsets and offsets. Convert the cycles to 46-dimension log Mel spectrograms with a window size of 1024 over a 256-sample hop.	Augmentation: white noise adding, time shifting, time stretching and pitch shifting Encoder: CNN Classifier: linear layer (logistic regression)	ACC: 78.73	-
A Semi-Supervised Algorithm for Improving the Consistency of Crowdsourced Datasets: The COVID-19 Case Study on Respiratory Disorder Classification	Orlandic et al. (2022)	Labelling inconsistencies and label sparsity in the crowdsourced dataset (1. potentially noisy user labels, 2. often contradictory expert labels)	1. Provide an automated approach for increasing the labeling quality of biosignal datasets. 2. The subsample of cough audio recordings identified through the SSL approach was made public	Respiratory disorder classification/COVID-19 detection	COUGHVID dataset	A cough classifier was used to remove non-cough recordings. Normalization (4th-order Butterworth low-pass filter; cutoff 6 kHz) to reduce high-frequency noise. Isolate each individual cough event. Discard any cough-sound candidates shorter than 200 ms; include 200 ms before and after the cough candidate in each segment.	Supervised (classifier): user model (based on user labels), expert 1, 2, 4 models (based on the labels of experts 1, 2 and 4) SSL model: majority agreement combines the knowledge from both users and experts to identify a subset of high-confidence samples, which is used to train the final classifier; the rest are discarded. User: Linear discriminant analysis; Expert 1, 2, 4: Logistic regression; SSL: Logistic regression.	SSL: Test AUC 0.763	-
BENDR: Using Transformers and a Contrastive Self-Supervised Learning Task to Learn From Massive Amounts of EEG Data	Kostas et al. (2021)	Lack of generalizability: a task-specific model is required	Propose a framework with contrastive pre-training that can be applied to different tasks/datasets.		MMI, BCIC, ERN, SSC, P300	Augmentation: CPC (predict the future)	CNN+Transformer-based CPC	MMI: (86.7 in BAC), BCIC: 42.6 in Accuracy, ERN 0.65 in AUROC, SSC: 0.72 in BAC; P300: 0.72 in AUROC	-
Unsupervised Anomaly Detection on Temporal Multiway Data	Nguyen et al. (2020)	Unsupervised temporal models employed thus far typically work on sequences of feature vectors, and much less on temporal multiway data.	1. Investigate the applications of matrix recurrent neural networks for unsupervised anomaly detection for temporal multiway data. 2. Two anomaly detection settings (reconstruction and prediction) are examined, and the empirical results on synthetic data, moving digits and ECG readings are reported.	Temporal multiway anomaly detection (looks for irregularities over space-time) Use reconstruction loss: an abnormal sequence does not exhibit the regularities, so it is hardly compressible, and thus its reconstruction error is expected to be higher than the error in the normal cases (if a sequence is regular (normal), the history may contain sufficient information to predict several steps ahead)	Synthetic data, MNIST, MIT-BIH Arrhythmia dataset	For MIT-BIH: manually pick 38 subjects (that have both MLII and V1 channels and no paced beats). For each univariate signal, the raw ECG is detrended by first fitting a 6th-order polynomial and then subtracting it from the signal; a 6th-order Butterworth band-pass filter with a 5 Hz to 15 Hz range is applied, and the filtered signals are normalized individually by Z-score normalization.	Pre-training: (Matrix) LSTM AutoEncoder model: encoder: matLSTM (compresses X into C by reading one matrix at a time) decoder: matLSTM decompresses the memory by predicting one matrix at a time anomaly: reconstruction loss Fine-tuning: (Matrix) LSTM Encoder-Predictor predictive model: anomaly score: mean prediction error	matLSTM: (for predicting 5 heartbeats) AUC: 92.5±0.1 F1: 72.8±0.2	-
Self-supervised EEG Representation Learning for Automatic Sleep Staging	Yang et al. (2021)	1. Unlabeled and noisy data. 2. Existing negative sampling strategies often incur sampling bias.	1. Pretext task: address the inherent limitations of negative sampling in the existing self-supervised methods (e.g., MoCo2, SimCLR3) by leveraging global data statistics. 2. Strengthen the model with an instance-aware world representation for each sample, where closer samples are assigned larger weights.	Sleep stage classification	SHHS, Sleep EDF, MGH Sleep	Subjects are randomly assigned to the pretext group, training group, and test group with different proportions. Augmentation: Bandpass Filtering, Noising, Channel Flipping, Shifting. ContraWR: Contrast with the World Representation (generate an average representation of the dataset, 𝒛𝒘, as the only contrastive information.) ContraWR+: Contrast with Instance-aware World Representation (weighted average of the world/dataset, where the weight is set higher for closer samples.)	Classifier: train a separate logistic regression model (on top of the encoder) on data from the training group (during which the encoder is frozen) and test on new recordings. Projector: 2-layer fully connected network. Encoder: STFT (Short-Time Fourier Transform) module; the resulting STFT spectrogram passes through convolutional layers with batch normalization (the CNN-based encoder is built on top of the spectrogram)	5 class classification ContraWR+: Sleep EDF: 86.90±0.2288 SHHS: 77.97±0.2693 MGH Sleep: 72.03±0.1823 Baseline: MoCo SimCLR BYOL SimSiam	Link
Self-Supervised Learning for Sleep Stage Classification with Predictive and Discriminative Contrastive Coding	Xiao et al. (2021)	1. Labeling work is costly and laborious in terms of specialist experience and manual work. 2. Ground truth labels annotated by sleep experts can also be contradictory, which has a bad influence on label-reliant tasks. 3. Representations extracted by supervised models are not generalized.	1. The proposed SleepDPC framework is a pioneer in applying SSL to sleep stage classification. 2. Proposed two learning principles, predictive contrastive coding and discriminative contrastive coding, which enable the extraction of high-level semantics (underlying rhythms and patterns) from raw EEG.	Sleep stage classification	Sleep-EDF, ISRUC	Combining PCC and DCC: PCC (predictive contrastive coding): other representations (at different timesteps) in the mini-batch are considered "unrelated" (negative); DCC (discriminative contrastive coding): representations in different segments of a mini-batch are temporally distant and treated as negative pairs.	Pre-train: encoder: CNN aggregator: GRU and LSTM predictor: not mentioned Fine-tuning: encoder and aggregator are frozen. classifier: one-layer fully-connected network.	SleepDPC (10% labels) SleepEDF: Accuracy: 0.701±0.008 F1-macro: 0.640±0.015 ISRUC: Accuracy: 0.536±0.015 F1-macro: 0.489±0.018	Link
CoSleep: A Multi-View Representation Learning Framework for Self-Supervised Learning of Sleep Stage Classification	Ye et al. (2022)	1. Large-scale labeled datasets are still hard to acquire 2. DPC operates discrimination at an instance level (treats each instance as a single class); the seasonality of time series indicates that distant instances can be semantically close	1. Novel co-training scheme that exploits complementary information from the time and frequency views of physiological signals to mine more positive samples. 2. Extend the framework with a memory module, implemented by a queue and a moving-averaged encoder, to enlarge the pool of negative candidates.	Sleep stage classification	SleepEDF, ISRUC	Use a multi-instance InfoNCE loss, calculating the loss function using multiple positive samples. Select the Top-K positive samples by time- and frequency-domain similarities.	Pre-training: Two encoders: CNN with residual connections (ResNet); aggregator: GRU/LSTM Fine-tuning: encoder and aggregator are frozen. classifier: one-layer fully-connected network (10% labels).	CoSleep: SleepEDF: ACC: 0.716±0.043 F1: 0.558±0.03 ISRUC: ACC: 0.579±0.051 F1: 0.501±0.056	Link
A Self-Supervised Learning Based Channel Attention MLP-Mixer Network for Motor Imagery Decoding	He et al. (2022)	1. The performance of CNNs for MI EEG decoding is generally limited by the small sample size problem. 2. To address 1, EEG trials are segmented into small slices, which usually loses the long-range dependencies of temporal information.	1. A new EEG slice prediction task as the pretext task to capture long-range information in the time domain. 2. In the downstream task, an MLP-Mixer is used to classify signals (rather than images). 3. An attention mechanism is integrated into the MLP-Mixer to estimate the importance of each EEG channel.	Motor Imagery (movement imagination classification)	MI-2 Dataset, BCIC-IV-2A Dataset	150-time-point sliding window (overlap of 10 points), z-score normalization on each slice.	Pretext task: 3 adjacent EEG slices -> local encoder (1D CNN) -> concatenation -> LSTM layers -> conv and linear -> predicted EEG slice Downstream task: EEG slice -> Local encoder (with weights from pretext) -> Channel-attention MLP-Mixer (CAU & TMU) -> Classifier (global average pooling -> Linear layer -> Softmax -> Prediction)	MI-2: ACC: 78.5±0.64 F1: 78.39±0.67 BCI-IV-2A: ACC: 79.43±1.73 F1: 79.42±1.74	-
Self-supervised Contrastive Learning for EEG-based Sleep Staging	Jiang et al. (2021)	Data shortage for supervised learning	Propose a self-supervised contrastive learning method for EEG sleep staging classification that measures the feature similarity of transformed signal pairs.	EEG-based sleep staging classification	Sleep-edf, Sleep-edfx, Dod-O, Dod-H	Transformations: Sleep-edf: crop&resize + permutation; crop&resize + crop&resize. together: crop&resize + time warping; crop&resize + permutation	SSL training: input: transformed unlabelled data; encoder: ResNet based; positive pair: homologous pair; negative pairs: others. Fine tuning: classifier: FC layers.	Healthy subjects: Acc: 88.16; F1: 81.96 Healthy and subjects with sleep disorders: Acc: 84.42; F1: 78.95	Link
Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency	Zhang et al. (2022)	Lack of data labels	Propose the assumption of Time-Frequency Consistency: the information carried in the time domain and in the frequency domain is equivalent.	Sleep disorder, Epilepsy detection, Mechanical fault detection, etc.		Time domain: shift, jittering, etc. Frequency domain: adding/removing frequency components	CNN-based encoder, MLP-based projector	-	Link
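
Several rows above list sample-level augmentations such as jittering, time shifting, and adding or removing frequency components. The block below is a minimal NumPy sketch of those four operations, included only to make the terms concrete. It is not taken from any reviewed paper or from `Timeseries_augmentations.ipynb`; the function names, default parameters, and the random masking scheme are illustrative assumptions.

```python
import numpy as np


def _rng(rng):
    # Use the caller's generator if given, otherwise create a fresh one.
    return np.random.default_rng() if rng is None else rng


def jitter(x, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise (hypothetical default sigma)."""
    return x + _rng(rng).normal(0.0, sigma, size=x.shape)


def time_shift(x, max_shift=10, rng=None):
    """Circularly shift the signal along the time (last) axis."""
    shift = int(_rng(rng).integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=-1)


def remove_frequency(x, ratio=0.1, rng=None):
    """Zero out a random fraction of frequency bins, then invert the FFT."""
    spec = np.fft.rfft(x, axis=-1)
    keep = _rng(rng).random(spec.shape) >= ratio  # True = bin is kept
    return np.fft.irfft(spec * keep, n=x.shape[-1], axis=-1)


def add_frequency(x, ratio=0.1, scale=0.1, rng=None):
    """Boost a random fraction of frequency bins by a small amplitude."""
    spec = np.fft.rfft(x, axis=-1)
    chosen = _rng(rng).random(spec.shape) < ratio
    # Perturbation size is relative to the largest existing component.
    amp = scale * np.abs(spec).max(axis=-1, keepdims=True)
    return np.fft.irfft(spec + chosen * amp, n=x.shape[-1], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = np.sin(np.linspace(0, 8 * np.pi, 256))  # toy 1D signal
    views = [f(signal, rng=rng) for f in
             (jitter, time_shift, remove_frequency, add_frequency)]
    print([v.shape for v in views])
```

Each function keeps the input shape and works on the last axis, so the same calls apply to a single channel of shape `(T,)` or a multichannel array of shape `(C, T)`. Two differently augmented views of the same sample would typically form a positive pair during contrastive pre-training.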