LiPCoT: Linear Predictive Coding based Tokenizer for Self-Supervised Learning of Time Series Data via BERT
LiPCoT (Linear Predictive Coding based Tokenizer for time series) is a novel tokenizer that encodes time series data into a sequence of tokens, enabling self-supervised learning of time series using existing language model architectures such as BERT.
Main Article: LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models
- Unlike traditional time series tokenizers that rely heavily on CNN encoders for feature generation, LiPCoT employs stochastic modeling through linear predictive coding to create a latent space for time series, providing a compact yet rich representation of the inherent stochastic nature of the data (a rough sketch of the idea follows this list).
- LiPCoT is computationally efficient and can effectively handle time series data with varying sampling rates and lengths, overcoming common limitations of existing time series tokenizers.
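For intuition only, here is a minimal sketch of LPC feature extraction using the autocorrelation method and the Levinson-Durbin recursion. The window length, hop, and model order are illustrative choices, not the repository's actual settings:

```python
import numpy as np

def levinson_durbin(r: np.ndarray, order: int) -> np.ndarray:
    """Solve the LPC normal equations given autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:]  # a[1..order]; the leading coefficient is fixed at 1

def lpc_features(signal: np.ndarray, frame_len=256, hop=128, order=8) -> np.ndarray:
    """One LPC coefficient vector per sliding window of the signal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        r = np.correlate(frame, frame, "full")[frame_len - 1:frame_len + order]
        feats.append(levinson_durbin(r, order))
    return np.asarray(feats)  # shape: (n_frames, order)
```

Each window is thus summarized by a small vector describing its autoregressive (stochastic) structure, which is what gets discretized into tokens in Step 2 below.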
If you use this dataset or code in your research, please cite the following paper:
```bibtex
@misc{anjum2024lipcot,
      title={LiPCoT: Linear Predictive Coding based Tokenizer for Self-supervised Learning of Time Series Data via Language Models},
      author={Md Fahim Anjum},
      year={2024},
      eprint={2408.07292},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
We use an EEG dataset from 28 Parkinson's disease (PD) and 28 control participants.
- The original dataset can be found at link. The data are in .mat format and require MATLAB to load. (Not needed unless you are interested in the original EEG data.)
- The raw CSV dataset used for this repo can be found at link. Download this to run all steps in this repo.
- If you want to run all steps:
  - Download the raw CSV dataset and place it in the `data/raw` folder
  - Run Steps 1-7
- If you want to run only the BERT models, run Steps 4-6. There is no need to download the raw data, as the processed dataset is included in this repo.
Step 1: The data must be processed first. The `data_processing` notebook loads the raw data and prepares the training, validation, and test datasets.
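As an illustration of what this preparation can involve (the actual logic lives in the notebook), one reasonable scheme is to split by participant so that no subject appears in more than one split. All names and ratios here are hypothetical:

```python
from sklearn.model_selection import train_test_split

def split_by_subject(subject_ids, labels, seed=0):
    """80/20 train/test split of participants, then 75/25 train/val."""
    train_ids, test_ids = train_test_split(subject_ids, test_size=0.2,
                                           stratify=labels, random_state=seed)
    train_ids, val_ids = train_test_split(train_ids, test_size=0.25,
                                          random_state=seed)
    return train_ids, val_ids, test_ids
```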
Step 2: The `data_tokenizer` notebook tokenizes the data using the LiPCoT model.
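Conceptually, tokenization maps each window's continuous LPC feature vector to a discrete id from a fixed vocabulary. The sketch below uses k-means as a stand-in codebook; whether this matches LiPCoT's exact quantizer, and the vocabulary size, are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(train_feats: np.ndarray, vocab_size: int = 64) -> KMeans:
    """Learn `vocab_size` centroids on training-split features only."""
    return KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(train_feats)

def tokenize(feats: np.ndarray, codebook: KMeans) -> np.ndarray:
    """One integer token per window: the id of the nearest centroid."""
    return codebook.predict(feats)
```

For example, `tokenize(lpc_features(eeg_channel), codebook)` yields a token sequence that a language model can consume.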
Step 3: The `data_prepare` notebook prepares the datasets for the BERT models. If you are downloading from GitHub, everything up to this step has already been done for you.
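A minimal sketch of what "preparing datasets for BERT" can look like, assuming the Hugging Face `datasets` library; the special-token ids and field names are assumptions, not the repo's actual values:

```python
from datasets import Dataset

CLS_ID, SEP_ID = 1, 2  # assumed ids; the repo's vocabulary may assign these differently

def to_bert_example(token_ids, label):
    """Wrap a LiPCoT token sequence in BERT-style special tokens."""
    ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {"input_ids": ids, "attention_mask": [1] * len(ids), "label": int(label)}

def build_dataset(sequences, labels):
    return Dataset.from_list([to_bert_example(s, y) for s, y in zip(sequences, labels)])
```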
Step 4: The `pretrain_bert` notebook pretrains the BERT model. If you are running the code with the data from GitHub, start with this step.
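For orientation, here is a sketch of from-scratch masked-language-model pretraining with Hugging Face `transformers`. The model size, hyperparameters, and special-token ids are illustrative, and the collator is simplified (it always substitutes the mask token, unlike BERT's 80/10/10 scheme):

```python
import torch
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

MASK_ID, PAD_ID = 3, 0  # assumed ids in the LiPCoT vocabulary

def mlm_collate(examples, mlm_prob=0.15):
    """Pad a batch and mask ~15% of non-pad tokens for the MLM objective."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch_ids, batch_labels, batch_attn = [], [], []
    for e in examples:
        ids = torch.tensor(e["input_ids"] + [PAD_ID] * (max_len - len(e["input_ids"])))
        masked = (torch.rand(ids.shape) < mlm_prob) & (ids != PAD_ID)
        batch_labels.append(torch.where(masked, ids, torch.tensor(-100)))  # loss on masked slots only
        batch_ids.append(torch.where(masked, torch.tensor(MASK_ID), ids))
        batch_attn.append((ids != PAD_ID).long())
    return {"input_ids": torch.stack(batch_ids),
            "attention_mask": torch.stack(batch_attn),
            "labels": torch.stack(batch_labels)}

config = BertConfig(vocab_size=70, hidden_size=128, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=256)
model = BertForMaskedLM(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrain_out", num_train_epochs=10,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,  # built in Step 3 (assumed in scope)
    data_collator=mlm_collate,
)
trainer.train()
```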
Step 5: The `finetune_bert` notebook fine-tunes the BERT model for binary classification.
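A sketch of the fine-tuning step, assuming the pretrained weights were saved to a local directory; the path, hyperparameters, and dataset variables are placeholders:

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Attach a 2-way head (PD vs. control) on top of the pretrained encoder.
model = BertForSequenceClassification.from_pretrained("pretrain_out", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune_out", num_train_epochs=5,
                           per_device_train_batch_size=16),
    train_dataset=train_dataset,  # labeled examples from Step 3, assumed pre-padded
    eval_dataset=val_dataset,
)
trainer.train()
```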
Step 6: The `finetune_bert_without_pretrain` notebook fine-tunes a randomly initialized BERT model for classification.
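The no-pretraining baseline differs only in how the model is constructed: weights are drawn fresh from the config rather than loaded from a checkpoint, e.g.:

```python
from transformers import BertConfig, BertForSequenceClassification

# Same (illustrative) architecture as in Step 4, but with fresh random weights;
# fine-tuning then proceeds exactly as in Step 5.
config = BertConfig(vocab_size=70, hidden_size=128, num_hidden_layers=4,
                    num_attention_heads=4, intermediate_size=256, num_labels=2)
model = BertForSequenceClassification(config)
```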
Step 7: Benchmark classifiers:
- The `cnn_classifier` notebook uses the CNN model described in Oh et al. (2018).
- The `deepnet_classifier` notebook uses the Deep Convolutional Network described in Schirrmeister et al. (2017).
- The `shallownet_classifier` notebook uses the Shallow Convolutional Network described in Schirrmeister et al. (2017).
- The `eegnet_classifier` notebook uses EEGNet as described here.
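For reference, a generic PyTorch 1-D CNN for (channels × time) EEG windows is sketched below. It is only a schematic stand-in for this family of baselines, not the exact architecture from any of the cited papers:

```python
import torch
import torch.nn as nn

class SimpleEEGCNN(nn.Module):
    """Illustrative 1-D CNN baseline for EEG classification."""
    def __init__(self, n_channels: int, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):  # x: (batch, n_channels, n_times)
        return self.classifier(self.features(x).squeeze(-1))

# Example: logits = SimpleEEGCNN(n_channels=32)(torch.randn(8, 32, 512))
```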