Jwoo5/fairseq-signals

Train with own dataset

sehunfromdaegu opened this issue · 6 comments

Hi, thank you for sharing the great work!

I want to pretrain a model with one of the methods here on my own dataset.

My dataset is stored as one huge NumPy array. Is it possible to start training directly from a NumPy array?

Thank you.

[Additional question.] I cannot find how the CMSC method treats ECG segments from different leads that cover the same time span. Does this method only train on single-lead datasets?

Hi Sehun,

fairseq-signals does not currently provide a dataset class that reads a NumPy array directly.
To use the package without any modification, you would need to split the array into one `.mat` file per sample, readable by `scipy.io.loadmat(...)`. Each file should store the ECG signal under the key `feats` (shape: [L, T], where L is the number of leads and T is the sample size) and the sampling rate under the key `curr_sample_rate`. Then prepare a manifest file (`.tsv`) whose first line is the root directory path and whose remaining lines each describe one sample in this format: `$sample_path.mat \t $sample_size`. I recommend processing the physionet2021 dataset first and inspecting the resulting data and manifest file.
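As a rough illustration of the conversion described above, here is a minimal sketch. It assumes the array has shape [N, L, T] and a single sampling rate shared by all samples; the function name `export_to_mat`, the file naming scheme, the manifest name `train.tsv`, and writing the manifest into the same directory as the samples are all hypothetical choices, not something the package prescribes.

```python
import os

import numpy as np
import scipy.io


def export_to_mat(data, sample_rate, out_dir, manifest_name="train.tsv"):
    """Write one .mat file per sample plus a .tsv manifest.

    data: array of shape [N, L, T] (assumed layout: samples, leads, time steps).
    sample_rate: sampling rate shared by all samples (assumption).
    """
    os.makedirs(out_dir, exist_ok=True)
    root = os.path.abspath(out_dir)
    manifest_path = os.path.join(root, manifest_name)

    with open(manifest_path, "w") as manifest:
        # First line of the manifest: the root directory path.
        manifest.write(root + "\n")
        for i, sample in enumerate(data):
            # Hypothetical file naming scheme.
            fname = f"sample_{i:07d}.mat"
            scipy.io.savemat(
                os.path.join(root, fname),
                {"feats": sample, "curr_sample_rate": sample_rate},
            )
            # Following lines: "<path relative to root>\t<sample size T>".
            manifest.write(f"{fname}\t{sample.shape[-1]}\n")

    return manifest_path
```

For example, `export_to_mat(np.load("ecgs.npy"), 500, "data/my_ecg")` would write the samples and the manifest under `data/my_ecg` (the path, file name, and 500 Hz rate are placeholders). It is worth comparing the output against the files produced by the physionet2021 preprocessing before training.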

For your additional question: we did not include the CMLC or CMSMLC methods in this package. Instead, the model processes all 12 leads at once and is trained in the CMSC manner, where representation vectors from different regions of the same ECG are pulled closer together while vectors from different ECGs are pushed further apart.

Thank you very much for the kind explanation.
For the second question, I was wondering how CMSC processes a 12-lead ECG in your experiments. The explanation in your paper seems to focus on the single-lead case for CMSC, but the experiments are done with 12-lead ECGs as well.

For example, if x and y correspond to different leads of a single 12-lead ECG, are they treated as different inputs for CMSC?

Again, thanks for sharing this wonderful repo :)

Not exactly. The model processes whole 12-lead ECGs, not single-lead ECGs.
In other words, the model takes a 12-lead ECG as input and outputs the corresponding representation vector.
For CMSC, after processing each 5-second 12-lead ECG in a batch separately, we gather the pairs of representation vectors that come from the same ECG and pull them closer together, while pushing pairs from different ECGs further apart. To make this possible, we force each batch to be composed of pairs of 2 adjacent 5-second samples from the same ECG, so that the CMSC loss can be calculated for each pair.
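To make the pairing concrete, here is a minimal NumPy sketch of a contrastive loss of this general kind. It is a generic InfoNCE-style formulation, not the code used in this package: the function name `cmsc_style_loss`, the cosine similarity, the symmetric averaging, and the temperature of 0.1 are all assumptions. Row `i` of `z_a` and row `i` of `z_b` are taken to be the representation vectors of two adjacent 5-second segments of the same 12-lead ECG.

```python
import numpy as np


def cmsc_style_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over paired segment representations.

    z_a, z_b: arrays of shape [B, D]. Row i of each comes from the same ECG
    (two adjacent segments); all other rows act as negatives.
    """
    # L2-normalize so that the dot product is the cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # logits[i, j]: similarity between segment A of ECG i and segment B of ECG j.
    logits = z_a @ z_b.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The positive for row i is column i (same ECG).
        return -np.mean(np.diag(log_prob))

    # Average both directions (A -> B and B -> A).
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

The loss is small when each segment is most similar to its partner from the same ECG and large when it is more similar to a segment from a different ECG.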

Then, for a single 12-lead ECG, we get 12*2=24 representation vectors (two representations for each lead). For linear evaluation, we need a single representation for a 12-lead ECG. Do you take the average of all 24 representation vectors for the downstream task?

The model implemented in this package doesn't process each lead individually. It processes all leads at once, which means that a 12-lead ECG is ultimately converted into a single 768-length vector.

Thank you for clarifying :)