tttianhao/CLEAN

Duplicated amino acid sequences in datasets

Closed this issue · 0 comments

There exist duplicated amino acid sequences in split100.csv and new.csv.

The ESM feature extraction script provided by Facebook does not allow duplicated sequences. However, this repository relies on that script, which cannot work properly when duplicates are present. Please provide a detailed procedure describing how the features for the sequences in the NEW-392 and Split-100 datasets were obtained.

There are almost 30k duplicates in split100.csv, but none in split70.csv, which is odd as well. If the sequences are duplicated intentionally for contrastive learning, why are there no duplicates in split70.csv?
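
For anyone who wants to reproduce the count, here is a minimal sketch of one way to check a file for repeated sequences. It assumes the CSV files are tab-separated with a header row containing a `Sequence` column; adjust `delimiter` and `column` if the layout differs. The helper names (`duplicate_sequences`, `load_rows`, `report`) are hypothetical and not part of this repository.

```python
import csv
from collections import Counter


def duplicate_sequences(rows, column="Sequence"):
    """Return {sequence: count} for every sequence that occurs more than once."""
    counts = Counter(row[column] for row in rows)
    return {seq: n for seq, n in counts.items() if n > 1}


def load_rows(path, delimiter="\t"):
    """Read a delimited file with a header row into a list of dicts.

    Assumption: the file is tab-separated; pass delimiter="," otherwise.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def report(path, column="Sequence"):
    """Print how many rows share a sequence with at least one other row."""
    rows = load_rows(path)
    dups = duplicate_sequences(rows, column=column)
    # Rows beyond the first occurrence of each repeated sequence.
    redundant = sum(n - 1 for n in dups.values())
    print(f"{path}: {len(rows)} rows, {len(dups)} repeated sequences, "
          f"{redundant} redundant rows")
```

Calling `report(...)` on the local paths to split100.csv, split70.csv, and new.csv should show whether the difference between the files described above can be reproduced.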