bowang-lab/genomic-FM

Refine the sequence extractor to be compatible with non-SNP variants

Zehui127 opened this issue · 1 comments

Description

The current sequence extractor introduced in #2 handles SNPs correctly, but for non-SNP variants such as deletions and insertions it relies on sequence shifting in both cases. It would be useful to have alternative implementations, such as the ones below, so that we can compare how foundation models perform under each approach.

Tasks

  • For deletion: Replace the deleted bases with N in the alternate sequence. This may cause problems for models whose vocabulary has no N (null) token. For example, with 'Reference Base': 'TGAA' and 'Alternate Base': ['T'], the output should be CCCCTGAACCCC -> CCCCTNNNCCCC.
  • For insertion: Insert N tokens into the original reference so that it has the same length as the alternate sequence. For example, with 'Reference Base': 'T' and 'Alternate Base': ['TGAA'], the output should be CCCCTNNNCCCC -> CCCCTGAACCCC.
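The two tasks above can be sketched with a single helper. This is only a minimal illustration, not the extractor from #2: the function name `pad_variant` and its parameters are hypothetical, and it assumes the reference and alternate alleles are left-aligned and share their leading anchor base, as in the examples above.

```python
def pad_variant(left_context, ref, alt, right_context, pad="N"):
    """Return (reference_sequence, alternate_sequence) with equal lengths.

    Hypothetical helper: the shorter allele is right-padded with `pad`
    so that the flanking context is not shifted.
    """
    width = max(len(ref), len(alt))
    # Deletion: alt is shorter, so alt receives the N padding.
    # Insertion: ref is shorter, so ref receives the N padding.
    # SNP: both alleles have the same length, so nothing is padded.
    ref_padded = ref.ljust(width, pad)
    alt_padded = alt.ljust(width, pad)
    return (
        left_context + ref_padded + right_context,
        left_context + alt_padded + right_context,
    )


# Deletion example from this issue: TGAA -> T
print(pad_variant("CCCC", "TGAA", "T", "CCCC"))
# Insertion example from this issue: T -> TGAA
print(pad_variant("CCCC", "T", "TGAA", "CCCC"))
```

Because both returned sequences always have the same length, downstream positions stay aligned between reference and alternate. A model without an N token would still need a separate fallback, such as the existing sequence shifting.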

outdated issue