Refine the sequence extractor to be compatible with the non-SNP variants
Zehui127 opened this issue · 1 comments
Zehui127 commented
Description
The current sequence extractor introduced in #2 handles SNPs correctly. For non-SNP variants (deletions and insertions), it currently shifts the sequence in both cases. It would be useful to have alternative implementations, such as the following, so that we can compare the performance of foundation models under each approach.
Tasks
- For deletion: replace the deleted tokens with N. This may cause issues for models that do not have a NULL token. For example, with 'Reference Base': 'TGAA' and 'Alternate Base': ['T'], the output should be CCCCTGAACCCC -> CCCCTNNNCCCC.
- For insertion: insert N tokens into the original reference at the positions of the inserted bases. For example, with 'Reference Base': 'T' and 'Alternate Base': ['TGAA'], the output should be CCCCTNNNCCCC -> CCCCTGAACCCC.
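The two tasks above could be sketched roughly as follows. This is only an illustration of the N-padding idea, not the extractor from #2: the function name `pad_variant`, its parameters, and the assumption that the flanking context is passed in as plain strings are all hypothetical.

```python
def pad_variant(left, ref, alt, right, pad="N"):
    """Return (ref_seq, alt_seq) with equal length.

    Hypothetical helper: the shorter allele is right-padded with `pad`
    so that the flanking context is not shifted.
    """
    if len(pad) != 1:
        raise ValueError("pad must be a single character")
    width = max(len(ref), len(alt))
    # str.ljust pads on the right up to `width` with the fill character.
    ref_seq = left + ref.ljust(width, pad) + right
    alt_seq = left + alt.ljust(width, pad) + right
    return ref_seq, alt_seq


# Deletion: 'Reference Base' TGAA, 'Alternate Base' T
print(pad_variant("CCCC", "TGAA", "T", "CCCC"))
# ('CCCCTGAACCCC', 'CCCCTNNNCCCC')

# Insertion: 'Reference Base' T, 'Alternate Base' TGAA
print(pad_variant("CCCC", "T", "TGAA", "CCCC"))
# ('CCCCTNNNCCCC', 'CCCCTGAACCCC')
```

With this approach, SNPs are left unchanged because both alleles already have the same length. Models without an N token in their vocabulary would still need a different `pad` character or a fallback.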
Zehui127 commented
outdated issue