lh3/miniprot

Is it necessary to mask repetitive sequences?

Closed this issue · 1 comment

Hi, Professor Li, excuse me

  1. Should there be no difference in miniprot results between soft-masked and unmasked input?
  2. If hard masking is used, some CDS sequences extracted from the GFF file contain about 10 to 100 Ns. How should I deal with these CDS sequences containing N?
  3. How should I filter the extracted CDS sequences, whether or not the repetitive sequences are masked? For example, should I discard everything whose protein is shorter than 50 amino acids, or use some other criterion?
    The following is a CDS sequence extracted from the GFF file produced by miniprot, annotating with proteins of the same species. The length of the protein as translated by seqkit and gffread also confused me a little.
    0945437d1960a34ee7fdac040762c75
    aa73200dba654850b502b00e5741620
lh3 commented

Repeat masking will affect the alignment of some proteins, as you showed. I don't know whether the overall effect is positive or negative. You will have to investigate that yourself.
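
As a starting point for that kind of investigation, here is a minimal Python sketch that summarizes extracted CDS sequences and applies the sort of filter described in the question. It is not part of miniprot, seqkit, or gffread. The function names (`read_fasta`, `cds_stats`, `keep_cds`) and the thresholds (`min_codons=50`, `max_n_fraction=0.0`) are hypothetical, with the 50 taken only from the example in the question, and nothing here is a recommended standard.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cds_stats(seq: str) -> Dict[str, object]:
    """Basic per-CDS statistics; soft-masked (lowercase) bases are uppercased."""
    seq = seq.upper()
    n_count = seq.count("N")
    return {
        "length": len(seq),
        "n_count": n_count,
        "n_fraction": n_count / len(seq) if seq else 0.0,
        "codons": len(seq) // 3,          # includes a stop codon if present
        "in_frame": len(seq) % 3 == 0,    # length is a multiple of three
    }


def keep_cds(seq: str, min_codons: int = 50, max_n_fraction: float = 0.0) -> bool:
    """Hypothetical filter: in-frame, long enough, and not too many Ns."""
    s = cds_stats(seq)
    return (
        bool(s["in_frame"])
        and s["codons"] >= min_codons
        and s["n_fraction"] <= max_n_fraction
    )
```

Running the same statistics on CDS sets extracted from unmasked, soft-masked, and hard-masked runs, for example via `read_fasta(open(path))` on each file, gives a direct view of how many records each masking choice changes. The `codons` value counts a terminal stop codon if the CDS has one, so it can differ by one from the translated protein length.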