eead-csic-compbio/get_homologues

annotate_cluster.pl excludes identical sequences

carolynzy opened this issue · 5 comments

Hi, I'm using annotate_cluster.pl on my clusters and noticed something strange.
Take cluster 1228009 for example: it contains 421 sequences, each 64 aa long and almost identical to the others. However, annotate_cluster.pl aligns only 251 of them.
The manual says that some short fragments can be left out when they do not align to the longest sequence, but I don't think that applies here.
log.annotate_cluster.1228009.txt
1228009_ubiquitin-like_prote.txt

I attached the fasta file as well as the log file. Would you please check this issue? Thank you!

P.S. I took a further look, and it seems that no matter how many sequences are in the cluster, at most 251 are aligned, even though the sequences are highly similar.

Hi @carolynzy, did you use option -c?

No. I didn't use -c.

Will have a look later in the day

Hi @carolynzy ,
there were two issues here:

  1. The code was taking only the default number of hits reported by BLASTP; now it takes as many hits as there are sequences in the cluster. See 561894a

  2. The sequences in your sample cluster have redundant names, see:

    perl -lne 'if(/^>(\S+)/){ print $1 }' 1228009_ubiquitin-like_prote.txt | sort -u |wc
    355 355 4263
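To see which names are repeated, and not only how many are unique, a minimal Python sketch follows. It is not part of get_homologues: the function names `fasta_ids` and `duplicated_ids` are hypothetical, and it assumes, as the one-liner above does, that a sequence name is the first whitespace-delimited token after `>`. With 421 sequences and 355 unique names, 66 entries in the sample cluster reuse a name that already appeared.

```python
import re
from collections import Counter

# Same pattern as the Perl one-liner: ">" followed by non-whitespace characters.
HEADER = re.compile(r"^>(\S+)")


def fasta_ids(lines):
    """Return the name (first token) of every FASTA header, in file order."""
    return [m.group(1) for line in lines if (m := HEADER.match(line))]


def duplicated_ids(ids):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(ids)
    return {name: n for name, n in counts.items() if n > 1}


# Toy input (made up for illustration, not taken from the attached cluster).
example = [">seqA some description", "MKV", ">seqB", "MKV", ">seqA other", "MKV"]
ids = fasta_ids(example)
print(len(ids), len(set(ids)), duplicated_ids(ids))  # prints: 3 2 {'seqA': 2}
```

To check a real file, pass an open file handle instead of the toy list, for example `fasta_ids(open("1228009_ubiquitin-like_prote.txt"))`.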

I have committed the changes; you should take care of the sequence names on your side to resolve this limitation,
Bruno

@eead-csic-compbio Thank you! I had already changed the names, but somehow I still uploaded the original version. Thank you very much!