annotate_cluster.pl excludes identical sequences
carolynzy opened this issue · 5 comments
Hi, I'm using annotate_cluster.pl on my clusters and noticed something strange.
Take cluster 1228009 as an example: it contains 421 sequences, each 64 aa long and almost identical to one another. However, annotate_cluster.pl aligns only 251 of them.
The manual says that some short fragments can be left out because they do not align to the longest sequence, but I don't think that applies here.
log.annotate_cluster.1228009.txt
1228009_ubiquitin-like_prote.txt
I have attached the FASTA file and the log file. Could you please look into this? Thank you!
P.S. I took a further look, and it seems that no matter how many sequences a cluster contains, at most 251 of them are aligned, even when the sequences are highly similar.
Hi @carolynzy, did you use option -c?
No. I didn't use -c.
Will have a look later in the day
Hi @carolynzy,
there were two issues here:
- The code was taking only the default number of hits reported by BLASTP; it now takes as many hits as there are sequences in the cluster. See 561894a
- The sequences in your sample cluster have redundant names, see:
perl -lne 'if(/^>(\S+)/){ print $1 }' 1228009_ubiquitin-like_prote.txt | sort -u |wc
355 355 4263
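The count above (355 unique names for the 421 sequences in the cluster) shows that names repeat, but not which ones. A minimal sketch for listing them, assuming the name is the first whitespace-delimited token after `>` as in the one-liner above; `list_duplicate_names` is a hypothetical helper, not part of the package:

```shell
# Hypothetical helper (not part of annotate_cluster.pl):
# print every FASTA sequence name that occurs more than once in file $1.
list_duplicate_names() {
    # Take the first token of each header line, drop the leading '>',
    # then keep only the names that appear on two or more headers.
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}
```

Calling `list_duplicate_names 1228009_ubiquitin-like_prote.txt` prints one repeated name per line; empty output means every name is unique.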
I have committed the changes; you should fix the redundant sequence names on your side to resolve this limitation,
Bruno
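One possible way to fix redundant names before rerunning the script is to append a counter to every repeated name. The sketch below is only an illustration under that assumption: `uniquify_fasta_names` is a hypothetical helper, the `_2`, `_3`, ... suffix scheme is arbitrary, and it does not check whether a renamed entry collides with a name that already exists. If the repeated names are true duplicate records, removing them may be the better choice.

```shell
# Hypothetical helper (not part of annotate_cluster.pl):
# write file $1 to stdout, appending _2, _3, ... to the 2nd, 3rd, ...
# occurrence of each sequence name; the first occurrence is left unchanged.
uniquify_fasta_names() {
    awk '/^>/ {
            name = substr($1, 2)          # first header token without ">"
            n = ++seen[name]              # how many times this name has appeared
            if (n > 1)                    # rebuild the header, keeping the description
                $0 = ">" name "_" n substr($0, length($1) + 1)
         }
         { print }' "$1"
}
```

For example, `uniquify_fasta_names 1228009_ubiquitin-like_prote.txt > renamed.txt` writes the renamed copy to a new file (the output name here is arbitrary) and leaves the original untouched.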
@eead-csic-compbio Thank you! I had already changed the names, but somehow I still uploaded the original version. Thank you very much!