eead-csic-compbio/get_homologues

Problem generating alignment file while using annotate_cluster.pl

apoorva004 opened this issue · 5 comments

Hi, I used the annotate_cluster.pl script to annotate my clusters of interest.
The following command was used:
./annotate_cluster.pl -f ./sample_intersection_29thMarch/149064_gseA.faa -o ./sample_intersection_29thMarch/149064_gseA1.aln.faa -D -P

I got this output:

Checking required binaries and data sources, all set in phyTools.pm :
EXE_BLASTP : OK (path:/home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828/bin/ncbi-blast-2.8.1+/bin/blastp)
EXE_BLASTN : OK (path:/home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828/bin/ncbi-blast-2.8.1+/bin/blastn)
EXE_FORMATDB : OK (path:/home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828/bin/ncbi-blast-2.8.1+/bin/makeblastdb)
EXE_MVIEW : OK (path:/home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828/lib/mview/bin/mview )
EXE_HMMPFAM : OK (/home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828//bin/hmmer-3.1b2/binaries/hmmscan --noali --acc --cut_ga /home/apoorva004/apoorvahomo/get_homologues-x86_64-20210828/db/Pfam-A.hmm)

DEFBLASTNTASK=megablast DEFEVALUE=10
MINBLUNTBLOCK=100 MAXSEQNAMELEN=60
MAXMISMCOLLAP=0 MAXGAPSCOLLAP=2

./annotate_cluster.pl -f ./sample_intersection_29thMarch/149064_gseA.faa -r -o ./sample_intersection_29thMarch/149064_gseA1.aln.faa -P 0 -b 0 -D 1 -c 0 -A -B
Use of uninitialized value $1 in hash element at ./annotate_cluster.pl line 188.

total sequences: 1 taxa: 1

Pfam domains: PF13365,
Pfam annotation: Trypsin-like peptidase domain;
executeFORMATDB : cannot find input FASTA file /tmp/hcihZ_Yy98
So basically it does annotate my cluster, but it does not generate an alignment file. Could you please suggest what I can do to correct this? Thanks.

Hi @apoorva004 , can you please share your -f input file?

149064_gseA.zip
Hi, here is the input FAA file. I want to run this analysis on multiple files of interest (gene clusters belonging to the core/soft-core genome), so I would like to obtain a well-formed output that can be sorted and analyzed programmatically. Thanks.

Hi @apoorva004 , using the latest version of the script I get the following output:

perl annotate_cluster.pl -f 149064_gseA.faa -P -D

Checking required binaries and data sources, all set in phyTools.pm :
...

# DEFBLASTNTASK=megablast DEFEVALUE=10
# MINBLUNTBLOCK=100 MAXSEQNAMELEN=60
# MAXMISMCOLLAP=0 MAXGAPSCOLLAP=2

# annotate_cluster.pl -f 149064_gseA.faa -r  -o  -P 1 -b 0 -D 1 -c 0 -A  -B 

# total   sequences: 1 taxa: 1
# longest sequence: 262 (ID:FHFMMBKL_00156)
# Need at least two input sequences, exit.

So it seems in this case the cluster cannot be aligned because it contains one sequence only, right?
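
Since the script reports that it needs at least two input sequences to build an alignment, one way to check this up front is to count the FASTA header lines in each cluster file. The following is only a minimal sketch: the function names count_seqs and alignable_clusters are hypothetical helpers and are not part of get_homologues.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of get_homologues.

# Count FASTA records in a file by counting header lines (those starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

# Print only the cluster files that hold at least two sequences,
# i.e. the ones that can actually be aligned.
# Usage: alignable_clusters file1.faa file2.faa ...
alignable_clusters() {
    local f
    for f in "$@"; do
        [ -r "$f" ] || continue              # skip missing or unreadable files
        if [ "$(count_seqs "$f")" -ge 2 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `alignable_clusters *.faa > list.cluster.filenames` would write the names of all multi-sequence clusters in the current folder to a list file.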

We frequently call annotate_cluster.pl from a bash script to align all clusters and save the resulting FASTA files with -o. Let me know if you want me to share that one-liner.
Bruno

Hi @brunocontrerasmoreira,
I want to annotate these clusters using the Pfam database. Is it possible to use your script to do that? As you can see from my previous attempt, I was able to get the annotation, just not the alignment file, so I would like to have that output as well. Please guide me on this. Thanks.

Hi, I have updated the script (see 0e93a43) so that you can annotate Pfam domains in all clusters, even in singletons. If you don't need the local alignments in FASTA format you don't need option -o; without it they are still printed to stdout. So you might as well save them to separate files, just in case; your output will then be cleaner and easier to parse. You can try something like this (if you have 30 CPU cores):

cd folder_with_fasta_clusters
cat list.cluster.filenames  | parallel --gnu -j 30 ~/soft/get_homologues/annotate_cluster.pl -f {} -P -o ../folder_aligned_clusters/{} ::: &> ../../log.aln
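
If GNU parallel is not available, the same idea can be sketched as a plain sequential loop. This is only an illustration under assumptions: the ANNOTATE variable, the annotate_all function and the per-cluster .log files are hypothetical names, and the flags are copied unchanged from the one-liner above.

```shell
#!/usr/bin/env bash
# Hypothetical sequential alternative to the parallel one-liner.
# ANNOTATE defaults to the script path used above; override it as needed.
ANNOTATE="${ANNOTATE:-$HOME/soft/get_homologues/annotate_cluster.pl}"

# Usage: annotate_all LIST_FILE OUTPUT_DIR
# LIST_FILE holds one cluster filename per line.
annotate_all() {
    local list="$1" outdir="$2" f
    mkdir -p "$outdir"
    while IFS= read -r f; do
        [ -n "$f" ] || continue              # skip blank lines
        # Flags (-f, -P, -o) are the ones used in the one-liner above.
        # stdin is redirected so the child cannot consume the list;
        # stdout/stderr go to one log per cluster for easier parsing.
        "$ANNOTATE" -f "$f" -P -o "$outdir/$f" \
            < /dev/null > "$outdir/$f.log" 2>&1 \
            || echo "failed: $f" >&2
    done < "$list"
}
```

Running `annotate_all list.cluster.filenames ../folder_aligned_clusters` from inside the folder with the FASTA clusters would then process one cluster at a time, which is slower than the parallel version but keeps a separate log per cluster.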