nmquijada/tormes

custom protein database

Opened this issue · 4 comments

Hi,
I am analyzing multiple bacterial genomes with very little programming knowledge. The way tormes parses and summarizes the results from all the genomes in tabular files is very helpful!
Tormes now has an option to query the genomes with a custom nucleotide database. But what I have is a protein database... is there any way to do this with tormes? Any other suggestion? In the end, I'd really need a genome X protein sort of table...
Thanks!

Hi @shlomobl

I am afraid that in the current version of tormes, only custom nucleotide databases are supported for gene search as an integrated option. We have included support for custom amino acid database searches in the ongoing development version of the tool, which we hope to release after summer. I will keep you posted.

In the meantime, if you want to use an amino acid database, I can guide you through doing so with blastp, taking advantage of the tormes file hierarchy. Would that be an option for you?
The predicted proteins of your genomes will be in the gene_prediction or annotation directories (depending on the option you used to run the pipeline).

Alternatively, you can add those proteins to the database that prokka uses for annotation and then look for them in the annotation results.

Hi,
Yes, please, I appreciate it!
Especially if results can be summarized in a presence/absence table with all genomes, similar to VFs/AMR.
I guess it is easier to generate a table from BLAST than by adding these genes to annotation?
Thanks!
S.

Hi @shlomobl

Sorry for the late reply. Both running BLAST and adding the proteins to the annotation database are straightforward processes. However, with the latter you would still have to retrieve the information for the genes you are looking for from the annotation results.

If you would like the results to appear in the tormes report, it would require some expertise with R Markdown, which is the language used to generate that report. If you don't have experience with it, I would encourage you to wait a bit until we release the next version, which will allow the use of protein databases for direct "blasting".

In the meantime, if you would like to look for some proteins in your dataset with BLAST, you need to make a blast-formatted database first:

makeblastdb -in my_proteins.faa -title my_prot -out my_db/my_prot -dbtype prot -hash_index

Then, you can run BLASTP on the protein file predicted by prodigal (and/or annotated with prokka). For instance:

blastp -query tormes_output/annotation/genome_01_annotation/genome_01.faa -db my_db/my_prot -out blastp_output.txt -max_target_seqs 1000 -culling_limit <culling limit to be used (>1)> -evalue 1e-25 -num_threads <num of CPUs> -outfmt "6 qseqid sseqid length qstart qend sstart send mismatch gaps pident evalue bitscore slen"
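To repeat this search for every genome in the run, a loop along the following lines might work. This is only a sketch: the function name `run_blastp_all`, the output directory and the `<genome>.blastp.tsv` file naming are hypothetical, and it assumes that every genome has a `<genome>_annotation/<genome>.faa` file laid out like the example path above. The `-culling_limit` option is left out here, so add it back if you need it.

```shell
#!/usr/bin/env bash
# Sketch: run BLASTP once per genome annotated by tormes.
# Assumption: proteins live in <annotation_dir>/<genome>_annotation/<genome>.faa,
# mirroring the example path in this thread. Function and file names are hypothetical.

run_blastp_all() {
    local annotation_dir="$1"   # e.g. tormes_output/annotation
    local db="$2"               # e.g. my_db/my_prot (built with makeblastdb)
    local out_dir="$3"          # directory for the per-genome BLASTP results
    local threads="${4:-1}"     # number of CPUs, defaults to 1
    local faa genome

    mkdir -p "$out_dir"
    for faa in "$annotation_dir"/*_annotation/*.faa; do
        [ -e "$faa" ] || continue            # the glob matched nothing
        genome=$(basename "$faa" .faa)       # genome_01.faa -> genome_01
        blastp -query "$faa" -db "$db" \
            -out "$out_dir/$genome.blastp.tsv" \
            -max_target_seqs 1000 -evalue 1e-25 -num_threads "$threads" \
            -outfmt "6 qseqid sseqid length qstart qend sstart send mismatch gaps pident evalue bitscore slen"
    done
}
```

After sourcing the function, a call such as `run_blastp_all tormes_output/annotation my_db/my_prot blastp_results 8` would leave one tabular file per genome in `blastp_results`.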

# You can add a header to the file with the description of the fields (GNU sed syntax), for instance:
sed -i "1i qseqid\tsseqid\tlength\tqstart\tqend\tsstart\tsend\tmismatch\tgaps\tpident\tevalue\tbitscore\tslen" blastp_output.txt
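For the genome X protein table you asked about, the per-genome BLASTP outputs could be collapsed into a presence/absence matrix with awk. The sketch below is not part of tormes. It assumes one output file per genome, named `<genome>.blastp.tsv` (a hypothetical naming), all in one directory and all in the 13-column format shown above. The function name and the optional identity and coverage cut-offs are also my own assumptions, with coverage computed as alignment length over subject length.

```shell
#!/usr/bin/env bash
# Sketch: build a genome x protein presence/absence table from BLASTP tabular output.
# Assumed columns: 2 = sseqid, 3 = length, 10 = pident, 13 = slen (outfmt used above).
# Assumed layout: <blast_dir>/<genome>.blastp.tsv, one file per genome (hypothetical).

presence_absence_table() {
    local blast_dir="$1"
    local min_id="${2:-0}"     # minimum % identity, 0 = no extra filter
    local min_cov="${3:-0}"    # minimum % coverage of the database protein

    awk -F'\t' -v min_id="$min_id" -v min_cov="$min_cov" '
        BEGIN {
            # Register every genome up front so genomes without hits still get a row.
            for (i = 1; i < ARGC; i++) {
                g = ARGV[i]
                sub(/.*\//, "", g)
                sub(/\.blastp\.tsv$/, "", g)
                genomes[++ng] = g
            }
        }
        FNR == 1 {
            genome = FILENAME
            sub(/.*\//, "", genome)
            sub(/\.blastp\.tsv$/, "", genome)
        }
        $1 == "qseqid" { next }              # skip an optional header line
        NF >= 13 && $13 > 0 {
            cov = 100 * $3 / $13
            if ($10 + 0 >= min_id + 0 && cov >= min_cov + 0) {
                if (!($2 in seen)) { seen[$2] = 1; prots[++np] = $2 }
                hit[genome, $2] = 1
            }
        }
        END {
            printf "genome"
            for (j = 1; j <= np; j++) printf "\t%s", prots[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genomes[i]
                for (j = 1; j <= np; j++)
                    printf "\t%d", (((genomes[i], prots[j]) in hit) ? 1 : 0)
                printf "\n"
            }
        }
    ' "$blast_dir"/*.blastp.tsv
}
```

For example, `presence_absence_table blastp_results 80 80 > presence_absence.tsv` would write the table using 80% identity and 80% coverage, which are arbitrary example values. Note that proteins from the database without a hit in any genome do not get a column in this sketch.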

As I said, I hope we can release the next version soon.
I hope this helps in the meantime and that you can search for your proteins of interest!

Best,
Narciso