Preparation of database from OrthoDB
Closed this issue · 9 comments
Hello - My question is similar to the #37. I'm using ProtHint via braker2 and I wanted to ask about the preparation of the protein database starting from OrthoDB. The documentation says:
We recommend to use a relevant portion of OrthoDB protein database as the source of reference protein sequences.
Could you explain how to extract the relevant portion? For example, if I wanted to extract the protozoa proteins, how should I proceed? (Apologies if this question is more suited to OrthoDB).
More in general, how broad or specific should the database be? For example, if I wanted to annotate Plasmodium, would it be better to use all the protozoa proteins or better be more specific? Thanks!
Hi @dariober,
take a look at this section in the readme: https://github.com/gatech-genemark/ProtHint#protein-database-preparation
So in your case, you'd want to use
wget https://v100.orthodb.org/download/odb10_protozoa_fasta.tar.gz
More in general, how broad or specific should the database be? For example, if I wanted to annotate Plasmodium, would it be better to use all the protozoa proteins or better be more specific? Thanks!
In this case, I'd select just proteins from the Apicomplexa phylum. Adding more would probably just increase runtime and maybe add a little noise. However, if you don't want to go through the filtering effort, using all protozoa proteins will work fine as well.
Best,
Tomas
@tomasbruna Thanks for getting back to me. I saw that the protozoa database is available for download. My question is how to create it in the first place. In fact, if I wanted to parse OrthoDB to get the Apicomplexa proteins, how would I proceed?
To prepare the full protozoa database, you just concatenate all fasta files into one big file, like so:
cat protozoa/Rawdata/* > proteins.fasta
That's the "database" you give to BRAKER2.
if I wanted to parse OrthoDB to get the Apicomplexa proteins, how would I proceed
In this case, you would only concatenate protein files of Apicomplexa species. There is one file per species in the protozoa/Rawdata/
folder and the files are named by the species ID
. The species name to species ID
mapping is available here.
Ok, I think I haven't made myself clear but I think I get it... I understand that the database is a fasta file of proteins related to the genome to be annotated. What wasn't clear to me is how you can compile such file in a relatively straightforward way. Your comment:
The species name to species ID mapping is available here.
tells me that one needs to collect all the taxonomy IDs for the species of interest (e.g. all the Apicomplexa) and then query the file odb10v1_all_fasta.tab.gz to get those species (by the way, I didn't realize that the OrthoDB files are named after the taxonomy ID). If that's the way to go (or at least a way to go) that's fine, I was wondering if there was a more immediate solution. Thanks-!
Yes, that's the way to go. The only tedious step is to collect species names/IDs of all Apicomplexa in OrthoDB. You can use OrthoDB's advanced search for that. It will give you a hierarchy looking like this:
If any useful, here are a few commands to query the data files from OrthoDB to extract the protein sequences belonging to a given taxonomic group. For example, we want all the proteins under the "Apicomplexa" group.
First query odb10v1_level2species.tab.gz
to get the taxonomy ID of Apicomplexa:
GROUP="Apicomplexa"
level=`zcat odb10v1_levels.tab.gz | awk -v grp=$GROUP '$2 == grp'`
taxid=`echo $level | cut -d ' ' -f 1`
Then query odb10v1_level2species.tab.gz
to get all the species under the Apicomplexa group. The awk
command may be more convoluted than necessary:
species=`zcat odb10v1_level2species.tab.gz \
| awk -v taxid=$taxid '$4 ~ "{"$taxid"}" ||
$4 ~ "{"taxid"," ||
$4 ~ ","taxid"," ||
$4 ~ ","taxid"}" {print $2}'`
You may want to check that the number of species retrieved is the same as one reported in odb10v1_levels.tab.gz
, 54 in this case:
echo $species | wc -w
echo $level
Prepare a regex of species IDs and query the (big) file odb10v1_all_fasta.tab
. Conveniently, odb10v1_all_fasta.tab
has sequences on a single line instead of being wrapped so we can use grep -A 1 ...
(you may want to check this format doesn't change in the future).
regex=`echo $species | tr ' ' '|'`
grep --no-group-separator -A 1 -P -w "$regex" odb10v1_all_fasta.tab > "$GROUP.fasta"
(Not fully tested - suggestions welcome)
This should now be way easier thanks to https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/ and https://github.com/tomasbruna/orthodb-clades