From a given taxon sampling and a given seed species compute a normal sequence based phylome and a novel structure based phylome.
The first thing must be installing snakemake and you can easily do this with this conda command
conda create -c conda-forge -c bioconda -n snakemake snakemake hdf5 snakefmt snakedeploy
Then from now one I'd reccomend to manage further dependencies with conda using the --sdm conda
flag in snakemake. Otherwise you can find in workflow/envs/
different yaml files to recreate what is needed at each step.
The only dependency not automatically managed in the pipeline is gsutil: follow these instructions for gsutil installation.
First of all you need to download the data, to do this you need an input file with 3 columns (Uniprot ID\tTaxid\tmnemonic), see data/input/Test_set.txt
. You also need a config file where you can specify various parameters, see config/test.yaml
for example. The analysis in the paper were done using config/Hsapopi_test.yaml
.
# Input file
dataset: 'example'
taxids: 'data/input/Test_set.txt'
species_tree: 'data/sptrees/Test.nwk'
seed: ['UP000000625']
root: 'ecoli'
# this is the number of seed genes to run the pipeline
test_seeds: 15
n_examples: 3
Importantly, global parameters that are likely to be used across different datasets are in config/params.yaml
. Note that the values in the first yaml are prioritary to the ones in this last file!
Once you have these you can simply run:
snakemake -s workflow/download_data.smk --configfile config/test.yaml -p -j2 --sdm conda
This first pipeline is necessary to get all input files. From the input table we can download all the pdbs from google and then consider only those entries with mean average quality > params["low_confidence"]
value. These will be moved into the high_cif
folder for each proteome.
You can change the directory where all these data are stored with params["data_dir"]
parameter but I would just use the default one.
To run the actual pipeline you just need as input the species tree and the previous uniprot table file. Once you have these you can decide the target sets and the methodological implementations that you'd like to explore:
combinations: ["3Di_3Di", "aa_QT", "aa_LG", "3Di_GTR", "3Di_FT", "3Di_LLM", "3Di_AF"]
#, "3Di_FTPY"]
# Subset of the first one that requires ML
combinations_ML: ["3Di_3Di", "aa_LG", "3Di_GTR", "3Di_LLM", "3Di_AF"]
Another important parameter is the seed
. You can specify one or a list of many (although this feature is more or less untested) and the pipeline will create one output per seed species.
To run the pipeline you can simply run this command and monitor that everything is more or less running fine.
snakemake --configfile config/test.yaml -p -j2 -k --sdm conda
results/{dataset}/trees/{seed}_unrooted_trees.txt
: has 5 columns (gene ID, target set, alphabet, model and tree text). This file is easily parsable in R or Python to do further analyses. You can find some R scripts to do that inworkflow/scripts/
results/{dataset}/plots/{seed}*pdf
: here you will find various plot that should inform you about the homology search and tree reconstruction quality.results/data
: here you will find very useful files for specific downstream analyses including the fastas and a cif file, a prediction file and the PAE file per protein. You will also find agffs
folder where uniprot gffs are stored and acath
folder where each proteome has a file that connects protein id to Pfam and Gene3D entries. In theids
folder you can also see various phylogenomic useful links in case you would like to benchmark with different homology sets as inferred by OMA, EggNOG or PhylomeDB (among others).
- look at Reseek and maybe implement it
- File structure is a bit complicated and there is a bit of redundancy, for example rerunning blast everytime or the metadata.
- Remove requisite for specifying common and union set. -> rethink runtime plots and in general output plots