/structural_phylome

Compute a sequence and structure based phylome based on uniprot proteomes.

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

Structural Phylome

From a given taxon sampling and a given seed species compute a normal sequence based phylome and a novel structure based phylome.

Usage

Installation

The first thing must be installing snakemake and you can easily do this with this conda command

conda create -c conda-forge -c bioconda -n snakemake snakemake hdf5 snakefmt snakedeploy

Then from now one I'd reccomend to manage further dependencies with conda using the --sdm conda flag in snakemake. Otherwise you can find in workflow/envs/ different yaml files to recreate what is needed at each step.

The only dependency not automatically managed in the pipeline is gsutil: follow these instructions for gsutil installation.

Data preparation

First of all you need to download the data, to do this you need an input file with 3 columns (Uniprot ID\tTaxid\tmnemonic), see data/input/Test_set.txt. You also need a config file where you can specify various parameters, see config/test.yaml for example. The analysis in the paper were done using config/Hsapopi_test.yaml.

# Input file
dataset: 'example'
taxids: 'data/input/Test_set.txt'
species_tree: 'data/sptrees/Test.nwk'
seed: ['UP000000625']
root: 'ecoli'

# this is the number of seed genes to run the pipeline
test_seeds: 15
n_examples: 3

Importantly, global parameters that are likely to be used across different datasets are in config/params.yaml. Note that the values in the first yaml are prioritary to the ones in this last file!

Once you have these you can simply run:

snakemake -s workflow/download_data.smk --configfile config/test.yaml -p -j2 --sdm conda

This first pipeline is necessary to get all input files. From the input table we can download all the pdbs from google and then consider only those entries with mean average quality > params["low_confidence"] value. These will be moved into the high_cif folder for each proteome.

You can change the directory where all these data are stored with params["data_dir"] parameter but I would just use the default one.

Phylogeny pipeline

To run the actual pipeline you just need as input the species tree and the previous uniprot table file. Once you have these you can decide the target sets and the methodological implementations that you'd like to explore:

combinations: ["3Di_3Di", "aa_QT", "aa_LG", "3Di_GTR", "3Di_FT", "3Di_LLM", "3Di_AF"]
#, "3Di_FTPY"]
# Subset of the first one that requires ML 
combinations_ML: ["3Di_3Di", "aa_LG", "3Di_GTR", "3Di_LLM", "3Di_AF"]

Another important parameter is the seed. You can specify one or a list of many (although this feature is more or less untested) and the pipeline will create one output per seed species.

To run the pipeline you can simply run this command and monitor that everything is more or less running fine.

snakemake --configfile config/test.yaml -p -j2 -k --sdm conda

Outputs

  • results/{dataset}/trees/{seed}_unrooted_trees.txt: has 5 columns (gene ID, target set, alphabet, model and tree text). This file is easily parsable in R or Python to do further analyses. You can find some R scripts to do that in workflow/scripts/
  • results/{dataset}/plots/{seed}*pdf: here you will find various plot that should inform you about the homology search and tree reconstruction quality.
  • results/data: here you will find very useful files for specific downstream analyses including the fastas and a cif file, a prediction file and the PAE file per protein. You will also find a gffs folder where uniprot gffs are stored and a cath folder where each proteome has a file that connects protein id to Pfam and Gene3D entries. In the ids folder you can also see various phylogenomic useful links in case you would like to benchmark with different homology sets as inferred by OMA, EggNOG or PhylomeDB (among others).

TODOs

  • look at Reseek and maybe implement it
  • File structure is a bit complicated and there is a bit of redundancy, for example rerunning blast everytime or the metadata.
  • Remove requisite for specifying common and union set. -> rethink runtime plots and in general output plots