WORK IN PROGRESS Workflow to reconstruct multiple metabolic networks in order to compare them.
This workflow is licensed under the GNU GPL-3.0-or-later, see the LICENSE file for details.
These tools are needed:
- Exonerate
- Orthofinder (which needs Diamond, FastME, and MMseqs2)
- Pathway Tools (which needs Blast)
- R
And some python packages:
To run the annotation-based reconstruction, you need to install Pathway Tools. This tool is available at the Pathway Tools website. A command in the package installs it:
aucome --installPWT=path/to/pathway/tools/installer
source ~/.bashrc
You can also pass an option to this command: --ptools=ptools_path
This option lets you choose the path where the ptools-local folder will be installed. PGDBs created by Pathway Tools are stored in this folder.
You should also create the metacyc_XX.X.padmet file (XX.X stands for the MetaCyc version number) and then update the config.txt file of each study. To create this file: first, download the MetaCyc flat files in DAT format from the https://biocyc.org/download.shtml webpage. Second, put all the downloaded DAT files in one directory (named FLAT_DIR here). Third, run this command:
padmet pgdb_to_padmet --pgdb=FLAT_DIR --output=metacyc_XX.X.padmet --version=XX.X --db=Metacyc -v
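The three steps above can be scripted. The following is a minimal sketch, assuming that padmet is available on the PATH and that the MetaCyc flat files end with ".dat". The helper name build_pgdb_to_padmet_command is hypothetical; the flags are copied from the command shown above.

```python
from pathlib import Path


def build_pgdb_to_padmet_command(flat_dir, version):
    """Return the argument list for the pgdb_to_padmet command shown above.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    flat_dir = Path(flat_dir)
    # Assumption: the downloaded MetaCyc flat files end with ".dat".
    if not any(flat_dir.glob("*.dat")):
        raise FileNotFoundError(f"no .dat files found in {flat_dir}")
    return [
        "padmet", "pgdb_to_padmet",
        f"--pgdb={flat_dir}",
        f"--output=metacyc_{version}.padmet",
        f"--version={version}",
        "--db=Metacyc",
        "-v",
    ]
```

The returned list can be passed to subprocess.run(..., check=True) once the flat files are in place.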
- From the Git repository, download the Dockerfile located in recipes/.
- Install docker and docker.io if they are not installed yet.
- Build the AuCoMe Docker image, like this:
docker build -t aucome .
- Start a container from the AuCoMe Docker image and enter it.
- Install Pathway Tools and metacyc_XX.X.padmet.
- Run the AuCoMe commands.
You need to have a Pathway Tools installer in the same directory as the recipe.
From the Git repository:
sudo singularity build aucome.sif Singularity
If you encounter this error:
FATAL: While performing build: while creating squashfs: create command failed: exit status 1: Write failed because No space left on device
FATAL ERROR: Failed to write to output filesystem
This happens because Singularity does not have enough space in its temporary folder, due to the size of the
tools needed by AuCoMe. You can change this folder manually with the SINGULARITY_TMPDIR
variable (the temporary folder must exist), for example:
sudo SINGULARITY_TMPDIR=/home/user/tmp_folder singularity build aucome.sif Singularity
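Before starting a long build, it can help to confirm that the chosen temporary folder exists and has free space. The sketch below is only an illustration: the function name check_tmpdir and the min_free_gb threshold are hypothetical, and the space actually needed by the build is not stated here.

```python
import os
import shutil


def check_tmpdir(path, min_free_gb):
    """Return the free space (in GB) of an existing folder.

    Raises an error if the folder does not exist or has less than
    min_free_gb gigabytes free (threshold chosen by the caller).
    """
    if not os.path.isdir(path):
        raise NotADirectoryError(f"{path} does not exist; create it first")
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        raise OSError(f"only {free_gb:.1f} GB free in {path}")
    return free_gb
```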
Then you can run the container with a command like:
singularity run aucome.sif aucome workflow --run data --filtering --cpu 10
However, using only this command can produce errors because of the compartmentalization of Singularity.
It is better to use the -c option,
which avoids sharing the filesystem with the host,
together with the -B option,
which binds a folder shared between the host and the Singularity container
so that Singularity can also access the data on the host:
singularity run -c -B /path/outside/singularity/to/shared:/path/in/singularity/container aucome.sif aucome workflow --run /path/in/singularity/container/data --filtering --cpu 10
If you have installed all the dependencies, you can simply install AuCoMe with:
pip install aucome
You have to create the working folder for AuCoMe, with the --init argument:
aucome --init=run_ID [-v]
This command will create a folder named "run_ID" inside the working folder. In this "run_ID" folder, the command will create all the folders used during the analysis.
run_ID
├── analysis
    ├── group_template.tsv
├── annotation_based
    ├── PADMETs
    ├── PGDBs
    ├── SBMLs
├── config.txt
├── logs
├── networks
    ├── PADMETs
    ├── SBMLs
├── orthology_based
    ├── 0_Orthofinder_WD
        ├── OrthoFinder
    ├── 1_sbml_orthology
    ├── 2_padmet_orthology
    ├── 3_padmet_filtered
├── structural_check
    ├── 0_specifics_reactions
    ├── 1_blast_results
        ├── analysis
        ├── tmp
    ├── 2_reactions_to_add
    ├── 3_PADMETs
├── studied_organisms
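The layout above can be checked with a few lines of Python. This is a minimal sketch, not part of AuCoMe: the function name missing_entries is hypothetical, and the list of expected names is copied from the top level of the tree above.

```python
from pathlib import Path

# Top-level entries of a run folder, as listed in the tree above.
EXPECTED_ENTRIES = [
    "analysis",
    "annotation_based",
    "config.txt",
    "logs",
    "networks",
    "orthology_based",
    "structural_check",
    "studied_organisms",
]


def missing_entries(run_dir):
    """Return the expected top-level entries that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_ENTRIES if not (run_dir / name).exists()]
```

An empty list means that every top-level entry of the tree is present.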
analysis will store the various analyses of the PADMET files that are in the networks folder.
annotation_based includes three subfolders. The PGDBs folder will contain all the results from Pathway Tools (in DAT format). These results will also be stored as PADMET and SBML files inside PADMETs and SBMLs.
config.txt contains numerous paths used by the script: paths to programs, directories, and databases. It also includes the Pathway Tools and MetaCyc versions.
networks will contain one metabolic network per studied organism, created by AuCoMe, in PADMET and SBML formats stored in two directories (PADMETs and SBMLs). It also includes the panmetabolism of all the studied organisms in PADMET and SBML formats.
orthology_based contains four subfolders. Firstly, the 0_Orthofinder_WD folder will include all the OrthoFinder runs. Secondly, the 1_sbml_orthology folder will contain one subdirectory per studied organism, and each subdirectory includes SBML files with the orthogroups of the other species that OrthoFinder found. Thirdly, the 2_padmet_orthology directory will contain the PADMET files created by the orthology step. Fourthly, the 3_padmet_filtered folder will contain the PADMET files created by the orthology step in which only the robust reactions are kept.
structural_check relies on a genome search for missing Gene-Protein-Reaction associations. All the metabolic networks previously created are compared pairwise. If one metabolic network has a Gene-Protein-Reaction association that another one lacks, a genomic search is performed between the two corresponding genomes: the Gene-Protein-Reaction association of the first metabolic network is used to search for a match in the genome sequence of the second one. This folder contains four subdirectories. Firstly, the 0_specifics_reactions folder will include numerous TSV files listing the Gene-Protein-Reaction associations that are present in one metabolic network and absent from another. Secondly, the 1_blast_results directory will contain the results of the searches between the genomes of the studied organisms and the genes selected in the previous TSV files. Other TSV files, in a different format, will also be created here; they include the results of the genomic search programs (BlastP, TblastN, and Exonerate). Thirdly, the 2_reactions_to_add folder will contain a PADMET form with the reactions to add for each studied organism. Fourthly, the 3_PADMETs folder will include the PADMET files created by the structural step.
studied_organisms: put all the species that you want to study in this folder. For each species, create a folder and put the GenBank file of this species in it. The folder and its file must have the same name, and the GenBank file must end with '.gbk'.
├── studied_organisms
    ├── species_1
        ├── species_1.gbk
    ├── species_2
        ├── species_2.gbk
    ├── species_3
        ├── species_3.gbk
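The naming rule above (one folder per species, holding a GenBank file with the same name and a '.gbk' extension) can be verified before running AuCoMe. The sketch below is an illustration only; the function name check_studied_organisms is hypothetical, and it does not replace the aucome check command.

```python
from pathlib import Path


def check_studied_organisms(folder):
    """Return a list of naming problems found in a studied_organisms folder."""
    problems = []
    for species_dir in sorted(Path(folder).iterdir()):
        if not species_dir.is_dir():
            problems.append(f"{species_dir.name}: not a folder")
            continue
        # The GenBank file must carry the folder name plus '.gbk'.
        expected = species_dir / f"{species_dir.name}.gbk"
        if not expected.is_file():
            problems.append(f"{species_dir.name}: missing {expected.name}")
    return problems
```

An empty list means that every species folder follows the expected layout.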
Warning
Remember to check the versions of Pathway Tools and MetaCyc before running the check command.
Once you have put your species in the studied_organisms folder, a check must be done on the data using:
aucome check --run=ID [--cpu=INT] [-v] [--vv]
This command checks that the data contain no characters that would cause trouble. It also creates the proteome FASTA file from the GenBank file, and fills the 'all' row of analysis/group_template.tsv with all the species from the studied_organisms folder. Finally, if the PGDBs folder of annotation_based already contains folders, it creates the PADMET and SBML files corresponding to these drafts in the PADMETs and SBMLs folders.
A run of Pathway Tools can be launched using the command:
aucome reconstruction --run=ID [--cpu=INT] [-v] [--vv]
├── annotation_based
    ├── PADMETs
        ├── output_pathwaytools_species_1.padmet
        ├── output_pathwaytools_species_2.padmet
        ├── output_pathwaytools_species_3.padmet
    ├── PGDBs
        ├── species_1
            ├── PGDB dat files
            ├── ...
        ├── species_2
            ├── PGDB dat files
            ├── ...
        ├── species_3
            ├── PGDB dat files
            ├── ...
    ├── SBMLs
        ├── output_pathwaytools_species_1.sbml
        ├── output_pathwaytools_species_2.sbml
        ├── output_pathwaytools_species_3.sbml
├── logs
    ├── log_error.txt
    ├── resume_inference.tsv
Using the mpwt package, this command creates the Pathway Tools input files inside the studied_organisms/ directory. Then, for each species that ran correctly in Pathway Tools, a species/ directory containing all the DAT files of the draft metabolic network is created inside annotation_based/PGDBs/. Two other files are also written: output_pathwaytools_species.padmet (in annotation_based/PADMETs/) and output_pathwaytools_species.sbml (in annotation_based/SBMLs/). At the end of the reconstruction step, the resume_inference.tsv file is generated as well. This file helps to detect which species did not run correctly in Pathway Tools.
Orthofinder can be launched using:
aucome orthology --run=ID [-S=STR] [--orthogroups] [--cpu=INT] [-v] [--vv] [--filtering] [--threshold=FLOAT]
├── orthology_based
    ├── 0_Orthofinder_WD
        ├── species_1.faa
        ├── species_2.faa
        ├── species_3.faa
        ├── OrthoFinder
            ├── Results_MonthDay
                ├── Orthogroups
                ├── Orthologues
                ├── ..
    ├── 1_sbml_orthology
        ├── species_1
            ├── output_orthofinder_from_species_2.sbml
            ├── output_orthofinder_from_species_3.sbml
        ├── species_2
            ├── output_orthofinder_from_species_1.sbml
            ├── output_orthofinder_from_species_3.sbml
        ├── species_3
            ├── output_orthofinder_from_species_1.sbml
            ├── output_orthofinder_from_species_2.sbml
    ├── 2_padmet_orthology
        ├── species_1.padmet
        ├── species_2.padmet
        ├── species_3.padmet
    ├── 3_padmet_filtered
        ├── propagation_to_remove.tsv
        ├── reactions_to_remove.tsv
        ├── species_1.padmet
        ├── species_2.padmet
        ├── species_3.padmet
The proteomes of the studied organisms and of the models are then moved to the 0_Orthofinder_WD folder, and OrthoFinder is launched on them. The OrthoFinder results are written to this folder, and orthology_based contains all the metabolic networks reconstructed from orthology.
To ensure that no reactions are missing because of missing gene structures, a genomic search is performed for every reaction that appears in one organism but not in another:
aucome structural --run=ID [--keep-tmp] [--cpu=INT] [-v]
├── structural_check
    ├── 0_specifics_reactions
        ├── species_1_VS_species_2.tsv
        ├── species_1_VS_species_3.tsv
        ├── species_2_VS_species_1.tsv
        ├── species_2_VS_species_3.tsv
    ├── 1_blast_results
        ├── analysis
            ├── species_1_VS_species_2.tsv
            ├── species_1_VS_species_3.tsv
            ├── species_2_VS_species_1.tsv
            ├── species_2_VS_species_3.tsv
        ├── tmp
    ├── 2_reactions_to_add
        ├── species_1.tsv
        ├── species_2.tsv
        ├── species_3.tsv
    ├── 3_PADMETs
        ├── species_1.padmet
        ├── species_2.padmet
        ├── species_3.padmet
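The pairwise comparison behind the species_A_VS_species_B.tsv files can be pictured with the sketch below. It assumes that every ordered pair of distinct species is compared (the tree above shows four such files); the function name comparison_file_names is hypothetical, and the naming pattern is copied from the tree.

```python
from itertools import permutations


def comparison_file_names(species):
    """List one TSV name per ordered pair of distinct species.

    Assumption: a comparison is directional, so A_VS_B and B_VS_A
    are two different files.
    """
    return [f"{a}_VS_{b}.tsv" for a, b in permutations(species, 2)]
```

Under this assumption, the number of files grows as n * (n - 1) for n species, which is worth keeping in mind when choosing --cpu.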
The spontaneous command adds spontaneous reactions to each metabolic network if they complete at least one MetaCyc pathway. You can run it on all the metabolic networks with:
aucome spontaneous --run=ID [--cpu=INT] [-v] [--vv]
├── networks
    ├── PADMETs
        ├── species_1.padmet
        ├── species_2.padmet
        ├── species_3.padmet
    ├── panmetabolism.padmet
    ├── panmetabolism.sbml
    ├── SBMLs
        ├── species_1.sbml
        ├── species_2.sbml
        ├── species_3.sbml
This will output the result inside the networks folder.
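The rule stated above for spontaneous reactions ("added if they complete at least one MetaCyc pathway") can be read as follows. This sketch is one possible interpretation, not AuCoMe's actual implementation: it treats a pathway as completed when every reaction it still lacks is spontaneous. The function name and the data structures (sets of reaction identifiers, a dict of pathways) are hypothetical.

```python
def spontaneous_to_add(network_reactions, spontaneous_reactions, pathways):
    """Return the spontaneous reactions that complete at least one pathway.

    pathways maps a pathway name to the set of reactions it requires.
    """
    network = set(network_reactions)
    spontaneous = set(spontaneous_reactions)
    to_add = set()
    for required in pathways.values():
        missing = set(required) - network
        # The pathway is completed if all its missing reactions are spontaneous.
        if missing and missing <= spontaneous:
            to_add |= missing
    return to_add
```

With this reading, a spontaneous reaction is never added to a pathway that would still lack a non-spontaneous reaction afterwards.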
You can launch the whole workflow with the command:
aucome workflow --run=ID [-S=STR] [--orthogroups] [--keep-tmp] [--cpu=INT] [-v] [--vv] [--filtering] [--threshold=FLOAT]
You can launch group analysis with the command:
aucome analysis --run=ID [--cpu=INT] [--pvclust] [-v]
You must write the groups of species that you want to analyze in the analysis/group_template.tsv file. The first line of the file contains 'all' (it launches the analysis on all the species).
When you create the working folder with --init, the file only contains the 'all' row:
all |
After the check (with the check or workflow command), all the species in your studied_organisms folder are added to this row:
all | species_1 | species_2 | species_3 | species_4 |
Then you can create a new row to add another group. The name of the group goes in the first column, followed by one column per species. You must give at least 2 species.
Example:
all | species_1 | species_2 | species_3 | species_4 |
group_1 | species_1 | species_2 | | |
group_2 | species_1 | species_2 | species_4 |
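The group file can be read and validated with a short script. The sketch below assumes that group_template.tsv is tab-separated (as its extension suggests; the '|' characters above only mark the columns) and that it has already been filled in after the check step. The function name read_groups is hypothetical.

```python
import csv


def read_groups(path):
    """Read group_template.tsv into a dict: group name -> list of species."""
    groups = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or not row[0].strip():
                continue  # skip blank lines
            name = row[0].strip()
            # Shorter rows are padded with empty cells, so drop them.
            species = [cell.strip() for cell in row[1:] if cell.strip()]
            if len(species) < 2:
                raise ValueError(f"group {name!r} needs at least 2 species")
            groups[name] = species
    return groups
```

Note that the file produced by --init, whose 'all' row is still empty, would be rejected by this check.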
This script will create one folder for each group:
├── analysis
    ├── group_template.tsv
    ├── all
    ├── group_1
    ├── group_2
You can compare groups with the command:
aucome compare --run=ID [--cpu=INT] [-v]
This script will read the group_template.tsv file and create a folder containing an upset graph comparing the groups that you selected:
├── analysis
    ├── group_template.tsv
    ├── upgset_graph
        ├── genes.csv
        ├── Intervene_upset.R
        ├── Intervene_upset.svg
        ├── Intervene_upset_combinations.txt
        ├── metabolites.csv
        ├── pathways.csv
        ├── reactions.csv
        ├── tmp_data
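To give an idea of what an upset graph summarizes, the sketch below computes, for each combination of groups, the items found in exactly those groups (one common convention for upset plots). It is an illustration only: it does not read the CSV files listed above, whose format is not described here, and the function name exclusive_intersections is hypothetical.

```python
from itertools import combinations


def exclusive_intersections(sets_by_group):
    """Map each combination of groups to the items found in exactly those groups."""
    names = sorted(sets_by_group)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Items shared by every group of the combination...
            inside = set.intersection(*(set(sets_by_group[n]) for n in combo))
            # ...minus items that also occur in any other group.
            others = [set(sets_by_group[n]) for n in names if n not in combo]
            outside = set().union(*others)
            result[combo] = inside - outside
    return result
```

For example, with reaction sets for two groups, the entry for the pair holds the reactions shared by both, and each single-group entry holds the reactions specific to that group.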