These scripts are provided to help users visualise their bacterial pangenome dataset in the network analysis suite Graphia. At present, they should be considered beta, so please check that your outputs make sense, and raise any issues or bugs you find.
GraPPLE was initially developed to work with the output of PIRATE, though any gene presc/absc matrix from other suitable tools (Roary, Panaroo, PPanGGOLiN etc.) can be used as input for the pairwise similarity script. Currently, synteny graphs from PIRATE and Panaroo are supported (with conversion needed for PIRATE, see below).
Contact: j.d.harling-lee [at] roslin.ed.ac.uk
- Python 3.6
Currently, each script is run individually, so cloning the repository and running whichever scripts you need is the simplest usage method.
git clone https://github.com/JDHarlingLee/GraPPLE
Calculates the pairwise similarity between genomes and/or genes from a binary matrix
Example: python pw_similarity.py -i binary_presc_absc.tsv -o example1 -r "both" -s "jaccard" -f 0.8 -e 0.8 -t 2
- Requires the gene presc/absc matrix as a binary, tab separated file (see below for help converting file to binary)
- The -r option specifies the run_type, and can be set as "isolates", "genes" or "both".
- Optionally include metadata for genomes or genes (these can also be added later)
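For readers unfamiliar with the metric, the following is a minimal sketch of Jaccard pairwise similarity computed over a binary presc/absc matrix, for genes and then (by transposing) for isolates. It is an illustration only and not the code in pw_similarity.py: the function names, the toy data and the `min_similarity` cut-off are assumptions made for this example and are not taken from the script's options.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return shared / either if either else 0.0


def pairwise_jaccard(profiles, min_similarity=0.0):
    """Return (name_a, name_b, similarity) for each pair at or above the cut-off.

    `min_similarity` is a hypothetical parameter for this sketch.
    """
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(profiles.items()), 2):
        sim = jaccard(vec_a, vec_b)
        if sim >= min_similarity:
            edges.append((name_a, name_b, sim))
    return edges


# Toy binary matrix (hypothetical data): rows are genes, columns are isolates.
isolate_names = ["iso1", "iso2", "iso3", "iso4"]
genes = {
    "geneA": [1, 1, 0, 1],
    "geneB": [1, 1, 0, 0],
    "geneC": [0, 0, 1, 0],
}

# Gene-to-gene similarity: compare rows.
gene_edges = pairwise_jaccard(genes, min_similarity=0.5)

# Isolate-to-isolate similarity: transpose the matrix and compare columns.
isolates = {
    name: [row[i] for row in genes.values()]
    for i, name in enumerate(isolate_names)
}
isolate_edges = pairwise_jaccard(isolates, min_similarity=0.5)
```

Each retained pair corresponds to one weighted edge in a similarity graph, so raising the cut-off produces a sparser network.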
Used to add metadata from a table to a graph in .layout format
Example: python metadata_to_layout.py -l example1_gene_pw_sim.layout -m gene_info.tsv -s pirate_gene_headers.txt -r "copy"
- First column of metadata table must match the names of values in the layout file (the "Node Name")
- If using the gene information from the pangenome tool (e.g. all_alleles.tsv from PIRATE), remember to use the -s option to specify which columns to add
- The -s columns should be a list in .txt format, with one column name per row, e.g. the pirate_gene_headers.txt file provided in the main GraPPLE folder
- Metadata can also be added to a network in Graphia through the GUI, see User Guide
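The matching rules above can be sketched as follows: take the first column of the metadata table as the node name, keep only the columns named in a one-per-row list, and report any node that has no metadata row. This is a hedged illustration rather than the logic of metadata_to_layout.py; the function name `select_metadata`, the column names and the toy values are all hypothetical.

```python
import csv
import io


def select_metadata(table_text, wanted_columns, node_names):
    """Pick the wanted columns for each node; also list nodes with no metadata row."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    unknown = [col for col in wanted_columns if col not in reader.fieldnames]
    if unknown:
        raise ValueError(f"Columns not found in metadata table: {unknown}")
    id_column = reader.fieldnames[0]  # first column must hold the node names
    wanted_nodes = set(node_names)
    selected = {}
    for row in reader:
        node = row[id_column]
        if node in wanted_nodes:
            selected[node] = {col: row[col] for col in wanted_columns}
    missing = sorted(wanted_nodes - set(selected))
    return selected, missing


# Hypothetical metadata table (tab separated) and column list (one name per row).
metadata_tsv = (
    "node_name\tproduct\tlength\tnotes\n"
    "g1\tkinase\t900\tcore\n"
    "g2\ttransposase\t1200\taccessory\n"
)
column_list_txt = "product\nlength\n"

wanted = [line.strip() for line in column_list_txt.splitlines() if line.strip()]
selected, missing = select_metadata(metadata_tsv, wanted, ["g1", "g2", "g3"])
```

Checking the `missing` list before loading a graph is a quick way to catch mismatches between node names and the first metadata column.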
Use [script] --help to see full individual script options.
These scripts are specific to users of PIRATE.
Converts a PIRATE .edge file to a .layout file for loading into Graphia
Example: python edges_to_layout.py -e pangenome.edges -o example_graph
- Calls py_metadata_to_layout.py to add provided metadata to file
- Note that the default behaviour is to group genes where they appear twice, as directionality is not currently supported
- If you want to investigate the synteny graph at a specific id threshold (e.g. 90%), you may need to recreate the graph file, see below.
Utilises adapter scripts from PIRATE to recreate the synteny graph at a particular threshold
Example: bash generate_edges.sh -i PIRATE.all_alleles.90.tsv -o synteny-graph-90 -p /path/to/PIRATE/
- Requires input of a presc/absc matrix at a single threshold (see Other Scripts)
- Requires path to the installation folder of PIRATE (e.g. pip/conda)
- This script creates "allelesAsGeneFamilies" files, as it replaces the gene_family field with alleles in the presc/absc matrix
Provided in the 'scripts' folder are some other useful scripts, including a general script for subsetting PIRATE output files (post_pirate_processing.sh), and a script to create a simple binary file from pangenome tool outputs (gene_matrix_to_binary.py). Again, use [script] -h to see options available.
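As a rough idea of what a binary conversion involves, the sketch below turns a presc/absc table into 0/1 values by treating any non-empty cell as present. It assumes a simplified layout (first column is the gene name, every other column is one isolate, cells hold a locus tag or are empty); real pangenome tool outputs usually carry extra annotation columns that would need dropping first. It is not the code in gene_matrix_to_binary.py, and `to_binary` is a hypothetical name.

```python
import csv
import io


def to_binary(matrix_text, delimiter="\t"):
    """Convert a presc/absc table to 0/1, treating any non-empty cell as present."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    header = next(reader)
    rows = [header]
    for row in reader:
        name, cells = row[0], row[1:]
        # Pad short rows so every gene has one value per isolate.
        cells = cells + [""] * (len(header) - 1 - len(cells))
        rows.append([name] + ["1" if cell.strip() else "0" for cell in cells])
    return "\n".join(delimiter.join(r) for r in rows)


# Hypothetical input: locus tags where a gene is present, empty cells where absent.
raw_matrix = (
    "gene\tiso1\tiso2\tiso3\n"
    "geneA\tiso1_0001\t\tiso3_0007\n"
    "geneB\t\tiso2_0042\t\n"
)
binary_matrix = to_binary(raw_matrix)
print(binary_matrix)
```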
Plots the gene presc/absc profiles of a range of clusters, or a specific set of clusters, with relative metadata for inspection of associations and gene presc/absc patterns. Requires a binary gene presc/absc matrix (-g as .tsv), a list of genes and their clusters (-c as .csv; exported from Graphia) and isolate metadata (-m as .csv; recommended to keep to 3 or fewer categories).
- Use -N to specify the number of metadata groupings to use. This defaults to 11, which is also the maximum, to avoid over-colouring/complicating the plots. Provide either a single number for all, or a comma separated list (e.g. "5,10,3") to specify for each column in the isolate metadata file
- Use -l to give a list of clusters to specifically plot. Provide as a comma separated list (e.g. "1,2,5,10")
- Use -p to give a colour (or list of colours) for the gene cluster profile, as a name or hex code
- Use -s and -e to give start and end cluster numbers (instead of -l). This will plot all clusters between these two numbers
Example: Rscript plot_gene_cluster_profiles.R -g gene_presc_absc_binary.tsv -c gene_clusters_mcli150.csv -m isolate_metadata.csv -o gene_cluster_plots/ -l "1,2,3,4" -N "4,10" -p "blue"
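To make the -N behaviour concrete, here is a small Python sketch of how a single number or a comma separated list could be expanded to one value per metadata column. It only mirrors the description above and is not the R code in plot_gene_cluster_profiles.R; the function name is hypothetical, and both capping values above 11 and raising an error on a length mismatch are assumptions of this sketch.

```python
MAX_GROUPS = 11  # upper limit described for -N


def groupings_per_column(n_option, n_columns):
    """Expand an -N style value to one grouping count per metadata column."""
    values = [int(v) for v in str(n_option).split(",")]
    if len(values) == 1:
        # A single number applies to every metadata column.
        values = values * n_columns
    if len(values) != n_columns:
        raise ValueError(
            f"Expected 1 or {n_columns} values for -N, got {len(values)}"
        )
    # Assumption: values above the limit are capped rather than rejected.
    return [min(v, MAX_GROUPS) for v in values]


print(groupings_per_column("5", 3))     # one value reused for three columns
print(groupings_per_column("4,10", 2))  # one value per column
```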
Tested in R v3.6.3. Requires:
- dplyr
- ggplot2
- cowplot
- optparse
Many of these scripts were initially developed from PIRATE outputs, and some make use of the excellent adapter scripts from the PIRATE repository. We also thank Sion Bayliss for his advice and useful discussions.
Now in press at BMC Bioinformatics