/nf-anvio-pangenome

Analyze a set of genomes with the anvi'o pangenome pipeline

Primary LanguageNextflow

nf-anvio-pangenome

Analyze a set of genomes with the anvi'o pangenome pipeline

The motivation here is to wrap up the set of anvi'o functions which would require some familiarity of the command line, and isolate those within a reproducible workflow which can be executed in a single command. The workflow can also take advanage of cloud computing in case the number of genomes is larger than your computer can easily handle.

Getting Started

The example below assumes the following:

Input Data

In order to run this workflow, you need:

  1. A set of genomes in FASTA format
  2. A text file describing each of those genomes including a name, as well as additional optional metadata

The text file describing the genomes must be comma-delimited, at a minimum including genome and name (genome must be the first column).

NOTE: name must only contain alphanumeric characters (and _), and cannot start with a number. genome must be the complete path to the file not only tests/genomes. For example my file path is /Users/willfrohlich/Desktop/Hutch/G.Vaginalis/tests/genomes/.

genome,name,species
tests/genomes/GCA_004336685.1_ASM433668v1_genomic.fna,ATCC_14018,Gardnerella vaginalis
tests/genomes/GCA_003397665.1_ASM339766v1_genomic.fna,Ugent_09_07,Gardnerella vaginalis
tests/genomes/GCA_003397605.1_ASM339760v1_genomic.fna,Ugent_25_49,Gardnerella vaginalis
tests/genomes/GCA_003397755.1_ASM339775v1_genomic.fna,Ugent_09_01,Gardnerella vaginalis
tests/genomes/GCA_003293675.1_ASM329367v1_genomic.fna,UGent_06_41,Gardnerella leopoldii
tests/genomes/GCA_003397635.1_ASM339763v1_genomic.fna,Ugent_09_48,Gardnerella leopoldii
tests/genomes/GCA_003397585.1_ASM339758v1_genomic.fna,Ugent_18_01,Gardnerella piotii
tests/genomes/GCA_003397615.1_ASM339761v1_genomic.fna,Ugent_21_28,Gardnerella piotii
tests/genomes/GCA_003397705.1_ASM339770v1_genomic.fna,GS_9838_1,Gardnerella swidsinskii
tests/genomes/GCA_003397745.1_ASM339774v1_genomic.fna,GS_10234,Gardnerella swidsinskii

The additional columns to the right will be imported as metadata into the anvi'o pangenome viewer.

Running the Workflow

To run the workflow, make a BASH script containing the following commands:

#!/bin/bash

set -e

# Change "EXAMPLE_OUTPUT" to something that describes this group of genomes
OUTPUT_NAME=EXAMPLE_OUTPUT

# Get the most recent version of the workflow
nextflow pull fredhutch/nf-anvio-pangenome

# Run the workflow
NXF_VER=20.04.1 \
nextflow \
    -c ~/nextflow.config \
    run \
    fredhutch/nf-anvio-pangenome \
    --sample_sheet species_test.csv \
    --output_folder ./ \
    --output_name $OUTPUT_NAME \
    --min_alignment_fraction 0 \
    --category_name species \
    --mcl_inflation 10 \
    -work-dir #insert your work dir
    -process.queue mixed \
    -resume

Picking Parameters

You can use the --mcl_inflation parameter to control how similar two genes have to be for grouping together. The Meren lab recommends using a value of 10 for comparing closely related genomes (for example, strains from the same species), and a value of 2 for comparing distantly related genomes (for example, across species).

You can use the --category_name parameter to select which column of metadata you want to use to identify groups within your genomes and find functions that are enriched in those groups: functions that are characteristic of these genomes, and predominantly absent from genomes from outside this group. This data will be available in your output folder and be titled YOUR_PANGENOME-enriched-functions-category.txt

You can use the --min_alignment_fraction parameter to eliminate ANI scores between two genomes if the alignment fraction is less than you deem trustable. When you set a value, anvi'o will go through the ANI results, and set percent identity scores between two genomes to 0 if alignment fraction between either of them is less than the parameter described here. The default is 0.0, so every hit is reported, but you can choose any value between 0.0 and 1.0.

You can use the --ani_program parameter to select which program is used to compute the average nucleotide identity (ANI), a pairwise measure of similarity between genomes. The default is pyANI (which is more precise, but slower). Other supported options are fastANI and sourmash. The anvi'o documentation related to this parameter can be found here.

You can use the --minbit parameter to adjust the stringency of clustering for groups of amino acid sequences. The anvi'o documentation describes this with:

The minbit score between two sequences goes to 1.0 if they are very similar over the entire
length of the shorter amino acid sequence, and goes to 0.0 if

    (1) they match over a very short stretch compared even to the length of the shorter amino acid sequence, or
    
    (2) the match between sequence identity is low.
    
The default minbit is 0.5, but you can change it using the parameter --minbit.

You can use the --min_occurrence parameter to adjust the threshold used to filter out which genes are used in the analysis. The value of this parameter determines the number of genomes in which a particular gene cluster must be found within for that gene cluster to be included in the downstream analysis. Default: 1.

You can use the --distance parameter to set the distance metric used for clustering. Default: euclidean.

You can use the --linkage parameter to set the linkage method used for clustering. Default: ward.

Visualizing the Pan-Genome

To launch the visual browser for the pan-genome, run the following command (replacing the database files and folder as appropriate):

docker \
    run \
    -p 127.0.0.1:80:8080/tcp \
    -v $PWD:/share \
        meren/anvio:7 \
            anvi-display-pan \
            -g /share/tests/output/$OUTPUT_NAME-GENOMES.db \
            -p /share/tests/output/$OUTPUT_NAME-PAN.db

For more details on how to setup and navigate this visual browser, check out this presentation

Or, check out the amazing Anvi'o documentation

Example Data