A Nextflow workflow for bacterial gene order analysis, with outputs easily explorable through its partner visualization application Coeus.
- EXTRACTION: Identifies all genes of interest present in provided assemblies and extracts neighborhoods consisting of
num_neighbors
genes upstream and downstream from a given focal gene. - CLUSTERING: Derives similarity and distance matrices for each AMR gene, and applies three types of clustering algorithms (Hierarchical: UPGMA, Graph-based: MCL, Density-based: DBSCAN) to identify similarities and differences between gene neighborhoods across genomes.
- FILTERING: Collapses identical gene neighborhoods prior to visualization for increased efficiency. It retains one surrogate neighborhood instead of multiple identical ones, and outputs a textfile for each gene that had neighborhoods collapsed to indicate which genome is standing in for which other genomes.
- VISUALIZATION: Pre-computes gene order visualizations, cluster visualizations, and similarity/distance summary histograms. Clustering visualizations can also be dynamically generated with Coeus using similarity and distance matrices calculated through this workflow.
- (TBD) PREDICTION: If analyzing antimicrobial resistance (AMR) genes using annotations from AMR detection software, create gene-order based embeddings in the spirit of word embeddings and train ML classifiers on the representations to predict candidate AMR genes. Optional; must be enabled by the user when invoking workflow.
Gene-Order-Workflow can be run on either:
- Genbank files and their corresponding assemblies.
- Genbank files, their corresponding assemblies, and annotations from external software (e.g. the Resistance Gene Identifier).
- INPUT FILE (.txt)
- If using Genbank files only: input file should be a textfile with one gene name per row for every gene you want to analyze.
geneA geneB ... geneX
- If using Genbank files and annotations: input file should be a textfile indicating the column names (values between '<>' should be replaced with your column names) in your annotation file that correspond to the
Contig
identifier,Gene_Start
,Gene_Stop
, andGene_Name
for your data. An example that can be used with RGI can be found insample_data/rgi_input.txt
.
<contig_col> = Contig <gene_name_col> = Gene_Name <start_col> = Gene_Start <stop_col> = Gene_End
- ASSEMBLIES ( .faa | .fa | .fna ) in a single directory. Should be annotated and share the same locus tags as those found in the Genbank files.
- GENBANK FILES ( .gbk | .gb ) in a single directory. Should correspond to the assembly files.
- ANNOTATIONS (e.g. RGI textfiles) in a single directory.
Parameter | Type | Description |
---|---|---|
input_file_path | Required | Path to input textfile containing either a) gene names to extract if providing Genbank files and assemblies or b) column names of required columns if additionally providing annotations (see examples above). |
assembly_path | Required | Path to directory containing assembly files (.faa, .fa, .fna). |
gbk_path | Required | Path to directory containing Genbank files (.gbk, .gb). |
extract_path | Optional | Path to annotation textfiles (.txt). |
num_neighbors | Optional | Neighborhood size to extract. Should be an even number N, such that N/2 neighbors upstream and N/2 neighbors downstream will be analyzed. Default: 10. |
percent_cutoff | Optional | Cutoff percentage of genomes a gene should be present within to be included in extraction and subsequent analysis. Should a float between 0 and 1 (e.g., 0.25 means only genes present in a minimum of 25% of genomes are kept). Default: 0.25. |
inflation | Optional | Inflation hyperparameter value for Markov Clustering Algorithm. See the algorithm documentation for details. Default: 2. |
epsilon | Optional | Epsilon hyperparameter value for DBSCAN clustering. See the algorithm documentation for details. Default: 0.5. |
minpts | Optional | Minpts hyperparameter value for DBSCAN clustering. See the algorithm documentation for details. Default: 5. |
outdir | Optional | Path to output directory. Default: 'results' within repository. |
-
Install
Nextflow
(>=21.10.3
) -
Install any of
Docker
,Singularity
(you can follow this tutorial),Podman
,Shifter
orCharliecloud
for full pipeline reproducibility (you can useConda
both to install Nextflow itself and also to manage software within pipelines. Please only use it within pipelines as a last resort; see docs). -
Start running your own analysis!
nextflow run main.nf -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \ --input_file_path <path_to_input_file> \ --assembly_path <path_to_assembly_dir> \ --extract_path <path_to_annotation_dir> \ --gbk_path <path_to_genbank_dir> \ --num_neighbors <int_val> --percent_cutoff <float_val>
Please ensure you've formatted your input file correctly for the use case you need (see Data section above).
Some general notes regarding running on HPC environments:
- Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.
- If you are using Singularity, then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead.
- When running Nextflow on HPC environments with the Slurm executor, some have reported persistent SIGBUS errors. If describes you, you may find it helpful to consult this suggested fix.
nf-core/geneorderanalysis was originally written by Julia Lewandowski.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md
file.
You can cite the nf-core
publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.