Jovian is a pipeline for assembling metagenomics/viromics samples from raw paired-end Illumina FastQ data, intended for batch-wise data analysis, e.g. analyzing an entire sequencing run in one workflow. It performs quality control and data cleanup, removal of human host data to facilitate GDPR compliance, and assembly of the cleaned reads into larger scaffolds with a focus on full viral genomes. All scaffolds are taxonomically annotated, and certain viral families, genera and species are genotyped to the (sub)species and/or cluster level. Any taxonomically ambiguous scaffolds that cannot be resolved by Lowest Common Ancestor (LCA) analysis are reported for manual inspection.
It is designed to run on High-Performance Computing (HPC) infrastructures, but can also run locally on a standalone (Linux) computer if needed. It depends on `conda` and `singularity` (explained here) and on these databases. Jovian uses `singularity` to facilitate mobility of compute.
A distinguishing feature is its ability to generate an interactive report that empowers end-users to perform their own analyses. An example is shown here. This report contains an overview of the generated scaffolds and their taxonomic assignments, allows interactive assessment of the scaffolds and the SNPs identified therein, and provides rich interactive visualizations, including QC reports, a Krona chart, taxonomic heatmaps and an interactive spreadsheet to investigate the dataset. Additionally, logging, an audit-trail and acknowledgements are reported.
On first use, the paths to the required databases need to be specified (as explained here):
```
jovian \
    --background {/path/to/background/genome.fa} \
    --blast-db {/path/to/NT_database/nt} \
    --blast-taxdb {/path/to/NCBI/taxdb/} \
    --mgkit-db {/path/to/mgkit_db/} \
    --krona-db {/path/to/krona_db/} \
    --virus-host-db {/path/to/virus_host_db/virushostdb.tsv} \
    --new-taxdump-db {/path/to/new_taxdump/} \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output}
```
These database paths are saved in `~/.jovian_env.yaml` so that they do not need to be supplied for future analyses. Thus, you can start a subsequent analysis with just the input and output directories:
```
jovian \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output}
```
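For reference, the saved file is plain YAML. A minimal sketch of inspecting it is shown below; the key names in the example output are an illustrative assumption and may differ per Jovian version:

```
# Print the database paths Jovian saved on first use
cat ~/.jovian_env.yaml
# Hypothetical output (key names are illustrative only):
#   background: /path/to/background/genome.fa
#   blast-db: /path/to/NT_database/nt
#   krona-db: /path/to/krona_db/
```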
Other command-line parameters can be found here alongside these examples. After the analysis finishes, this can be visualized as described here.
NB, by default Jovian is intended to be used on a grid-computing infrastructure, e.g. a High-Performance Computing (HPC) cluster with a default queue-name called `bio`, and through the DRMAA abstraction layer. If you want to run it on a single computer (i.e. locally), use it on a SLURM system, or change the queue-name, please see the examples here.
Jovian takes as input a folder containing either uncompressed or gzipped Illumina paired-end fastq files with the extensions `.fastq`, `.fq`, `.fastq.gz` or `.fq.gz`. In order to correctly infer the paired-end relationship between the R1 and R2 files, the filenames must follow this regular expression: `(.*)(_|\.)R?(1|2)(?:_.*\.|\..*\.|\.)f(ast)?q(\.gz)?`; essentially, samples must have an identical basename that contains `_R[1|2]`, `.R[1|2]`, `_[1|2]` or `.[1|2]`.
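If you are unsure whether your filenames will be paired correctly, you can test them against this pattern before starting an analysis. A minimal sketch using `grep -E` is shown below (the non-capturing group `(?:...)` is rewritten as a plain group for POSIX ERE compatibility; the input path is a placeholder):

```
# List any fastq files that do NOT match Jovian's pairing pattern;
# prints a confirmation message if every file matches
ls {/path/to/input-directory} \
    | grep -E '\.(fastq|fq)(\.gz)?$' \
    | grep -E -v '(.*)(_|\.)R?(1|2)(_.*\.|\..*\.|\.)f(ast)?q(\.gz)?$' \
    || echo "All fastq filenames match the expected pattern"
```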
Many output files are generated in the specified output folder via `--output`, intended to be visualized as explained here. In line with the FAIR data principles, and in case you want to parse these files yourself, the table below explains the intent and format of each output file.
Foldername | Filename | Format | Brief content description |
---|---|---|---|
root*/ | launch_report.sh | bash script (US-ASCII) | Script required to visualize the data as described here |
root*/results | all_filtered_SNPs.tsv | Tab separated flatfile (US-ASCII) | Conversion of VCF** metrics to text containing a summary of all identified minority variants, per sample and scaffold |
root*/results | all_noLCA.tsv | Tab separated flatfile (US-ASCII) | Scaffolds for which Lowest Common Ancestor Analysis taxonomic assignment was unsuccessful due to incongruent taxon assignment*** |
root*/results | all_taxClassified.tsv | Tab separated flatfile (US-ASCII) | Scaffolds with full taxonomic assignment alongside BLAST E-Value and alignment metrics |
root*/results | all_taxUnclassified.tsv | Tab separated flatfile (US-ASCII) | Scaffolds that could not be taxonomically assigned alongside alignment metrics |
root*/results | all_virusHost.tsv | Tab separated flatfile (US-ASCII) | Scaffolds assigned with host-metadata from NCBI and Mihara et al., 2016 |
root*/results | [Bacteria|Phage|Taxonomic|Virus]_rank_statistics.tsv | Tab separated flatfile (US-ASCII) | No. of unique taxonomic assignments from Superkingdom to species for Bacterial, Phage and Virus assigned scaffolds |
root*/results | igv.html | HTML file (US-ASCII) | Integrative Genomics Viewer (IGVjs, Robinson et al., 2023) index.html tuned for usage as described here |
root*/results | krona.html | HTML file (US-ASCII) | Krona chart (Ondov et al., 2011) depicting metagenomic content |
root*/results | log_conda.txt | Text file (US-ASCII) | Logging of the master conda environment in which the workflow runs, part of the audit trail |
root*/results | log_config.txt | Text file (US-ASCII) | Logging of the workflow parameters, part of the audit trail |
root*/results | log_db.txt | Text file (US-ASCII) | Logging of the database paths, part of the audit trail |
root*/results | log_git.txt | Text file (US-ASCII) | Git hash and github repo link, part of the audit trail |
root*/results | logfiles_index.html | HTML file (US-ASCII) | Collation of the logfiles generated by the workflow, part of the audit trail |
root*/results | multiqc.html | HTML file (US-ASCII) | MultiQC report (Ewels et al., 2016) depicting quality control metrics |
root*/results | profile_read_counts.tsv | Tab separated flatfile (US-ASCII) | Read-counts underlying the Sample_composition_graph.html file listed below |
root*/results | profile_read_percentages.tsv | Tab separated flatfile (US-ASCII) | Read-counts, as percentage, underlying the Sample_composition_graph.html file listed below |
root*/results | Sample_composition_graph.html | HTML file (US-ASCII) | HTML barchart showing the stratified sample composition |
root*/results | samplesheet.yaml | YAML file (US-ASCII) | List of processed samples containing the paths to the input files, part of the audit trail |
root*/results | snakemake_report.html | HTML file (US-ASCII) | Snakemake (Köster et al., 2012) logs, part of the audit trail |
root*/results | Superkingdoms_quantities_per_sample.csv | Comma separated flatfile (US-ASCII) | Intermediate file for the profile_read files listed above |
root*/results/counts | Mapped_read_counts.tsv | Tab separated flatfile (US-ASCII) | Intermediate file with mapped reads per scaffold for the all_tax*.tsv files listed above |
root*/results/counts | Mapped_read_counts-[Sample_name].tsv | Tab separated flatfile (US-ASCII) | Intermediate file with mapped reads per scaffold for the all_tax*.tsv files listed above |
root*/results/heatmaps | [Bacteria|Phage|Superkingdom|Virus]_heatmap.html | HTML file (US-ASCII) | Heatmaps for different taxonomic strata, down to species-level assignment |
root*/results/multiqc_data | several files | Text file (US-ASCII) | Files required for proper functionality of multiqc.html as listed above |
root*/results/scaffolds | [Sample_name]_scaffolds.fasta | FASTA file (US-ASCII) | Scaffolds as assembled by metaSPAdes (Nurk et al., 2017) filtered by minimum length as described here |
root*/results/typingtools | all_[nov|ev|hav|hev|rva|pv|flavi]-TT.csv | Comma separated flatfile (US-ASCII) | Genotyping results from the various typingtools as listed in the publication |
root*/configs/ | config.yaml & params.yaml | YAML file (US-ASCII) | Intermediate configuration and parameter files which are collated in log_config.txt listed above |
root*/data/ | several folders with subfiles | various | Intermediate files, not intended for direct use but kept for audit and debugging purposes |
root*/logs/ | several folders with subfiles | Text files (US-ASCII) | Log-files of all the disparate algorithms used by Jovian which are collated in logfiles_index.html as listed above |
root*/.snakemake/ | several folders with subfiles | various | Only for internal use by Snakemake, not intended for direct use |
* This represents the "root" folder, i.e. the name you supplied to the `--output` flag as listed here.
** Variant Call Format (VCF) explained.
*** Generally this is caused by scaffolds that can be assigned as either a bacterium or a phage, e.g. temperate phages.
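Since most of these outputs are plain flatfiles, they are straightforward to parse from the shell. A minimal sketch is shown below; the column layout is an assumption of this example, so always check the header of your own file first:

```
cd {/path/to/desired-output}/results
# Show the column names of the classified-scaffolds table, one per line
head -n 1 all_taxClassified.tsv | tr '\t' '\n' | cat -n
# Count classified scaffolds per sample, assuming the first column holds
# the sample name -- verify this against the header printed above
tail -n +2 all_taxClassified.tsv | cut -f 1 | sort | uniq -c
```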
```
usage: Jovian [required arguments] [optional arguments]
Jovian: a metagenomic analysis workflow for public health and clinics with interactive reports in your web-browser
NB default database paths are hardcoded for RIVM users, otherwise, specify your own database paths using the optional arguments.
On subsequent invocations of Jovian, the database paths will be read from the file located at: /home/schmitzd/.jovian_env.yaml and you will not have to provide them again.
Similarly, the default RIVM queue is provided as a default value for the '--queuename' flag, but you can override this value if you want to use a different queue.
Required arguments:
--input DIR, -i DIR The input directory containing the raw fastq(.gz) files
--output DIR, -o DIR Output directory (default: /some/path)
Optional arguments:
--reset-db-paths Reset the database paths to the default values
--background File Override the default human genome background path
--blast-db Path Override the default BLAST NT database path
--blast-taxdb Path Override the default BLAST taxonomy database path
--mgkit-db Path Override the default MGKit database path
--krona-db Path Override the default Krona database path
--virus-host-db File Override the default virus-host database path (https://www.genome.jp/virushostdb/)
--new-taxdump-db Path Override the default new taxdump database path
--version, -v Show Jovian version and exit
--help, -h Show this help message and exit
--skip-updates Skip the update check (default: False)
--local Use Jovian locally instead of in a grid-computing configuration (default: False)
--slurm Use SLURM instead of the default DRMAA for grid execution (default: DRMAA)
--queuename NAME Name of the queue to use for grid execution (default: bio)
--conda Use conda environments instead of the default singularity images (default: False)
--dryrun Run the Jovian workflow without actually doing anything to confirm that the workflow will run as expected (default: False)
--threads N Number of local threads that are available to use.
Default is the number of available threads in your system (20)
--minphredscore N Minimum phred score to be used for QC trimming (default: 20)
--minreadlength N Minimum read length to used for QC trimming (default: 50)
--mincontiglength N Minimum contig length to be analysed and included in the final output (default: 250)
```
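For example, to run with stricter QC thresholds than the defaults listed above (the flags are taken from the help text; the values here are arbitrary illustrations):

```
jovian \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output} \
    --minphredscore 30 \
    --minreadlength 75 \
    --mincontiglength 500
```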
If you want to run Jovian through a certain queue-name, use the `--queuename` flag with your own queue-name as specified below. Likewise, if you are using SLURM, provide the `--slurm` flag.
```
jovian \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output} \
    --queuename {your_queue_name} \
    --slurm # only if you are using a SLURM job scheduler
```
If you want to run it on a single computer/laptop, you can use the `--local` flag like so:
```
jovian \
    --local \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output}
```
Similarly, you can opt to build the environments via `conda`, but for proper functionality please use the default mode, which uses `singularity` containers.
```
jovian \
    --conda \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output}
```
When the pipeline has finished an analysis successfully, you can visualize the data via an interactive report as follows:
NB keep this process running for as long as you want to visualize and inspect the data.
```
cd {/path/to/desired-output}
bash launch_report.sh ./
```
Subsequently, open the reported link in your browser and...
- Click 'Jovian_report.ipynb'.
- When presented with popups, click 'Trust'.
- Via the toolbar, press `Cell` and then the `Run all` button and wait for all data to be loaded. If you do not see the interactive spreadsheets, e.g. the "Classified scaffolds" section is empty, that means you need to click the `Run all` button!
  - This is a known bug, pull-requests are very welcome!
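NB if you ran the pipeline on a remote machine (e.g. an HPC head node), the reported link refers to that machine, so you may need an SSH tunnel from your local computer. A minimal sketch, assuming the report is served on port 8888 (use the port shown in the reported link) and a hypothetical `user@hpc-headnode`:

```
# Forward the remote report port to your local machine; replace the port
# and user@host with your own values, then open the reported link locally
ssh -N -L 8888:localhost:8888 user@hpc-headnode
```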
`Jovian` depends on the prerequisites described here and can be downloaded and installed afterwards. After the installation, the required databases can be downloaded as described here.
The workflow will update itself to the latest version automatically. This makes it easier for everyone to use the latest available version without having to manually check the GitHub releases. If you wish to run Jovian without the updater checking for a new release, add the `--skip-updates` flag to your command, as shown below. In this case you will not be notified if there is a new release available.
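For example (paths are placeholders):

```
jovian \
    --skip-updates \
    --input {/path/to/input-directory} \
    --output {/path/to/desired-output}
```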
- Before you download and install Jovian, please make sure Conda is installed on your system and functioning properly! Otherwise, install it via these instructions. Conda is required to build the "main" environment, which contains all required dependencies.
- Jovian is intended for usage with `singularity`; that is the only way we can properly validate functionality of the code, and it helps reduce maintenance. As such, please make sure Singularity, and its dependency Go, are installed properly. Otherwise, install it via these instructions. Singularity is used to build all sub-units of the pipeline. A quick sanity check for both prerequisites is sketched below.
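A minimal sketch to verify that both prerequisites are on your `PATH` (exact version strings will differ per system):

```
# Both commands should print a version string; an error means the
# prerequisite is missing or not on your PATH
conda --version
singularity --version
```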
Use the following command to download the latest release of Jovian and move into the newly downloaded `jovian/` directory:
```
git clone https://github.com/DennisSchmitz/jovian; cd jovian
```
First, make sure you are in the root folder of the Jovian repo. If you followed the instructions above, this is the case.
- Install the proper dependencies:
  ```
  conda create --name Jovian -c conda-forge mamba python=3.9 -y; conda activate Jovian; mamba env update -f mamba-env.yml; conda deactivate
  ```
- Build the Python package:
  ```
  conda activate Jovian; pip install .
  ```
- `Jovian` uses `singularity` by default; this must be installed on your computer or on your HPC by your system-admin. Alternatively, use the `--conda` flag to use `conda`, but only the `singularity` option is validated and supported.
- Follow the steps described in the databases section.
- Jovian is now installed! You can verify the installation by running `Jovian -h` or `Jovian -v`, which should return the help-document or installed version, respectively. You can start Jovian from anywhere on your system as long as the Jovian conda-environment is active. If this environment isn't active, you can activate it with `conda activate Jovian`.
Several databases are required before you can use `Jovian` for metagenomics analyses. These are listed below. Please note, these steps require `Singularity` to be installed as described in the installation section.
NB, for people from the RIVM working on the in-house grid-computer, the following steps have already been performed for you.
- Download the `krona` db. NB this step temporarily requires a large amount of storage space, takes some time to complete and might require you to retry it a couple of times.
  ```
  mkdir /to/desired/db/location/krona_db/; cd /to/desired/db/location/krona_db/
  singularity pull --arch amd64 library://ds_bioinformatics/jovian/krona:2.0.0
  singularity exec --bind "${PWD}" krona_2.0.0.sif bash /opt/conda/opt/krona/updateTaxonomy.sh ./
  singularity exec --bind "${PWD}" krona_2.0.0.sif bash /opt/conda/opt/krona/updateAccessions.sh ./
  rm krona_2.0.0.sif
  ```
- Download the NCBI `nt` and `taxdb` databases. NB update the time-stamp accordingly; list the available time-stamps with `aws s3 ls --no-sign-request s3://ncbi-blast-databases/`. Importantly, use the same time-stamp for both `nt` and `taxdb`.
  - NB this requires `awscli` to be installed.
  ```
  mkdir /to/desired/db/location/nt/; cd /to/desired/db/location/nt/
  aws s3 sync --no-sign-request s3://ncbi-blast-databases/[enter_timestamp_here]/ . --exclude "*" --include "nt.*"
  aws s3 sync --no-sign-request s3://ncbi-blast-databases/[enter_timestamp_here]/ . --exclude "*" --include "taxdb*"
  ```
- Download the `mgkit` database:
  ```
  mkdir /to/desired/db/location/mgkit/; cd /to/desired/db/location/mgkit/
  singularity pull --arch amd64 library://ds_bioinformatics/jovian/mgkit_lca:2.0.0
  singularity exec --bind "${PWD}" mgkit_lca_2.0.0.sif download-taxonomy.sh
  rm taxdump.tar.gz
  wget -O nucl_gb.accession2taxid.gz ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz; wget -O nucl_gb.accession2taxid.gz.md5 https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz.md5; md5sum -c nucl_gb.accession2taxid.gz.md5
  gunzip -c nucl_gb.accession2taxid.gz | cut -f2,3 > nucl_gb.accession2taxid_sliced.tsv; rm nucl_gb.accession2taxid.gz*
  rm mgkit_lca_2.0.0.sif
  ```
- Download the `virus_host_db`:
  ```
  mkdir /to/desired/db/location/virus_host_db/; cd /to/desired/db/location/virus_host_db/
  wget -O virushostdb.tsv ftp://ftp.genome.jp/pub/db/virushostdb/virushostdb.tsv
  ```
- Download the NCBI `new_taxdump` database:
  ```
  mkdir /to/desired/db/location/new_taxdump/; cd /to/desired/db/location/new_taxdump/
  wget -O new_taxdump.tar.gz https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz; wget -O new_taxdump.tar.gz.md5 https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz.md5
  if md5sum -c new_taxdump.tar.gz.md5; then tar -xzf new_taxdump.tar.gz; for file in *.dmp; do gawk '{gsub("\t",""); if(substr($0,length($0),length($0))=="|") print substr($0,0,length($0)-1); else print $0}' < ${file} > ${file}.delim; done; else echo "The md5sum does not match new_taxdump.tar.gz! Please try downloading again."; fi
  ```
- Download the HuGo (human genome) reference via:
  - NB this requires `awscli` to be installed.
  ```
  mkdir /to/desired/db/location/HuGo/; cd /to/desired/db/location/HuGo/
  aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/ ./ --exclude "*" --include "genome.fa*"
  # Remove the EBV fasta record in this genome
  gawk '{print >out}; />chrEBV/{out="EBV.fa"}' out=temp.fa genome.fa; head -n -1 temp.fa > nonEBV.fa; rm EBV.fa temp.fa; mv nonEBV.fa genome.fa
  singularity pull --arch amd64 library://ds_bioinformatics/jovian/qc_and_clean:2.0.0
  singularity exec --bind "${PWD}" qc_and_clean_2.0.0.sif bowtie2-build --threads 8 genome.fa genome.fa
  rm qc_and_clean_2.0.0.sif
  ```
Please cite this paper as follows:
#TODO update after publication
This study was financed under the European Union's Horizon 2020 grants COMPARE and VEO (grant nos. 643476 and 874735) and the NWO Stevin prize (Koopmans).
The layout of this README was made using BioSchemas' Computational Workflow schema as a guideline.