ncov-dehoster

About:

Nextflow pipeline that removes human reads from SARS-CoV-2 Illumina or Nanopore sequencing data. Basic details for the three available pipelines are as follows:

Illumina Paired Pipeline (v0.2.0) - Competitive mapping approach using bwa mem to remove human reads from input fastq file pairs while maintaining as many viral reads as possible.

Nanopore Minimap2 Fastq Pipeline (v0.2.0) - Competitive mapping approach using minimap2 to remove human reads from either input fastq files or barcoded fastq directories while maintaining as many viral reads as possible.

Strict demultiplexing is highly recommended before running (unable to do so after and it improves downstream analyses)
Optional fast5 dehosting available with argument --fast5_directory [dir]

Nanopore Nanostripper Fast5 Pipeline (v0.1.0-Developmental) - Dual mapping approach based around nanostripper and guppy to generate dehosted, demultiplexed fast5 and fastq files from input fast5 files.

Currently Developmental, the nanopore nanostripper dehosting pipeline is slow (full 96 sample run takes 6+ hours) and not setup well for external users which makes it unrecommended to try. You will have to provide a path to the guppy environment(s), the nanostripper environment and the nanostripper tool itself with the following arguments to generate a completely dehosted run and then it still may not work correctly:
- --guppyGPU <path/to/guppyGPU/env>
- --guppyCPU <path/to/guppyCPU/env>
- --nanostripper_env_path <path/to/nanostripper/env
- --nanostripper_tool_path <path/to/nanostripper/tool

Index

About

Changelog

Dependencies and Data

Dependencies
Data Required

Quick Start

Illumina
Nanopore Minimap2 Fastq
Nanopore Nanostripper Fast5 (Developmental)

Pipeline Full Details

Illumina Paired Fastq Dehosting Pipeline
Nanopore Fastq Minimap2 Dehosting and Regeneration Pipeline
Nanopore Fast5 Nanostripper Dehosting and Regeneration Pipeline (Developmental)

Profiles

Other Notes

Changelog Highlights

Release v0.2.0

Better user parameter options
- Added --keep_ref_id <ID> to allow user to pick what reference ID they want to keep in output fastq files
Process and tool version output added

Release v0.1.0

Initial release

Dependencies and Data Required

Dependencies

Illumina Pipeline

csvtk
bwa
pysam
samtools

Nanopore Minimap2 Fastq Pipeline

artic
csvtk
minimap2
ont-fast5-api
pip
pysam

Nanopore Nanostripper Fast5 Pipeline

artic
csvtk
guppy-cpu
guppy-gpu
nanostripper

Data Required

Human Reference Sequence Fasta file

Get the reference genome from NCBI with the following:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
gunzip GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

Note that the newly released human reference genome (from April 2022 I believe) has not yet been tested but should work with the process

Quick Start:

Simple commands to start dehosting along with the dependencies required . Greater detail on what each pipeline entails are found below in the individual Pipeline Details Sections. You can also do nextflow run phac-nml/ncov-dehoster --help to get more info on the available processes

Illumina

Run the Illumina pipeline with the following command:

nextflow run phac-nml/ncov-dehoster -profile conda --illumina --directory <path/to/paired_reads/dir> --human_ref <path/to/reference>

Minimum computational specs required for the Illumina dehosting pipeline are 3+ cpu and 16+G memory. More resources may be needed depending on the size of the input fastq files.

Nanopore

Minimap2 Fastq Pipeline

Run the basic Minimap2 Fastq pipeline with the following command:

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory <path/to/fastqs> --human_ref <path/to/reference> --run_name 'whatever_you_want'

Minimum computational specs required for the minimap2 nanopore fastq host removal pipeline are 2+ cpu and 12+G memory. More resources may be needed depending on the size of the input fastq files.

Nanostripper Fast5 Pipeline (Developmental)

Requires conda due to use of how the guppy environments are input as parameters at the moment (and that they cannot be made with conda as they are proprietary)

Run the full Nanopore Nanostripper pipeline with the following command:

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --nanostripper --fast5_directory <path/to/fast5_pass/> --human_ref <path/to/reference> --run_name 'whatever_you_want' --guppyCPU </path/to/conda_env/guppy-cpu/> --guppyGPU </path/to/conda_env/guppy-gpu/>

You can also just generate dehosted fast5 files with nanostripper with no guppy environment specified. Guppy is proprietary software of ONT Technologies so you must create your own environment for it

Minimum computational specs required for the Nanopore dehosting pipeline are 16+ cpu and 68+G memory. Configuring a GPU is highly recommended to run basecalling or the pipeline will take >12 hours to run the basecalling step (with a GPU it will be about 1-4 hours).

Pipeline Details:

Full details for each pipeline including:

Inputs
Outputs
Running
Process
Extra info

Illumina Paired Fastq Dehosting Pipeline

Inputs

Specify running Illumina dehosting with the --illumina flag.

Other arguments include:

Parameter	Description	Default	Optional
directory	Directory containing paired fastq reads	None	No
human_ref	Fasta formatted human reference genome	None	No
cov2019_ref	Fasta formatted Sars-CoV-2 reference genome	data/nCoV-2019.reference.fasta	No
keep_ref_id	Reference genome header id to keep	MN908947.3	No
keep_min_map_quality	Minimum mapping quality of covid reads to keep	60	No
remove_min_map_quality	Minimum mapping quality of the human reads to remove	0	No
composite_bwa_index	Directory containing BWA indexes for a composite human/viral reference -- **Speeds up analysis if given	None	Yes

Running

Full instructions on how to easily install and run the Illumina dehosting pipeline.

Setup all necessary resources:
- Conda with nextflow installed into an environment
- A copy of the hg38 human reference genome
- Folder with paired fastq or fastq.gz files to dehost

Activate the conda environment and run the pipeline

nextflow run phac-nml/ncov-dehoster -profile conda --illumina --directory <path/to/reads> --human_ref <path/to/hg38.fa>

All subsequent runs can be extremely sped up using the full/path/to/results/humanBWAIndex folder as follows

nextflow run phac-nml/ncov-dehoster -profile conda --illumina --directory <path/to/reads> --human_ref <path/to/hg38.fa> --composite_bwa_index </full/path/to/results/humanBWAIndex/>

Note: At the moment, you need to specify the full path to the made index or the pipeline will error out! But adding this removes the > 1 hour indexing step!

Outputs

Found in ./results/ directory, the outputs for the Illumina pipeline include:

compositeMAPs --> Folder containing the composite mapping sorted BAM files along with each samples flagstats output
dehostedBAMs --> Folder containing the dehosted composite BAM files
dehosted_paired_fastqs --> Folder containing the final dehosted, paired fastq reads (main and final output)
removal_summary --> CSV file containing read removal metrics
process_versions --> YML file containing process names with their tools and versions associated for tracking

Process

Composite genome generation and Indexing

Generate bwa index to allow competitive mapping with bwa mem.
- Inputs:
  - Sars-CoV-2 reference fasta genome
  - Human reference fasta genome
- Outputs:
  - Folder of bwa mem indexes of the composite genome
- Tools:
  - bwa index
Competitive mapping

Use competitive mapping to find human and chimeric reads while retaining known viral reads that may also map weakly to the human reference.
- Inputs:
  - Bwa mem indexes (.amb, .ann, .pac, .sa)
  - Paired fastq reads
- Outputs:
  - Sorted BAM file
  - flagstats output
- Tools:
  - bwa mem
  - samtools sort
Host read removal and quality filtering

Use the viral reference contig name (default MN908947.3) to determine which reads mapped to the human reference with >= the mapping quality score (default 0). Then pull out only proper pairs matching the viral reference and make sure they are above the quality wanted (default 60).
- Inputs:
  - Sorted BAM file
- Outputs:
  - Dehosted BAM file
- Tools:
  - dehost_illumina.py
    - pysam
Regenerate fastq files

Regenerate paired fastq files with samtools fastq
- Inputs:
  - Dehosted BAM file
- Outputs:
  - Dehosted, paired fastq reads files
- Tools:
  - samtools fastq

Additional Info

Using --composite_bwa_index </path/to/composite_index_folder/ will speed up the analysis as the index won't need to be generated
- Example generation: cat human.fa sarsCoV2.fa > composite.fa && bwa index -a bwtsw composite.fa
If you are using a different SarsCoV2 reference genome than the one provided, make sure to match the --keep_ref_id so that the competitive mapping will pull out viral reads
The keep_min_map_quality and remove_min_map_quality can be adjusted depending on the tolerance for human reads. The defaults are set at the most strict that then can be

Host removal status can be checked with Kraken2 using a command similar to the following one each sample in the run:

kraken2 --confidence 0.1 --db PATH/TO/kraken2_covid19_human_db/ --threads <THREADS> --report <SAMPLE>-REPORT.tsv --output <SAMPLE>-kraken.tsv <FASTQ_FILE_IN>

And then checking the total human reads with:

grep -P "\s+9606\s+" -H *-kraken.tsv > human.tsv | wc -l human.tsv

Nanopore Fastq Minimap2 Dehosting and Regeneration Pipeline

Inputs

Specify running Nanopore fastq dehosting using the minimap2 pipeline with both the --nanopore and --minimap2 flags as input.

Other arguments include:

Parameter	Description	Default	Optional
fastq_directory	Directory containing either individual fastq files or barcoded fastq files	None	No
human_ref	Fasta formatted human reference genome	None	Yes - Need to pass --human_minimap2_index instead
composite_minimap2_index	Path to human minimap2 index (*.mmi)	None	Yes - Need to pass --human_ref instead
cov2019_ref	Fasta formatted Sars-CoV-2 reference genome	data/nCoV-2019.reference.fasta	No
run_name	Specify output run name to separate results	None	No
min_length	Minimum fastq read length to keep	400	Yes
max_length	Maximum fastq read length to keep	2400	Yes
min_read_count	Minimum read count required to output results	1	Yes
fast5_directory	Directory of run associated fast5 files to be dehosted	None	Yes
keep_ref_id	Reference genome header id to keep	MN908947.3	No
flat	Output flat fastq_pass folder instead of sample name subdirectories (better for folder input of named files)	None	Yes

Running

Full instructions on how to easily install and run the Nanopore Minimap2 fastq dehosting pipeline.

Setup all necessary resources:
- Conda with nextflow installed into an environment OR all tools installed and available to skip using conda
- A copy of the hg38 human reference genome OR a composite minimap2 index (mmi file) for the human reference genome and Sars-CoV-2 genome
- One of the following as an input to dehost:
  - A directory with barcoded directories containing fastq files
  - A directory with named singular fastq files (one fastq file per sample)
(OPTIONAL) Strictly demultiplex the data (if using barcoded directories as input) before running the pipeline
- Demultiplex the data using guppy_barcoder (proprietary from ONT) and the --require_barcodes_both_ends flag to make sure that barcodes are present on each end of the reads.
  - There are some spots in the genome where this helps in resolution when analyzing the data
- Note that the output fastq files are not able to be demultiplexed anymore so if you wish to strictly demultiplex the data it is advised to do it before running this host-removal pipeline.
  - May be able to be fixed in a later version

Ensure nextflow is available and then run the pipeline

Basic command with conda to generate just dehosted fastq files out:

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory <path/to/reads> --human_ref <path/to/hg38.fa>

Command to generate dehosted fastq and fast5 files along with create a new sequencing_summary file:

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory <path/to/reads> --human_ref <path/to/hg38.fa> --fast5_pass <path/to/fast5>

All subsequent runs can be sped up using the full/path/to/results/compositeMinimapIndex/composite_ref.mmi file as follows:

nextflow run phac-nml/ncov-dehoster -profile conda --nanopore --minimap2 --fastq_directory <path/to/reads> --human_ref <path/to/hg38.fa> --composite_minimap2_index <full/path/to/results/compositeMinimapIndex/composite_ref.mmi>

Note: At the moment, you need to specify the full path to the made index or the pipeline will error out! But adding this removes the long indexing step!

Outputs

Found in ./results/<run_name>/run/ directory, the outputs for the Nanopore pipeline include:

fastq_pass --> Folder containing the finished dehosted fastq files that pass the minimum read check (default: 1) which can be used for any other analyses
- Separated by either barcode## or sample name depending upon input fastq folder structure
- If --flat is passed, all fastq files will be in the main fastq_pass directory instead of named subdirectories.
fast5_pass --> Folder containing the finished dehosted fast5 files that can be used for any other analyses
- Separated by either barcode## or sample name depending upon input fastq folder structure
- Only if --fast5_directory flag is passed
sequencing_summary --> Simplified dehosted sequencing summary output only containing the read name and its specific fast5 file to allow data to be re-ran.
- Only if --fast5_directory flag is passed
removal_summary --> CSV file containing read removal metrics. Found in ./results/<run_name>/ instead
- Shows the number and percentage of reads kept
- Shows if the sample failed the minimum read filter
process_versions --> YML file containing process names with their tools and versions associated for tracking

The output structure is setup as such so that the run_name organizes the sequencing data in the run folder and all analyses can be done in the run_name folder to separate out runs better.

Process

Composite genome generation and indexing with Minimap2

Generate a composite reference genome and index it with minimap2 (skipped if --composite_minimap2_index is given)
- Inputs:
  - Sars-CoV-2 reference fasta genome
  - Human reference fasta genome
- Outputs:
  - composite reference index
- Tools:
  - minimap2
Fastq size selection

Filtering out fastq reads that do not fit the expected size range given (default 400 - 2400)
- Inputs:
  - Fastq files or barcoded directories (based on what is given on the CL)
- Outputs:
  - Size selected fastq file associated with its sample name
- Tools:
  - artic guppyplex
Competitive Mapping

Use competitive mapping to find human and chimeric reads while retaining known viral reads that may also map weakly to the human reference.
- Inputs:
  - Size selected fastq file and its sample name
  - Composite minimap2 reference (*.mmi file)
- Outputs:
  - Sorted BAM file
- Tools:
  - minimap2
  - samtools
Host read removal and quality filtering

Use the viral reference contig name (default MN908947.3) to determine which reads mapped to the human reference with >= the mapping quality score (default 0). Then pull out only reads mapping to the viral reference and make sure they are above the quality wanted (default 60).
- Inputs:
  - Sorted BAM file
- Outputs:
  - Dehosted BAM file
  - Sample specific removal CSV
- Tools:
  - dehost_nanopore.py
    - pysam
Regenerate fastq files

Regenerate paired fastq files with samtools fastq
- Inputs:
  - Dehosted BAM file
- Outputs:
  - Dehosted fastq file
- Tools:
  - samtools fastq
(OPTIONAL) Regenerate fast5 files and the sequencing summary file if --fast5_directory argument is given

Regenerate dehosted fast5 files from the fastq files using fast5_subset
- Inputs:
  - Dehosted fastq files
  - Fast5 directory associated with the run
- Outputs:
  - dehosted fast5 files
  - sequencing_summary.txt
- Tools
  - awk
  - fast5_subset

Additional Info

Providing the composite minimap2 index genereated in a previous analysis will speed other ones

Inputs can be fastq barcode directories or a directory of fastq files

Example barcode directory:

fastq_in
├── barcode01
│   ├── X_pass_barcode01_556763c2_0.fastq
│   ├── X_pass_barcode01_556763c2_10.fastq
│   └── X_pass_barcode01_556763c2_11.fastq
├── barcode02
│   ├── X_pass_barcode02_556763c2_0.fastq
│   ├── X_pass_barcode02_556763c2_10.fastq
│   └── X_pass_barcode02_556763c2_11.fastq
etc.

Example fastq directory:

fastq_in
├── ABC.fastq
├── DEF.fastq
├── GHI.fastq
├── JKL.fastq
etc.

Host removal status can be checked with Kraken2 using a command similar to the following one each sample in the run:

kraken2 --confidence 0.1 --db PATH/TO/kraken2_covid19_human_db/ --threads <THREADS> --report <SAMPLE>-REPORT.tsv --output <SAMPLE>-kraken.tsv <FASTQ_FILE_IN>

And then checking the total human reads with:

grep -P "\s+9606\s+" -H *-kraken.tsv > human.tsv | wc -l human.tsv

Nanopore Fast5 Nanostripper Dehosting and Regeneration Pipeline (Developmental)

Inputs

At the moment, the minimap2 pipeline is recommended due to its ease of use. This one still works, it is just a fair bit slower and a lot harder to set up due to some of the inputs relying on the user to set them.

Specify running Nanopore dehosting with both the --nanopore and --nanostripper parameters as input.

Other parameters include:

Parameter	Description	Default	Optional
fast5_directory	Directory containing barcoded fast5 files (non-barcoded not yet supported)	None	No
human_ref	Fasta formatted human reference genome	None	Yes - Need to pass human_minimap2_index instead
human_minimap2_index	Path to human minimap2 index (*.mmi)	None	Yes - Need human_ref
cov2019_ref	Fasta formatted Sars-CoV-2 reference genome	data/nCoV-2019.reference.fasta	No
run_name	Specify output run name to separate results	None	No
min_length	Minimum fastq read length (based on scheme)	400	Yes
max_length	Maximum fastq read length (based on scheme)	2400	Yes
guppy_cpu	Path to Guppy CPU conda environment	None	Yes - Will only de-host fast5 files
guppy_gpu	Path to Guppy GPU environment (to speed up basecalling only)	None	Yes - Also need to give CPU env

Outputs

Found in ./results/<run_name>/run/ directory, the outputs for the Nanopore pipeline include:

dehosted_fast5 --> Folder containing the dehosted fast5 results of nanostripper for each barcode
fast5_pass --> Folder containing the finished dehosted demultiplexed fast5 files that can be used to run another analysis
fastq_pass --> Folder containing the finished dehosted demultiplexed fastq files that can be used for another analysis
sequencing_summary --> Simplified dehosted sequencing summary output only containing the read name and its specific fast5 file
removal_summary --> CSV file containing read removal metrics. Found in ./results/<run_name>/ instead

The output structure is setup as such so that the run_name organizes the sequencing data in the run folder and all analyses can be done in the run_name folder to separate out runs better.

Process

Nanostipper

Use nanostripper to remove human reads from the input fast5 files with one nanostripper instance running per barcode
- Inputs:
  - Sars-CoV-2 reference fasta genome
  - Human reference fasta genome or Human minimap2 indexed genome (speeds it up as only need to index once)
- Outputs:
  - Folder barcoded and dehosted fast5 files
- Tools:
  - nanostripper
  - minimap2 (inside nanostripper)
Guppy basecalling

Using guppy to re-basecall the remaining reads. This is done so that the most up to date version of Guppy is used and the best results are obtained as not all of the instraments are updated. A guppy environment must be provided as a parameter (at least CPU) to run this step and if none is given, the pipeline will not continue and you will only have the dehosted fast5 files.
- Inputs:
  - Dehosted barcoded fast5 files
- Outputs:
  - Dehosted only fastq files
- Tools:
  - guppy_basecaller CPU or GPU (faster)
Size selection

Filtering out dehosted fastq reads that do not fit the expected size of the primer scheme. All of the dehosted fastq reads are combined into one massive fastq file to run artic guppy_plex on and filter out reads of the wrong size
- Inputs:
  - All dehosted fastq files combined into one big fastq file
- Outputs:
  - Size selected dehosted fastq file (one big one still)
- Tools:
  - artic guppy_plex
demultiplex fastq files

The filtered and re-basecalled fastq files have to be re-demultiplexed incase they are a different barcode than originally thought or to get some of the unclassified reads to a barcode with the improvements from guppy_basecaller.
- Inputs:
  - Size selected dehosted fastq file
- Outputs:
  - fastq_pass - Barcoded folders containing their respective size selected dehosted fastq files
- Tools:
  - guppy_barcoder
Regenerate and demultiplex fast5 files and the simplified sequencing summary

Using the fully finished fastq files in the fastq_pass directory, re-demultiplex the fast5 files to match and generate a new simplified sequencing summary file for the run that just contains the read name and fast5 file its in. This is accomplished with fast5_subset.
- Inputs:
  - fastq_pass folder with barcoded size selected dehosted fastq files
  - Dehosted only fast5 folders from nanostripper
- Outputs:
  - fast5_pass - Barcoded dehosted fast5 files matching the fastq ones
  - sequencing_summary.txt
- Tools:
  - fast5_subset

Additional Info

-profile conda is required due to specifiying guppy environment(s) as a parameter(s)
- if not given, pipeline exits after nanostripper with dehosted fast5 files only
The --guppyCPU parameter is required for demultiplexing so at minimum that environment must be made
We re-basecall to get the benefits of the new guppy changes which improve accuracy in the output fastq files
The --guppyGPU parameter can be given as a way to speed up basecalling but it is not required and if not given, basecalling will be done with the specified --guppyCPU environment

Profiles

Profiles are a set of configuration attributes that can be activated (as many as you want as long as they are comma separated) when launching a pipeline. More information can be found here

If no profile is given, the Illumina and nanopore minimap2 pipelines will still function correctly and will run all processes locally so long as the system has all of the dependencies installed and meets the computational requirements. The nanopore nanostripper pipeline currently requires a guppy conda environment to re-basecall and demultiplex the fast5 files.

Profiles can be activated with -profile <profile_name>. Below are the profiles currently available with this pipeline

Conda

The conda profile can be activated by passing -profile conda.

This profile uses conda environments that are created on the fly or given by the cache parameter as --cache <path/to/conda/envs/> to control the dependencies and run the analysis

If the environments are having trouble being made, it is recommended to use mamba to install them

Custom

Custom profiles are specified in the main nextflow config file and allow for more specific allocation of resources along with other executors to be used (ex. slurm)

ex. NML config uses slurm with specific options

**Additional configs can be added fairly easily by request or on your own fork of the repo

Other Notes:

A fair amount of testing has gone into the pipeline but there may still be bugs or other issues. If you discover any please make an issue.

ErikaDva/ncov-dehoster

ncov-dehoster

About:

Index

Changelog Highlights

Release v0.2.0

Release v0.1.0

Dependencies and Data Required

Dependencies

Illumina Pipeline

Nanopore Minimap2 Fastq Pipeline

Nanopore Nanostripper Fast5 Pipeline

Data Required

Quick Start:

Illumina

Nanopore

Minimap2 Fastq Pipeline

Nanostripper Fast5 Pipeline (Developmental)

Pipeline Details:

Illumina Paired Fastq Dehosting Pipeline

Inputs

Running

Outputs

Process

Additional Info

Nanopore Fastq Minimap2 Dehosting and Regeneration Pipeline

Inputs

Running

Outputs

Process

Additional Info

Nanopore Fast5 Nanostripper Dehosting and Regeneration Pipeline (Developmental)

Inputs

Outputs

Process

Additional Info

Profiles

Conda

Custom

Other Notes: