For an explanation of the general purpose of HybPiper, and the approach it takes to generate target locus sequences from sequence capture data, please see the documentation, wiki and tutorial at the mossmatters
repo:
- https://github.com/mossmatters/HybPiper/
- https://github.com/mossmatters/HybPiper/wiki
- https://github.com/mossmatters/HybPiper/wiki/Tutorial
To simplify running HybPiper, I’ve provided a Singularity container based on the Linux distribution Ubuntu 22.04, with all the software required to run the HybPiper pipeline (including some additional functionality, see below for details), as well as all the dependencies (BioPython, BLAST, BWA, BBmap, Exonerate, SPAdes, Samtools). The container is called hybpiper-paragone.sif
.
To run HybPiper using this container, I’ve provided a Nextflow pipeline that uses the software in the Singularity container. This pipeline runs all HybPiper steps with a single command. The pipeline script is called hybpiper.nf
. It comes with an associated config file called hybpiper.config
. The only input required is a folder of sequencing reads for your samples, and a target file in .fasta
format. The Nextflow pipeline will automatically generate the namelist.txt
file required by some of the HybPiper scripts, and will run all HybPiper scripts on each sample in parallel. It also includes an optional read-trimmming step to QC your reads prior to running HybPiper, using the software Trimmomatic. The number of parallel processes running at any time, as well as computing resources given to each process (e.g. number of CPUs, amount of RAM etc) can be configured by the user by modifying the provided config file. The pipeline can be run directly on your local computer, and on an HPC system submitting jobs via a scheduler (e.g. SLURM, PBS, etc).
Your target file
, which can contain either nucleotide OR protein sequences (see below for corresponding pipeline options), should follow the formatting described for HybPiper here. Briefly, your target genes should be grouped and differentiated by a suffix in the fasta header, consisting of a dash followed by an ID unique to each gene, e.g.:
>AJFN-4471
AATGTTATACAGGATGAAGAGAAACTGAATACTGCAAACTCCGATTGGATGCGGAAATACAAAGGCT...
>Ambtr-4471
AGTGTTATTCAAGATGAAGATGTATTGTCGACAGCCAATGTGGATTGGATGCGGAAATATAAGGGCA...
>Ambtr-4527
GAGGAGCGGGTGATTGCCTTGGTCGTTGGTGGTGGGGGTAGAGAACATGCTCTATGCTATGCTTTGC...
>Arath-4691
GAGCTTGGATCTATCGCTTGCGCAGCTCTCTGTGCTTGCACTCTTACAATAGCTTCTCCTGTTATTG...
>BHYC-4691
GAAGTGAACTGTGTTGCTTGTGGGTTTCTTGCTGCTCTTGCTGTCACTGCTTCTCCCGTAATCGCTG...
etc...
You will need to provide the hybpiper.nf
pipeline with either:
a) A directory of paired-end forwards and reverse reads (and, optionally, a file of single-end reads) for each sample; or
b) A directory of single-end reads.
For the read files to be recognised by the pipeline, they should be named according to the default convention:
*_R1.fastq
*_R2.fastq
*_single.fastq (optional; will be used if running with the flag `--paired_and_single`)
OR
*_R1.fq
*_R2.fq
*_single.fq (optional; will be used if running with the flag `--paired_and_single`)
NOTE:
- It’s fine if there’s text after the
R1
/R1
and before the.fastq
/.fq
, as long as it's the same for both read files. - It’s fine if the input files are gzipped (i.e. suffix
.gz
). - You can provide a folder of single-end reads only (use the flag
--single_only
if you do). Your read files should be named*_single.fastq
(or.fastq.gz
,.fq
,.fq.gz
). - You can specify a custom pattern used for read file matching via the parameters
--read_pairs_pattern <pattern>
or--single_pattern <pattern>
. - If your samples have been run across multiple lanes, you'll likely want to combine read files for each sample before processing. The pipeline can do this for you - see here for details.
Please see the Wiki entry Running on Linux.
Please see the Wiki entry Running on a Mac.
NOTE: Macs using the new Apple M1 chip are not yet supported.
Please see the Wiki entry Running on a PC.
Example run command:
nextflow run hybpiper.nf -c hybpiper.config -entry assemble -profile standard_singularity --illumina_reads_directory reads_for_hybpiper --targetfile_dna Angiosperms353_targetSequences.fasta
Mandatory arguments:
############################################################################
--illumina_reads_directory <directory>
Path to folder containing illumina read file(s)
AND
--targetfile_dna <file> File containing fasta sequences of target genes
(nucleotides)
OR
--targetfile_aa <file> File containing fasta sequences of target genes
(amino-acids)
#############################################################################
Optional arguments:
-profile <profile> Configuration profile to use. Can use multiple
(comma separated). Available: standard (default),
standard_singularity, slurm_singularity, conda,
conda_slurm
--namelist A text file containing sample names. Only these
samples will be processed, By default, all samples
in the provided <Illumina_reads_directory>
directory are processed
--combine_read_files Group and concatenate read-files via a common prefix.
Useful if samples have been run across multiple lanes.
Default prefix is all text preceding the first
underscore (_) in read filenames
--combine_read_files_num_fields <int>
Number of fields (delimited by an underscore) to use
for combining read files when using the
`--combine_read_files` flag. Default is 1
--num_forks <int> Specify the number of parallel processes (e.g.
concurrent runs of 'hybpiper assemble') to run at any
one time. Can be used to prevent Nextflow from using
all the threads/cpus on your machine. Default is
to use the maximum number possible
--outdir <directory_name>
Specify the name of the pipeline results directory.
Default is 'results'
--paired_and_single Use when providing both paired-end R1 and R2 read
files as well as a file of single-end reads for each
sample
--single_end Use when providing providing only a folder of
single-end reads
--read_pairs_pattern <pattern>
Provide a comma-separated read-pair pattern for
matching fowards and reverse paired-end readfiles,
e.g. '1P,2P'. Default is 'R1,R2'
--single_pattern <pattern>
Provide a pattern for matching single-end read
files. Default is 'single'
#######################################################################################
################################ Trimmomatic options: #################################
#######################################################################################
--use_trimmomatic Trim forwards and reverse reads using Trimmomatic.
Default is off
--trimmomatic_leading_quality <int>
Cut bases off the start of a read, if below this
threshold quality.Default is 3
--trimmomatic_trailing_quality <int>
Cut bases off the end of a read, if below this
threshold quality. Default is 3
--trimmomatic_min_length <int>
Drop a read if it is below this specified length.
Default is 36
--trimmomatic_sliding_window_size <int>
Size of the sliding window used by Trimmomatic;
specifies the number of bases to average across.
Default is 4
--trimmomatic_sliding_window_quality <int>
Specifies the average quality required within the
sliding window. Default is 20
#######################################################################################
############################# hybpiper assemble options: ##############################
#######################################################################################
--bwa Use BWA to search reads for hits to target. Requires
BWA and a target file that is nucleotides!
--diamond Use DIAMOND instead of BLASTx
--diamond_sensitivity Use the provided sensitivity for DIAMOND searches.
Option are: 'mid-sensitive', 'sensitive',
'more-sensitive', 'very-sensitive', 'ultra-sensitive'
--distribute_low_mem Distributing and writing reads to individual gene
directories will be 40-50 percent slower, but can use
less memory/RAM with large input files
--evalue e-value threshold for blastx/DIAMOND hits, default: 0.0001
--max_target_seqs Max target seqs to save in BLASTx search, default is 10
--cov_cutoff <int> Coverage cutoff to pass to the SPAdes assembler.
Default is 8
--single_cell_assembly Run SPAdes assemblies in MDA (single-cell) mode.
Default is False
--kvals Values of k for SPAdes assemblies. SPAdes needs to be
compiled to handle larger k-values! Default is
auto-detection by SPAdes
--thresh Percent identity threshold for retaining Exonerate
hits. Default is 55, but increase this if you are
worried about contaminant sequences
--paralog_min_length_percentage <decimal>
Minimum length percentage of a SPAdes contig vs
reference protein query for a paralog warning to be
generated and a putative paralog contig to be
recovered. Default is 0.75
--depth_multiplier Assign a long paralog as the "main" sequence if it
has a coverage depth <depth_multiplier> times all
other long paralogs. Set to zero to not use depth.
Default is 10
--timeout_assemble Kill long-running gene assemblies if they take longer
than X percent of average
--timeout_exonerate_contigs Kill long-running processes if they take longer than
X seconds. Default is 120
--target Use the target file sequence with this taxon name in
Exonerate searches for each gene. Other targets for
that gene will be used only for read sorting. Can be a
tab-delimited file (one <gene>\t<taxon_name> per line)
or a single taxon name
--exclude Do not use any sequence with the specified taxon name
string in Exonerate searches. Sequenced from this
taxon will still be used for read sorting
--no_stitched_contig Do not create stitched contigs; use longest Exonerate
hit only. Default is off
--chimera_test_memory <int> Memory (RAM) amount in MB to use for bbmap.sh when
peforming stitched-contig chimera tests. Default is
1000 MB
--bbmap_subfilter <int> Ban alignments with more than this many
substitutions when performing read-pair mapping to
supercontig reference (bbmap.sh). Default is 7
--chimeric_stitched_contig_edit_distance <int>
Minimum number of base differences between one read
of a read pair vs the stitched-contig reference for a
read pair to be flagged as discordant. Default is 5
--chimeric_stitched_contig_discordant_reads_cutoff <int>
Minimum number of discordant reads pairs required
to flag a stitched-contig as a potential chimera of
contigs from multiple paralogs. Default is 5
--exonerate_hit_sliding_window_size <int>
Size of the sliding window (in amino-acids) when
trimming termini of Exonerate hits. Default is 3.
--exonerate_hit_sliding_window_thresh <int>
Percentage similarity threshold for the sliding window
(in amino-acids) when trimming termini of Exonerate hits.
Default is 55.
--merged Merge forward and reverse reads, and run SPAdes
assembly with merged and unmerged (the latter
in interleaved format) data. Default is off
--run_intronerate Run the intronerate() function to recover intron
and supercontig sequences. Default is off, and so
fasta files in `subfolders 09_sequences_intron` and
`10_sequences_supercontig` will be empty
--keep_intermediate_files Keep all intermediate files and logs, which can be
useful for debugging. Default action is to delete
them, which greatly reduces the total file number
--no_padding_supercontigs If Intronerate is run, and a supercontig is created
by concatenating multiple SPAdes contigs, do not add
10 "N" characters between contig joins. By default,
Ns will be added
--verbose_logging If supplied, enable verbose login. NOTE: this can
increase the size of the log files by an order of
magnitude
#######################################################################################
####################### hybpiper paralog_retriever options: ###########################
#######################################################################################
--paralogs_list_threshold_percentage PARALOGS_LIST_THRESHOLD_PERCENTAGE
Percent of total number of samples and genes that must
have paralog warnings to be reported in the
<genes_with_paralogs.txt> report file. The default is
0.0, meaning that all genes and samples with at least
one paralog warning will be reported
--figure_length FIGURE_LENGTH
Length dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of genes
--figure_height FIGURE_HEIGHT
Height dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of samples
--sample_text_size SAMPLE_TEXT_SIZE
Size (in points) for the sample text labels in the
output heatmap file. Default is automatically
calculated based on the number of samples
--gene_text_size GENE_TEXT_SIZE
Size (in points) for the gene text labels in the
output heatmap file. Default is automatically
calculated based on the number of genes
--heatmap_filetype {png,pdf,eps,tiff,svg}
File type to save the output heatmap image as. Default
is png
--heatmap_dpi HEATMAP_DPI
Dots per inch (DPI) for the output heatmap image.
Default is 300
#######################################################################################
######################## hybpiper recovery_heatmap options: ###########################
#######################################################################################
--figure_length FIGURE_LENGTH
Length dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of genes
--figure_height FIGURE_HEIGHT
Height dimension (in inches) for the output heatmap
file. Default is automatically calculated based on the
number of samples
--sample_text_size SAMPLE_TEXT_SIZE
Size (in points) for the sample text labels in the
output heatmap file. Default is automatically
calculated based on the number of samples
--gene_text_size GENE_TEXT_SIZE
Size (in points) for the gene text labels in the
output heatmap file. Default is automatically
calculated based on the number of genes
--heatmap_filetype {png,pdf,eps,tiff,svg}
File type to save the output heatmap image as. Default
is *.png
--heatmap_dpi HEATMAP_DPI
Dot per inch (DPI) for the output heatmap image.
Default is 150
Please see the Wiki entry Additional pipeline features and details for further explanation of the parameters above, and general pipeline functionality.
If you're just after the unaligned .fasta
files for each of your target genes (not including putative paralogs), the two main output folders of interest are probably:
07_sequences_dna
08_sequences_aa
If you need the per-sample output folders produced by a standard HybPiper run, these can be found in folder:
04_processed_gene_directories
For a full explanation of output folders and files, please see the Wiki entry Output folders and files.
For details on adapting the pipeline to run on local and HPC computing resources, see here.
Please see the Wiki entry Issues.
31 May 2024
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.7, and updated thehybpiper-nf
script to version 1.0.4 - Fixed an issue when calling
hybpiper retreive_sequences
with an amino-acid target file. - Updated the flag
--use_diamond
to--diamond
to match standalong HybPiper. - Updated the flag
--run_intronerate
to--no_intronerate
to reflect changes made in HybPiper >= 2.1.6
03 July 2023
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.5, and updated thehybpiper-nf
script to version 1.0.3 - Added the new options
exonerate_hit_sliding_window_size
andexonerate_hit_sliding_window_thresh
to thehybpiper assemble
options.
15 May 2023
- Bugfix: BBmap.sh memory for Hybpiper chimera test was set to a default of 1 Mb. Changed to 1000 Mb.
- Update version in hybpiper-nf script to version 1.0.2
14 April 2023
- Updated the HybPiper version in the Singularity container
hybpiper-paragone
to version 2.1.3, and updated thehybpiper-nf
script to version 1.0.1 - Bugfix: get target file basename for HybPiper commands (fixes error when using a target file that isn't in the current working directory)
01 February 2023
- Added a
conda
andconda_slurm
profile. This allows the pipeline to be run using conda packages rather than the Singularity container. The corresponding conda environment is created in the nextflowwork
directory. - Updated the
hybpiper.nf
script to support all native HybPiper 2 parameters. - Added a script version number; view by using the flag
--version
. - Split the main HybPiper
assemble
pipeline, and the HybPiper commandscheck_targetfile
and thefix_targetfile
; these are now separate entry points to thehybpiper.nf
script, accessible by using the parameter-entry
, e.g.-entry assemble
. - Added separate help docs for each entry point, e.g.
-entry assemble --help
.
28 November 2022
- Change repository name from
HybPiper-RBGV
tohybpiper-nf
. - Update the required container name from
hybpiper-yang-and-smith-rbgv.sif
tohybpiper-paragone.sif
. - Update containerised HybPiper version and corresponding Nextflow scripts to HybPiper version 2
09 November 2021
- Nextflow script updated to DSL2. NOTE: Nextflow version >= 21.04.1 is now required!
- Singularity *.def file updated to use Ubuntu 20.04, and to install Muscle and FastTreeMP.