Installing the analysis environment requires conda to be installed first.
To install the pipeline:
# Get pipeline
git clone https://github.com/rj-price/ont_assembly_starter.git
# Change to directory
cd ont_assembly_starter
# Create conda environment with all dependencies
conda env create -f environment.yml
# Activate environment
conda activate ont_assembly
This assembly pipeline processes Oxford Nanopore long-read data.
The main output from this pipeline is a long-read only assembly with quality metrics.
Environmental variable (change this to actual path):
scripts_dir=/dir/to/ont_assembly_starter/scripts
Used to trim sequencing adapters from ONT reads.
Set the reads and output directories (change these to the actual paths):
reads_dir=/dir/to/reads
out_dir=/dir/to/output
To run on a directory of reads:
sbatch "$scripts_dir"/01_porechop.sh \
"$reads_dir" \
"$out_dir" \
sample_name
Used to produce read QC metrics and plots.
Set the trimmed reads and output directories (change these to the actual paths):
reads_dir=/dir/to/porechop_out
out_dir=/dir/to/output
To run on a single trimmed reads file:
sbatch "$scripts_dir"/02_nanoplot.sh \
"$reads_dir"/sample_name.fastq.gz \
"$out_dir"
Used to remove reads under a specified read length and quality score.
Set the trimmed reads and output directories (change these to the actual paths):
reads_dir=/dir/to/porechop_out
out_dir=/dir/to/output
To run on a single reads file with a minimum read length of 1kb and a minimum quality score of Q12:
sbatch "$scripts_dir"/03_filtlong.sh \
"$reads_dir"/sample_name.fastq.gz \
1000 \
12 \
"$out_dir"
Used to correct and assemble ONT reads into contigs.
Set the filtered reads and output directories (change these to the actual paths):
reads_dir=/dir/to/filtlong_out
out_dir=/dir/to/output
Run assembly for 16Mb genome:
sbatch "$scripts_dir"/04_necat.sh \
"$reads_dir"/sample_name.fastq.gz \
16000000 \
"$out_dir"
Used to polish NECAT assemblies with long reads.
Set the filtered reads, NECAT assembly and output directories (change these to the actual paths):
reads_dir=/dir/to/filtlong_out
necat_dir=/dir/to/necat_out
out_dir=/dir/to/output
Run on NECAT assembly with one iteration:
sbatch "$scripts_dir"/05_racon.sh \
"$reads_dir"/sample_name.fastq.gz \
"$necat_dir"/sample_name_necat.fasta \
1 \
"$out_dir"
Used to polish Racon assemblies with long reads.
Set the filtered reads, Racon assembly and output directories (change these to the actual paths):
reads_dir=/dir/to/filtlong_out
racon_dir=/dir/to/racon_out
out_dir=/dir/to/output
Run on Racon assembly:
sbatch "$scripts_dir"/06_medaka.sh \
"$reads_dir"/sample_name.fastq.gz \
"$racon_dir"/sample_name_necat_racon.fasta \
"$out_dir"
Used as a quality control step to check number of conserved single copy orthologs present in assembly.
Set the assembly and output directories (change these to the actual paths):
medaka_dir=/dir/to/medaka_out
out_dir=/dir/to/output
Run on assembly file with the lineage fungi:
sbatch "$scripts_dir"/XX_busco.sh \
"$medaka_dir"/sample_name_necat_racon_medaka.fasta \
fungi \
"$out_dir"
Used as an assembly quality control step to check number of contigs, assembly size. contigs N50, etc.
Set the assembly and output directories (change these to the actual paths):
medaka_dir=/dir/to/medaka_out
out_dir=/dir/to/output
Run on assembly file:
sbatch "$scripts_dir"/XX_gfastats.sh \
"$medaka_dir"/sample_name_necat_racon_medaka.fasta \
"$out_dir"
Set the reads and output directories (change these to the actual paths):
reads_dir=/dir/to/reads
out_dir=/dir/to/output
Run full ONT assembly pipeline for 16Mb genome:
sbatch "$scripts_dir"/00_assembly.sh \
"$reads_dir" \
16000000 \
"$out_dir"