===========================================================================
_____ _ _ _ _
___ _ / ___ \ / _ \ | | | | | |
/ _ \ | | / / \ \ / / \ \ | | | | | |
/ / \ \ __| |__ | | \_| / / \ \ | | | | | |
|| |||__ __| | | / / _ \ \ | | | | | |
| \_/ / | | | | ___ / ___ \ | | | | | |
| __/ | |_ | | |__ | / / \ \ \ \ / / | | _
| | | | \ \ ___ / / / / \ \ \ \ ___ / / | | _____ | |
|_| |__/ \ _____ / /_/ \_\ \ _____ / | _________ |
===========================================================================
This pipeline is used for plastid (chloroplast) genome assembly based on long read data, including both Nanopore and PacBio. It can easily help assemble the complex plastomes with many long repeat regions which cannot be addressed by short read data only. Short reads assembly can sometimes generate many paths due to the long repeat regions (e.g. Juncus). This pipeline is very straitforward with two mendatory arguments (reference and raw long-read data). It usually takes about 10 minutes to assemble a plastome with 16Gb memory and less than 10Gbp sequence data. Our paper is in prep. [Zhou et al. (unpublished)]. We introduced this pipeline in BAGGs workshop at UNC-Chaple Hill.
- New options for ptGAUL: -o output directory; -g genome size; -c coverage.
- New argument for python script: -o output directory.
- Fixed the combine_gfa.py, which can be run automatically.
- First version.
Create a conda environment
conda create --name chloroplast python=3.7
source activate chloroplast
conda install -c bioconda ptgaul
ptGAUL.sh -h
Examples can be applied on Linux and Mac.
The basic arguments in ptGAUL.sh are 1) -r: a plastome from a closely related species (it should work for the references either from the same genus or the same family) and 2) -l: your long read data (any seuquence file in fasta, fastq, and fq.gz format).
If you run 1.0.4 version, the command in the ptGAUL_version directory. Otherwise, combine_gfa.py will not be able to run automatically.
ptGAUL.sh -r [PATH]/[reference_genome]/ -l [PATH]/[long_read_data]
ptGAUL.sh -r /path/Beta.fasta -l /path/SRR1980665.1 -t 8 -f 3000 -o ./ptgaul/
To check all parameters in ptGAUL using:
ptGAUL.sh -h
Usage: ptGAUL.sh -r (REFERENCE FILE) -l (LONG READ FILE)
[-t threads int] [-g genome size int]
[-c coverage int] [-f filter threshold int]
[-o output directory string]
this pipeline is used for plastome assembly using long read data.
optional arguments:
-h, --help <show this help message and exit>
-r, --reference <MANDATORY: reference contigs or scaffolds in fasta format>
-l, --longreads <MANDATORY: raw long reads in fasta/fastq/fq.gz format>
-t, --threads <number of threads, default:1>
-g, --genomesize <expected genome size of plastome (bp), default:160000>
-c, --coverage <a rough coverage of data used for plastome assembly, default:50>
-f, --filtered <the raw long reads will be filtered if the lengths are less than this number (bp); default: 3000>
-o, --outputdir <output directory of results, defult is current directory>
If the edge number does not equal 1 or 3 with abnormal plastid length, You should manually check the assembled data using BANDAGE. When you confirm the edges are three, you can manually run the python script again to get the assembly results including two paths.
combine_gfa.py -e ./PATH_OF_EDGES_FILE/edges.fa -d ./PATH_OF_SORTED_DEPTH_FILE/sorted_depth -o ./
This step will improve your assembly a little, but not too much. Using short reads is highly recommended (see as follows).
install racon using conda.
minimap2 -x ava-ont -t $n $asm $nanopore > ${racon_outdir}/map.paf
racon -t $n $nanopore_fq ${racon_outdir}/map.paf $asm > ${racon_outdir}/asm.racon.fasta
check if ropebwt2 is installed successfully by typing "ropebwt2 -h" in terminal.
check if msbwt is installed successfully by typing "msbwt -h" in terminal.
check if fmlrc is installed successfully by typing "fmlrc -h" in terminal.
Highly recommended steps: use fmlrc for polishing step. It outperforms other polishers.
files illumina_* are the fq.gz file of illumina reads. Change the output path directory "/PATH/msbwt".
gunzip -c $illumina_1r1 $illumina_1r2 $illumina_2r1 $illumina_2r2 | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | msbwt convert /PATH/msbwt
Once you finished msbwt run. $N means thread number. $assembled_cp is assembled plastome from ptGAUL. Change the output path of "/PATH/fmlrc/corrected.fasta"
fmlrc -p $N /PATH/msbwt/comp_msbwt.npy $assembled_cp /PATH/fmlrc/corrected_cp.fasta
(in prep.) Zhou et al., Plastid Genome Assembly Using Long-read data (ptGAUL). DOI: 10.1101/2022.11.19.517194
If you are using fmlrc, please cite Wang, Jeremy R. and Holt, James and McMillan, Leonard and Jones, Corbin D. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinformatics, 2018. 19 (1) 50.