OrganelleRef_PBA is a script that performs de novo PacBio assemblies of any organelle (chloroplast or mitochondrial genomes) using several programs.
The different steps are:
- Search for the PacBio organelle reads by sequence homology using BlasR against a related organelle genome. A reference from the same genus or family is preferred, but if the organelle sequence coverage is high (>100X), a reference from the same taxonomic order can be used.
- Assembly of the PacBio reads using Sprai. Sprai is a pipeline that uses WGS-Assembler, but it compares the reads against each other before performing the assembly in order to take the best 20X.
- If the fraction of the Sprai assembly is below some ratio, the script performs a rescaffolding using the whole PacBio set. Otherwise, it skips this step.
- Taking the longest sequence, it finds the origin of the organelle by comparing the sequence with the reference. Additionally, it checks whether there is an overlapping region produced by the circularity of the organelle.
- It checks whether the repeats have been misassembled: it breaks the assembly into LSC, SSC and IR and tries to put them back together, removing possible misassembled fragments.
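
The >100X guideline in the first step can be checked roughly before choosing a reference. The sketch below is not part of OrganelleRef_PBA; the function names are hypothetical. It divides the total read bases in a FASTQ file by the total bases in a FASTA reference, assuming uncompressed files and four-line FASTQ records. The result is only meaningful for reads that are already believed to be organellar, since a whole-genome read set will overestimate the organelle coverage.

```shell
# Total number of bases in a FASTA file (header lines start with '>').
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Total number of bases in a FASTQ file with four lines per record.
fastq_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Integer fold coverage: read bases divided by reference bases.
# The body runs in a subshell so its variables do not leak.
estimate_coverage() (
    read_bases=$(fastq_bases "$1")
    ref_bases=$(fasta_bases "$2")
    if [ "$ref_bases" -eq 0 ]; then
        echo "reference contains no sequence" >&2
        return 1
    fi
    echo $(( read_bases / ref_bases ))
)

# Example (file names are placeholders):
# estimate_coverage organelle_reads.fastq MyReferenceCHL.fasta
```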

The following programs are required:

- BioPerl -- (used to process sequences)
- Seqtk -- (used to change formats fastq/fasta)
- BlastN -- (used for the assembly, find origin and check circularity)
- BlasR -- (used to get the organelle related reads)
- Samtools -- (used to process BlasR output for coverage)
- Bedtools -- (used to calculate coverage for the repeat analysis)
- Sprai -- (used for de-novo assembly)
- WGS-Assembler -- (used for de-novo assembly by Sprai)
- SSPACE-Long -- (used for the rescaffolding)
Note: SSPACE-Long uses getopt, which is not present in the Perl5 corelib. To fix this problem you can install it with cpan Perl4::CoreLibs.
Most of these programs can be installed from repositories (e.g. Blast).
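
Before running the script, it can help to see which tools are already reachable. This is a minimal sketch, not something shipped with the repository: `missing_tools` is a hypothetical helper that prints every command name it cannot find on the PATH. The executable names in the example call are assumptions about a typical installation and may differ on your system.

```shell
# Print each command name given as an argument that is not on the PATH.
# Returns non-zero if at least one is missing.
# The body runs in a subshell so its variables do not leak.
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
)

# Example (executable names are assumptions):
# missing_tools perl seqtk blastn blasr samtools bedtools
```

Tools reported as missing are candidates for the environment variables described in the install section.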
To install the program
git clone https://github.com/aubombarely/Organelle_PBA.git
Once the repository is cloned, you'll need to set the following environment variables if the binaries of these programs are not in your PATH.
export BLASR_PATH=<path_to_BlasR_binary>;
export SAMTOOLS_PATH=<path_to_samtools_binaries>;
export SPRAI_PATH=<path_to_Sprai_scripts>;
export BLAST_PATH=<path_to_blast_binaries>;
export CA_PATH=<path_to_WGS-assembler_binaries>;
export SSPACELONG_PATH=<path_to_SSPACE-Long.pl_script>;
export BEDTOOLS_PATH=<path_to_bedtools_binaries>;
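
A typo in one of these paths is easy to miss. The following sketch, which is not part of the repository, reports each variable from the list above that is unset or that points to a path that does not exist. The function name `check_organelle_pba_env` is hypothetical. An unset variable is only reported, not treated as an error, because it is not needed when the tool is already in the PATH.

```shell
# Variable names are taken from the export lines above.
ORGANELLE_PBA_VARS="BLASR_PATH SAMTOOLS_PATH SPRAI_PATH BLAST_PATH CA_PATH SSPACELONG_PATH BEDTOOLS_PATH"

# Print "unset: NAME" for variables that are not set and
# "missing: NAME=VALUE" for variables whose path does not exist.
# Returns non-zero only if a set variable points to a missing path.
# The body runs in a subshell so its variables do not leak.
check_organelle_pba_env() (
    status=0
    for name in $ORGANELLE_PBA_VARS; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
        elif [ ! -e "$value" ]; then
            echo "missing: $name=$value"
            status=1
        fi
    done
    return "$status"
)

# Example:
# check_organelle_pba_env || echo "fix the paths listed above"
```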
To install using docker
git clone https://github.com/cgjosephlee/Organelle_PBA.git # currently my forked repo
cd Organelle_PBA
# SSPACE is not available in public domain, so
# acquire SSPACE from https://www.baseclear.com/services/bioinformatics/basetools/sspace-longread/
cp /path/to/SSPACE-LongRead.pl vendor/
chmod +x vendor/SSPACE-LongRead.pl
docker build -t aubombarely/organelle_pba .
docker run --rm aubombarely/organelle_pba
mkdir chloro_out
OrganelleRef_PBA -i MySpeciesPacBio.fastq -r MyReferenceCHL.fasta -o chloro_out
# or use docker
docker run --rm -v "$PWD":/data --user "$(id -u):$(id -g)" aubombarely/organelle_pba \
OrganelleRef_PBA -i MySpeciesPacBio.fastq -r MyReferenceCHL.fasta -o chloro_out
You can test the script with the test data. This data is a subset of the Arabidopsis thaliana PacBio data publicly available at SRA with the accession SRR1284093.
cd testdata
gzip -d artha_pacbioSRR1284093_c025k.fastq.gz artha_refchl01_artha.fa.gz
mkdir -p artha_chl
OrganelleRef_PBA -i artha_pacbioSRR1284093_c025k.fastq -r artha_refchl01_artha.fa -o artha_chl
# runs for ~30 min using 40 cores
Note: To speed up the process, you can use multiple threads by passing options such as -b '--nproc=40' -s 'num_threads=40'.
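
To keep the two thread settings in sync, the call can be wrapped in a small function. This is only a sketch under stated assumptions: `run_with_threads` and the `ORGANELLE_PBA_BIN` variable are hypothetical names introduced here, and the `-b` and `-s` values are the ones shown in the note above with the number replaced.

```shell
# Run OrganelleRef_PBA with the same thread count for both options.
# First argument: number of threads. Remaining arguments are passed through.
# ORGANELLE_PBA_BIN (hypothetical) overrides the command name if set.
# The body runs in a subshell so its variables do not leak.
run_with_threads() (
    threads=$1
    shift
    case $threads in
        ''|*[!0-9]*|0*)
            echo "thread count must be a positive integer without leading zeros" >&2
            return 2
            ;;
    esac
    "${ORGANELLE_PBA_BIN:-OrganelleRef_PBA}" "$@" \
        -b "--nproc=$threads" -s "num_threads=$threads"
)

# Example, using all online CPUs:
# run_with_threads "$(getconf _NPROCESSORS_ONLN)" \
#     -i MySpeciesPacBio.fastq -r MyReferenceCHL.fasta -o chloro_out
```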