/ref-guided-assembly-pipeline

Adapting the pipeline described in Lischer & Shimizu (2017, doi:10.1186/s12859-017-1911-6) for my own use.

Repo forked from https://bitbucket.org/HeidiLischer/refguideddenovoassembly_pipelines/

To do list:
1. Test run code
2. Assess reference guided assembly quality
3. Adapt code for SPAdes assembler
4. Compare assembly quality
5. Optimise assembly parameters

#########################################################
#  README -
#  Reference-guided de novo assembly
# ====================================================
# by Heidi Lischer (heidi.lischer@ieu.uzh.ch), 2015/15
#########################################################

We adapted and extended the reference-guided assembly approach from Schneeberger et al. [1]. 
The main idea of this approach is to first map reads against a reference genome of a 
related species to reduce the complexity of de novo assembly within continuously covered 
regions. In a further step, reads with no similarity to the related genome are integrated. 

Reference-guided de novo assembly pipeline:
1. Step: quality/adapter trimming and quality check
2. Step: map reads against reference and define blocks and superblocks
3. Step: de novo assembly within superblocks
4. Step: get non-redundant supercontigs
5. Step: map reads on supercontigs and de novo assemble unmapped reads
6. Step: map reads to all supercontigs and correct them
7. Step: scaffolding and gap closing
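
As a rough illustration of the mapping part of step 2, the sketch below prints 
(or, with DRY_RUN=0, executes) a minimal bowtie2/samtools command sequence. It is 
not taken from the pipeline scripts: the file names, the helper functions "run" and 
"map_reads" and the DRY_RUN switch are assumptions made for this example, and the 
real scripts may use different options.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of read mapping (step 2); not copied from the pipeline.
# All file names are placeholders.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

map_reads() {
    local ref="$1" reads1="$2" reads2="$3" prefix="$4"
    # Build the bowtie2 index for the reference of the related species.
    run bowtie2-build "$ref" "${prefix}_index"
    # Map the paired-end reads against the reference.
    run bowtie2 -x "${prefix}_index" -1 "$reads1" -2 "$reads2" -S "${prefix}.sam"
    # Convert to BAM, sort by coordinate and index.
    run samtools view -b -o "${prefix}.bam" "${prefix}.sam"
    run samtools sort -o "${prefix}.sorted.bam" "${prefix}.bam"
    run samtools index "${prefix}.sorted.bam"
}

map_reads reference.fa reads_1.fq reads_2.fq mapped
```

Run as is, the sketch only lists the five commands, so it can be inspected 
without any of the tools installed.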


[1] http://www.pnas.org/content/early/2011/06/01/1107739108.full.pdf+html?with-ds=yes


 Dependencies
--------------
(in brackets: software versions used)

Third party programs:
- fastqc (v 0.10.1): Fastq quality checking
- trimmomatic (v 0.32): Quality and adapter trimming
- samtools (v 1.3): Tools for alignments in the SAM format
- bcftools (v 1.3): Tools for variant calling and manipulating VCFs and BCFs
- bamtools (v 2.3.0): Tools for alignments in the BAM format
- bedtools (v 2.19.0): Tools for genome arithmetic
- picardtools (v 1.109): Processing alignment files
- bowtie2 (v 2.2.1): NGS alignment tool
- seqtk (v 1.0-r45): Fast Fasta/Fastq manipulation tool
- AMOScmp (v 3.1.0): Comparative genome assembly
- MUMmer (v 3.23): Comparative genome assembly
- GenomeAnalysisTK (v 3.1-1): Genome analysis toolkit
- SOAPdenovo2 (v r240): De novo genome assembler and scaffolder

and one of these de novo assemblers:
- AllPaths-LG (v 51279)
- idba (v 1.1.1)
- abyss-pe (v 1.5.2)
- SOAPdenovo2 (v r240)
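
Before a first run it can help to check which command-line tools are reachable. 
The sketch below is one possible way to do that; the function name "missing_tools" 
is made up for this example, and the executable names (e.g. "nucmer" for MUMmer) 
are assumptions that may differ on your system. Tools that are usually started 
from a jar file or via a full path (trimmomatic, picardtools, GenomeAnalysisTK) 
are left out on purpose.

```shell
#!/usr/bin/env bash
# Report which of the given commands cannot be found on PATH.
missing_tools() {
    local tool missing=()
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing+=("$tool")
        fi
    done
    # Print one missing tool per line; print nothing if all were found.
    if [ "${#missing[@]}" -gt 0 ]; then
        printf '%s\n' "${missing[@]}"
    fi
}

# Executable names are assumptions; adjust them to your installation.
missing_tools fastqc samtools bcftools bamtools bedtools bowtie2 seqtk AMOScmp nucmer
```

An empty output means that every listed executable was found; it does not check 
the installed versions.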


 Additional in-house scripts
-----------------------------
- RemoveShortSeq.jar: Remove short sequences from FASTA/FASTQ and make identifiers unique
- GetBlocks_new.jar: Create blocks and superblocks
- FastaToAmos.jar: Transform FASTA to Amos format
- WriteSoapConfig.jar: Write a SOAP config file
- FastaStats.jar: Output FASTA statistics
- SplitSeqLowCov.jar: Split sequences at low coverage


 Running
---------
In the first lines of the main scripts you have to adapt the paths to your system.
(Everything between: 
 # set variables #########################################
 ...
 #########################################################)
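
After editing that block, a quick check that the configured paths exist can save 
a failed run. The following is a minimal sketch under stated assumptions: the 
variable names "workDir" and "reference" and the function "check_paths" are 
hypothetical and do not necessarily match the names used in the main scripts.

```shell
#!/usr/bin/env bash
# Hypothetical settings; the real variable names are defined in the main scripts.
workDir="${workDir:-$PWD}"
reference="${reference:-$workDir/reference.fa}"

# Report every given path that does not exist; return 1 if any is missing.
check_paths() {
    local path status=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

check_paths "$workDir" "$reference" || echo "Adapt the paths before running the pipeline."
```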


Once the parameters are set, you can run the pipeline as follows:
bash refGuidedDeNovoAssembly_ALLPATHS.sh
or
bash refGuidedDeNovoAssembly_IDBA.sh
or
bash refGuidedDeNovoAssembly_ABYSS.sh
or
bash refGuidedDeNovoAssembly_SOAP.sh
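
If you switch between assemblers often, a small helper can pick the matching 
script. This is only a convenience sketch: the function "script_for_assembler", 
the short assembler keys and the log file name are assumptions for this example; 
only the four script names come from the list above.

```shell
#!/usr/bin/env bash
# Map a short assembler key to the corresponding main script.
script_for_assembler() {
    case "$1" in
        allpaths) echo "refGuidedDeNovoAssembly_ALLPATHS.sh" ;;
        idba)     echo "refGuidedDeNovoAssembly_IDBA.sh" ;;
        abyss)    echo "refGuidedDeNovoAssembly_ABYSS.sh" ;;
        soap)     echo "refGuidedDeNovoAssembly_SOAP.sh" ;;
        *)        echo "unknown assembler: $1" >&2; return 1 ;;
    esac
}

# Example usage (commented out so that nothing is started by accident):
# bash "$(script_for_assembler idba)" 2>&1 | tee refGuided_idba.log
```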