#Overview
SCUBAT (Scaffolding Contigs Using BLAT And Transcripts) uses any set of transcripts to identify cases where a transcript is split over multiple genome fragments and attempts to use this information to scaffold the genome.
#Software Requirements
CAP3 - to assemble scaffolds
GNU parallel - to run the CAP3 assemblies in parallel
#Data requirments
- A genome in FASTA format
- A set of transcripts in FASTA format
- A BLAT psl file aligning the transcripts to the genome
#Quick guide
scubat.pl -t genome.fa -q transcripts.fa -p blat_output.psl
To see all options just run scubat.pl
#Details
SCUBAT procedes via 6 steps:
###1) Align the transcripts to the genome (not part of the script):
- Externally - Use BLAT to align the transcripts to the genome with psl output, e.g.
blat -t=dna genome.fa -q=dna transcripts.fa -noHead out.psl
###2) Identify informative split transcripts:
- All transcripts that mapped to more than one contig are flagged. Each mapping is then analysed to identify those that map consecutive non-overlapping sections of the transcript on separate contigs allowing for an overlap buffer zone of up to 20bp.
###3) Create scaffolds:
- Each transcript-contig complex is assembled by orientating the contigs based on the BLAT information and adding 10 N’s in-between the contigs.
###4) Cluster scaffolds into groups and assemble:
- The contigs used in each transcript-contig complex are then cross-referenced and any complexes sharing a contig are grouped. Groups are then assembled using CAP3 with default parameters.
###5) Filter the assemblies:
- The assemblies are parsed to identify successful assembly events. Failed assemblies are identified and the largest complex kept, the remainder removed.
###6) Create new contig set:
- A new contig set is generated using the successful CAP3 assemblies, the remaining complexes and any contig not involved.