Perl scripts to validate structural variation (SV) events from de-novo assembly.
-
is_{del,ins,dup,inv}.pl
Detect deletions, insertions, duplications, and inversions.
Input is tab-delimted BLAST-like alignment results. Standard output: 1 for valid/supported, 0 for invalid/no support. Prints additional information to standard error. Requires chromosome and coordinate of supposed breakpoint as arguments. -
is_tra.pl
Detects inter-chromosomal translocations.
Input is BEDPE split-alignments. Standard output: 1 for valid/supported, 0 for invalid/no support. Prints additional information to standard error. Requires chromosme and coordinate of supposed breakpoint as arguments. -
greedy.pl
Greedily finds a subset of non-overlapping hits in both the reference and the contig.
Input hits should be sorted by preferred significance. Requires the interval tree module.
In this example, I choose to favor hits with higher bit score:
$ cat blast.out | sort -k11,11 -g -r | greedy.pl > non_overlapping.out
Note: I will use some in-house scripts from misc and fixpairs.
Say you suspect there is a deletion in chromosome 1, coordinate 50000.
1) Extract reads +/- 1Kbp from the suspected breakpoint
$ get_reads.sh sorted.bam 'chr1:49000-51000'
2) De-novo assemble a contig from sampled reads
fermi-lite is an overlap-based assembler that I found perfect for this purpose.
$ run_fermi.sh r1.fq r2.fq
3) Re-align the contig
yaha is an aligner that is optimized for finding split-mappings.
It will not consider splits on two chromosomes, so use -FBS Y
for translocations and convert the alignment to BEDPE (bedtools). Also, -FBS Y
is necessary for finding duplications.
$ yaha -x ref.yhidx -q frm.fa -o8 yh
4) Infer validity of SV from alignment
$ cat yh | is_del.pl chr1 50000
0
We find that the deletion is not supported by de-novo assembly!
Say you suspect there is a duplication in chromosome 1 at coordinate 5000 and you want to validate it.
1) Extract reads +/- 1Kbp from the target coordinate
1-a) find IDs of the reads within this range:
$ samtools view alignnment_sorted.bam "chr1:4000-6000" | cut -f1 > ids.tmp
1-b) grep for reads with those IDs from original SAM file:
$ LC_ALL=C grep -w -F -f ids.tmp < alignment.sam > subset.sam
1-c) fix header information in the new SAM file:
$ echo -e "$(samtools view -H ../alignment.sam)\n$(cat subset.sam" > subset.sam
1-d) convert new SAM to reads. If paired-end reads:
$ samtools view -b subset.sam | samtools fastq -1 reads_1.fq reads_2.fq -
1-e) fix reads' headers. Here, the headers all start with "@SR":
$ perl -pi.bak -e 's/\@SR.+\S/$&\/1/' reads_1.fq
$ perl -pi.bak -e 's/\@SR.+\S/$&\/2/' reads_2.fq
2) De-novo assemble a contig from the extracted reads
$ velveth output 31 -shortPaired -separate -fastq reads_1.fq reads_2.fq
$ velvetg output -exp_cov auto -cov_cutoff auto
3) BLAST the contig onto a reference sequence
$ blastn -db reference.fasta -query contigs.fa -prec_identity 95 -outfmt 6 -out blast.out
4) Infer validity of duplication from alignment
$ cat blast.out | is_dup.pl
0
We find that the duplication is not supported by de-novo assembly!