/targeted-phasing-consensus

A workflow for phasing genes using long fragment target capture data

Primary LanguagePython

Visualizing phasing in IGV

targeted-phasing-consensus.sh

The targeted-phasing-consensus.sh script runs phasing on a subset of aligned PacBio Reads of Insert (CCS reads) corresponding to a gene or region of interest, and generates a Quiver-based consensus sequence for each phase.

This workflow was designed for experiments utilizing long fragment target capture, such as the upcoming 6 kb capture protocol using Roche NimbleGen’s® SeqCap EZ Enrichment®.

Installing

Download these two files into the same directory:

% wget https://github.com/lhon/targeted-phasing-consensus/raw/master/targeted-phasing-consensus.sh
% wget https://github.com/lhon/targeted-phasing-consensus/raw/master/faidx.zip

Running targeted-phasing-consensus.sh requires having SMRT Analysis 2.3.0 installed.

Running

  1. In SMRT Portal, use the RS_ReadsOfInsert_Mapping protocol with the following parameters to generate and align single best estimate Reads of Insert per molecule:

    • Minimum Full Passes: 0
    • Minimum Predicted Accuracy: 75
  2. The name of the reference, aligned_reads.bam, and input.fofn from Step 1 can then be used for the phasing step. source the SMRT Analysis environment, and then run targeted-phasing-consensus.sh on a targeted region:

% source /opt/smrtanalysis/etc/setup.sh
% ./targeted-phasing-consensus.sh chr17 41243000 41244200 BRCA1 hg19_M_sorted /opt/smrtanalysis/common/jobs/087/087197/data/aligned_reads.bam /opt/smrtanalysis/common/jobs/087/087197/input.fofn

In the example above, we are looking at a subset of BRCA1 on chr17 between positions 41243000 and 41244200. hg19_M_sorted was the name of the reference in SMRT Portal, and the paths to aligned_reads.bam and input.fofn from job 87197 were specified. The results and intermediate output will be placed in the BRCA1 subdirectory.

Outputs

Some relevant files in the output directory:

  • subset.bam is the subset of the full BAM file containing reads mapping to the region of interest.
  • phase.out is the output from the samtools phase command. Successfully phased regions are delineated by the PS (phase set) marker. M1 describes detected heterogyzote SNPs that passed the filter.
  • consensus0.fasta and consensus1.fasta are the consensus sequences for each phase, generated by Quiver.
  • nucmer.snps shows the SNP differences between the phases.