/digenomitas

Digenome software

Primary LanguageScalaBSD 3-Clause Clear LicenseBSD-3-Clause-Clear

Digenomitas

This repository contains software for analyzing digneome-sequencing data.

License

See the License file in this project.

Pre-requisites

  1. A Java Development Kit v8 or higher available from the Oracle Java site .
  2. A working installation of the Scala Built Tool aka sbt v1.1 or higher. Instructions for installing sbt can be found on the SBT website.
  3. A working internet connection which SBT will use to automatically download dependencies

Building

The software has been built and tested on Linux and Mac OS X. The software should build and run on any operating system that supports Java >= 8 and SBT, but other platforms have not been tested.

To compile and test the software run the following in the same directory as this README file:

sbt clean test

This process may take several minutes the first time it is run as sbt will download needed software libraries from the internet. Repeated executions should take 15-30s. A successful run of the tests will result in lines similar to the following being printed before sbt exits:

[info] All tests passed.
[info] Passed: Total 34, Failed 0, Errors 0, Passed 34
[success] Total time: 12 s, completed Oct 2, 2018 11:53:18 AM

The digenomitas software is compiled into a standalone JAR file for use. To build the JAR file run:

sbt assembly

As with the tests, the last line printed should contain [success], e.g.:

[success] Total time: 17 s, completed Oct 2, 2018 11:54:27 AM

This will produce a JAR file located at ./digenome/target/scala-2.12/digenome.jar.

Running

The primary tool is run by invoking the following command with appropriate options:

java -Xmx8g -jar digenome/target/scala-2.12/digenome.jar IdentifyCutSites

Input sequencing data

The tool expects as it's primary input a BAM file containing sorted, duplicate marked, aligned sequencing reads from a digenome experiment. In our experience the following set of command yields results that work well with the software (though any reasonable WGS alignment pipeline should be compatible):

# Align the data and convert the output to BAM
bwa mem -p -R '<read-group-string>' -t 32 hg38.fa r1.fastq.gz r2.fastq.gz \
  | sambamba view --sam-input -l 1 -t 3 -f bam -o aligned.bam /dev/stdin

# Sort the data into coordinate order
sambamba sort -m 4G --tmpdir . -l 5 -t 16 -o sorted.bam aligned.bam

# Duplicate mark the data
picard -Duse_async_io_write_samtools=true -Xmx8g MarkDuplicates \
  I=sorted.bam O=deduped.bam M=dupe_metrics.txt CREATE_INDEX=true

The tools used in the above commands can be installed easily using conda and the included conda-requirements.txt as follows:

conda create -n digneome -c bioconda --file conda-requirements.txt
conda activate digenome

IdentifyCutSites Options

Name Flag Type Description Required? Max Values Default Value(s)
input i BAM file Input BAM file of aligned, sorted and de-duped reads. Y Unlimited
output o Text file Output tab-delimited file of cut site predictions. Y 1
guide g String The sequence of the guide used in the experiment. N 1
pam-five-prime p String The PAM sequence associated with the guide, if found at the 5' end of the guide. N 1
pam-three-prime P String The PAM sequence associated with the guide, if found at the 3' end of the guide N 1
enzyme e String The name of the cutting enzyme used. Only used to label outputs. N 1
ref r FASTA file The FASTA file for the genome to which the BAM is aligned. The reference must have been indexed, e.g. with samtools faidx. Y 1
intervals l IntervalList File An optional file of regions to restrict the analysis to, in Picard IntervalList format. N 1
min-map-q m Integer The minimum mapping quality for a read to be used in the analysis. N 1 30
overhang Integer The expected overhang at the cut site (varies by enzyme). Positive for 5' overhang, negative for 3'. N 1 0
max-offset x Integer The maximum distance a read-start can be from the expected cut site and still be counted. N 1 2
min-depth N Integer Minimum sequencing depth at/near a cut site for the cut site to be emitted. N 1 10
max-depth X Integer Maximum sequencing depth at/near a cut site for the cut site to be emitted. N 1 300
max-low-mapq-fraction Q Decimal The maximum fraction of reads supporting a cut site with mapq < min-map-q. N 1 0.3
min-forward-reads F Integer Minimum number of forward strand reads supporting a cut site. N 1 4
min-reverse-reads R Integer Minimum number of reverse strand reads supporting a cut site. N 1 4
min-supporting-reads M Integer Minimum total number of reads supporting a cut site . . 1
min-supporting-fraction S Decimal Minimum fraction of reads supporting a cut site [cut reads / (cut reads + spanning reads)].). N 1 0.2

Example Invocation

An example invocation of the toolkit might look like:

java -Xmx8g -jar digenome/target/scala-2.12/digenome.jar IdentifyCutSites \
  --input=s1.aligned.sorted.deduped.bam \
  --output=s1.cut_sites.txt \
  --guide=ACGTTACGAAACGCCACGGGA \
  --pam-three-prime=NNGRRN \
  --enzyme=SauCas9 \
  --ref=/refs/hg38/hg38.fa  

Output Format

The output file generated is a tab-delimited text file with the following columns:

Column Name Description Example Value(s)
digenomitas_version The version of the software that produced the results. 20181002-bd34ba2-dirty
sample The name of the smaple used int he experiment (taken from the BAM file header). EMX1-example-sample
guide The guide sequenced used in the experiment. GTTCTGTCCTCAGTAAAAGGTA
enzyme The enzyme used in the experiment. SauCas9
expected_overhang The expected overhang of the enzyme (as passed to --overhang when running analysis. 0
window_size The window-size option given to the tool when running the analysis. 5
mapq_cutoff The mapq cutoff given to the tool when running the analysis. 30
chrom The chromosome on which the putative cut site resides. chr12
pos The most likely position of the cut site. 88102160
strand The strand of the cut site. F
low_mapq_fraction The fraction of reads supporting the cut site that had low mapping quality. 0
forward_starts The number of F strand reads supporting the cut site. 50
reverse_starts The number of R strand reads supporting the cut site. 53
read_depth The total number of reads at the cut site (including spanning reads). 104
read_fraction_cut The fraction of reads at the cut site that support the cut site (i.e. do not span it). 0.990385
read_score A score representing confidence in the cut site. Scales with read_depth and read_fraction_cut. 132.129971
template_depth The number of templates (aka inserts) at the cut site (including spanning templates). 105
template_fraction_cut The fraction of templates that appear to be cut at the cut site. 0.980952
template_score A template-based score representing confidence in the cut site. Scales with template_depth and template_fraction_cut. 145.65576
median_overhang The median overhang of reads at the cut site. 0
overhang_distribution The distribution of read overhangs at the cut site. By default contains five semi-colon separated values that represent the count of reas with overhang = -2, -1, 0, 1, 2. 0;2;48;0;0
neighborhood_ratio A measure of whether read-starts are suppressed in the neighborhood surrounding the cut site. When there are no read starts in the area around a cut site this will be ~0, when there are the expected number of read starts for the observed coverage and random read start positions, the value will be ~1. 0 or close to 0 is expected for a true cut site. 0
aln_start The 1-based start position of the best alignment of the guide+pam at or near the cut site. 88102142
aln_end The 1-based end position of the best alignment of the guide+pam at or near the cut site. 88102170
aln_strand The strand of the genome on which the best alignment of the guide+pam resides. F
aln_padded_guide The padded guide sequence at the alignment position. GTTCTGTCCTCAGTAAAAGGTAnngrrn
aln_alignment_string A string representing the pairwise alignment of the guide+pam to the reference. `
aln_padded_target The padded sequence of the reference at the alignment position. GTTCTGTCCTCAGTAAAAGGTATAGAGT
aln_mismatches The number of mismatches in the alignment. 0
aln_gap_bases The number of gapped bases in the alignment (i.e. a gap of length 2 is counted as 2). 0
aln_mm_and_gaps The sum of aln_mismatches and aln_gap_bases. 0

The padded alignment strings are intended to be concatenated with line breaks to help visualize the alignment. The following is an example of a more complicated alignment represented in the same format, with three mismatches and one bulge/indel.

GTTCTGTCCTCAGTAAAAGGTAnngrrn
|||||..||||||||||| |.|||||||
GTTCTACCCTCAGTAAAA-GAACTGGGT

Example Dataset

Files are located at:

  1. digeome-emx1-example-data.bam
  2. digeome-emx1-example-data.bai

An example dataset is included in this repository. The dataset is contains sequencing reads over nine 10kb wide regions from a sample treated with EMX1 guide GTTCTGTCCTCAGTAAAAGGTA and Staph aureus CAS9. The regions include the expected on-target cut site and a number of putative off-target cut sites (including some likely false positives).

To analyze the data try running:

java -Xmx8g -jar digenome/target/scala-2.12/digenome.jar IdentifyCutSites \
  --input=digeome-emx1-example-data.bam \
  --output=cut_sites.txt \
  --guide=GTTCTGTCCTCAGTAAAAGGTA \
  --pam-three-prime=NNGRRN \
  --enzyme=SauCas9 \
  --ref=/refs/hg38/hg38.fa