SMURF-seq is a protocol for sequencing short reads on a long-read sequencer by randomly concatenating short fragments. This repo contains the scripts we used to conduct initial analysis of SMURF-seq data in the context of copy-number profiling (with sequencing done on the Oxford MinION instrument), and to benchmark mapping methods for performance in mapping SMURF-seq reads.
The copy-number analysis we performed using SMURF-seq reads closely follows procedures already used in other publications. We first map the SMURF-seq reads using BWA:
bwa mem -x ont2d -k 12 -W 12 \
-A 4 -B 10 -O 6 -E 3 -T 120 bwa-mem/index/hg19.fa \
smurf_reads.fa > mapped_smurf_reads.sam
The parameters for the Smith-Waterman scoring ('A', 'B', 'O' and 'E') were determined using the simulation approach outined below (see also manuscript and supp info). The 'T' flag gives the minimum alignment score to output. The 'k' gives the size of k-mers to use for seeds. The 'W' indicates to discard a chain if seed bases are shorter than this value. The 'k' and 'W' are set to be liberal to catch and evaluate as many candidate mappings as possible.
The mapped fragments are given to a script that filters ambiguously mapped fragments:
./filterAlnScoreAndQual.py -i mapped_smurf_reads.sam \
-o unambig_smurf_frags.sam -s 120 -q 1
The input file mapped_smurf_reads.sam
is just the mapped reads
(e.g. with BWA). The output file unambig_smurf_frags.sam
contains mapped fragments with mapping quality greater than or equal
to 1.
Then the remaining fragments are given to a script that obtains the counts of reads in bins:
./getBinCounts.py -i unambig_smurf_frags.sam -c hg19.chrom.sizes \
-b bins_5k_hg19.bed -o bin_counts.bed -s bin_stats.txt
The input file unambig_smurf_frags.sam
is the same as described above.
The file hg19.chrom.size
is the size of all chroms
in the reference genome. This file for the hg19 reference is supplied
in the data
directory in this repo, and was obtained from the UCSC
Genome Browser's database. The pre-defined bins file bins_5k_hg19.bed
is also in the data
directory of this repo, and defines the 5000
bins in the genome used for the CNV analysis. The first output file
bin_counts.bed
is like bedgraph format: it has the chrom, start and
end given in bins_5k_hg1.bed
but it also has two extra columns:
one is the count of reads in that bin, and the final column is those
counts divided by the average reads per bin. This information was
determined based on what is required in the next script.
In the next step we use an adaptation of a script originally due to Timour et al. (Nat. Protocols, 2014). The script is run as follows:
./cnvAnalysis.R bin_counts.bed SampleName bins_5k_hg19_gc.txt bins_5k_hg19_exclude.txt
The input file bin_counts.bed
is the same as described above. The
input file bins_5k_hg19_gc.txt
is the GC content of each bin. The
input bins_5k_hg19_exclude.txt
is used to exclude certain parts of
the genome that attract an unusual amount of reads. The format is
simply the line numbers, in the corresponding bed file, of the bins to
exclude from the CNV analysis. The first output is a PDF file
SampleName.pdf
for the CNV profile. In addition, two tables are
saved: one table SampleName.data.txt
with the information
(chromosome, genome position, GC content, bin count, segmented value)
for each bin, and the other table SampleName.short.txt
summerizing
the breakpoints in the CNV profile.
The simulation procedure takes mapped long reads and uses them to generate short fragments with a known mapping location. Since the mapping location is known, we can assess a mapping strategy by how well it recovers the known mapping locations. The steps are as follows.
-
Select and map long reads: The data set should be from WGS using the sequencing technology of interest (in our case, the Oxford MinION). Use a standard long-read mapper, and make sure the output is in SAM format.
-
Generate candidate short fragments: From the SAM format of the long-read mapping locations, generate candidate short fragments with known mapping locations. This is done with a script in the
scripts
directory as follows:$ ./getFragsFromLongReads.py -s long_reads_mapped.sam -l 100 > candidates.bed
The candidates are encoded in 6-col BED format with the name column (the 4th) encoding the DNA sequence of the fragment. The parameter
-l
indicates the length of the candidate fragments to generate. Only uniquely mapping long reads are used for generating candidate fragments. The coordinates for each line in thecandidates.bed
file are the reference genome mapping coordinates for the short fragment obtained by arithmetic on the long-read reference location accounting for indels in the mapping. -
Filter the candidates to exclude deadzones: Any short fragments that would not map uniquely without the rest of the long read are excluded by using bedtools and a file of deadzones. We obtain the deadzones using the program
deadzones
available fromhttp://github.com/smithlabcode/utils
and it will generate deadzones for a given read length. We used 40 bp, which is conservative if the candidate fragments are longer than 40 bp. The deadzone k-mer should not be larger than the candidate fragment size. The program is run as follows:$ ./deadzones -k 40 -o dz_hg19_40.bed hg19.fa
This will use lots of memory and might be slow, but it has parameters to make it use less memory and it only needs to be done once. To filter the candidate fragments, use
bedtools
as follows:bedtools intersect -v -a candidates.bed -b dz_hg19_40.bed > good_frags.bed
The
good_candidates.bed
will be used to generate the simulated SMURF-seq reads. -
Randomly combine fragments into long reads: This step uses another script from the
scripts
directory:./readsFromFrags.py -n FAB42704 -f 10 -l 100 -r 10000 \ -b good_frags.bed -o simulated.fa
Above, the
-n
parameter indicates an identifying for the original data set, which goes into the read name so we can map multiple simulations together but later take them apart and analyze the results separately. The-f
,-l
and-r
give the number of fragments, their length and the number of reads to simulate (the fragment length could be inferred from the input). In the output, the name of each simulated SMURF-seq read insimulated.fa
contains a list of the original mapping locations of each fragment in the reference genome, so later we can use these to evaluate mapping performance. -
Map the simulated reads: Select a mapping tools, set the parameters, and map the simulated reads in
simulated.fa
from the above step to the same reference genome used already. The output should be in SAM format, and we will assume it is namedsimulated_out.sam
. -
Evaluate the mapping performance: here we use a script from the
scripts
directory:./evalSmurfSim.py -f simulated.fa -s simulated_out.sam -l 100
This script computes precision and recall at the level of recovering fragments and at the level of correctly determining mapping location for each nucleotide in the simulated reads. In addition, the portion of each simulated read that is mapped in at least one identified fragment is reported, along with summary stats on the sizes of fragments identified by the mapper.
-
[Python:] All Python scripts here are in Python3 (we used 3.6.8). The following non-standard libraries are used: pysam (we used 0.15.0) and numpy (we used 1.15.0).
-
[R:] The R script
cnvAnalysis.R
uses the DNAcopy library.Seshan VE, Olshen A (2018). DNAcopy: DNA copy number data analysis. R package version 1.56.0.
-
[Software tools:] For the simulations/valuations we require
bedtools
(we used v2.26.0). We also require thedeadzones
program fromhttp://github.com/smithlabcode/utils
but this could be substituted for any means of excluding unmappable regions. In our CNV analysis, we usedbwa
(0.7.17).
- Andrew D. Smith andrewds@usc.edu
- Rishvanth K. Prabakar kaliappa@usc.edu
Copyright (C) 2019 The Authors
Authors: Rishvanth K. Prabakar and Andrew D. Smith
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.