An R package for generating barcoded Massively Parallel Reporter Assay sequences
If you make use of this software, please cite the following publication:
Andrew R Ghazi, Edward S Chen, David M Henke, Namrata Madan, Leonard C Edelstein, Chad A Shaw; Design tools for MPRA experiments, Bioinformatics, Volume 34, Issue 15, 1 August 2018, Pages 2682–2683, https://doi.org/10.1093/bioinformatics/bty150
MPRA Design Tools depends on the Biostrings and BSgenome.Hsapiens.UCSC.hg38 packages from Bioconductor. First install these in R with the following commands:
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
biocLite("BSgenome.Hsapiens.UCSC.hg38")
The package also makes use of some tidyverse packages which can be installed with the following commands:
install.packages(c('dplyr', 'magrittr', 'purrr', 'readr', 'stringr', 'tibble', 'tidyr', 'purrrlyr'))
If you don't have the devtools package installed, install it like so:
install.packages("devtools")
After that you can install and load MPRA Design Tools with these commands:
devtools::install_github('andrewGhazi/mpradesigntools')
library(mpradesigntools)
This is the companion package to the MPRA Design Tools Shiny application available here: https://andrewghazi.shinyapps.io/designmpra/
The Shiny app allows users to interact with MPRA parameters (such as the number of barcodes per allele) and see the effect of changing those parameters on the assay's power. Researchers can use this to decide which parameters best meet their experimental goals.
Currently, the main function of the MPRA Design Tools package is to design a set of barcoded sequences for MPRA experiments (without overloading our Shiny server!). This is done with the `processVCF` function. It takes roughly 5 seconds plus 10 ms per barcoded sequence on a relatively modern CPU, so you can estimate the expected job time in seconds as:
5 + 0.01 * (number of barcodes per allele) * (number of SNPs in VCF) * 2 (for ref/alt alleles)
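The estimate above can be wrapped in a small helper. This is a minimal sketch: `estimate_job_seconds` and its argument names are hypothetical and not part of the package; only the constants (5 seconds of overhead, 10 ms per sequence, 2 alleles per SNP) come from the formula above.

```r
# Hypothetical helper (not part of mpradesigntools): rough job time in seconds.
estimate_job_seconds <- function(n_snps, nper,
                                 overhead = 5,         # fixed start-up cost in seconds
                                 per_sequence = 0.01,  # ~10 ms per barcoded sequence
                                 alleles_per_snp = 2) { # ref and alt
  overhead + per_sequence * nper * n_snps * alleles_per_snp
}

# Example: 1000 SNPs with 14 barcodes per allele
estimate_job_seconds(n_snps = 1000, nper = 14)
```

SNPs with multiple alternate alleles produce more than two alleles each, so for such inputs the result is likely an underestimate.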
Of the VCF columns, only CHROM, POS, REF, and ALT are used to build the constructs. The INFO column is consulted only to detect reverse-strand constructs.
Current input constraints are:
- Insertions and deletions must encode the reference and alternate alleles (respectively) as a dash character '-'.
- Multiple alternate alleles should be separated in the ALT field by a comma with no spaces.
- By default, the program pulls the sequence context from the forward (+) strand of the reference genome. If the user wishes to generate SNPs for genes that normally are read from the reverse strand, add a string containing "MPRAREV" to the INFO field of the VCF. This will ensure that the genomic context gets inserted with the correct orientation relative to the minimal promoter and barcode in the reporter plasmid.
- Alleles should be specified by the alleles present on the forward (+) strand. A small fraction of entries in official dbSNP VCFs are specified by their reverse strand alleles, which is denoted by the RV tag in the INFO field. These need to be flipped manually at the moment; automated handling is planned for a future release.
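A few of the constraints above can be checked before running the design step. The following is a minimal base R sketch on made-up records; the data, the variable names, and the matching rules are illustrative assumptions, not the package's own validation logic.

```r
# Made-up VCF-like records, for illustration only.
vcf <- data.frame(
  CHROM = c("1", "2", "3"),
  POS   = c(1000L, 2000L, 3000L),
  REF   = c("A", "-", "G"),          # "-" as REF encodes an insertion
  ALT   = c("G", "T", "A, C"),       # third row wrongly has a space after the comma
  INFO  = c("RS=1;RV", ".", "MPRAREV"),
  stringsAsFactors = FALSE
)

# ALT entries that contain a space and so break the "comma, no spaces" rule
has_space_in_alt <- grepl(" ", vcf$ALT, fixed = TRUE)

# Entries carrying the dbSNP RV flag, whose alleles need flipping by hand
has_rv_tag <- grepl("(^|;)RV(;|$)", vcf$INFO)

# Entries marked as reverse-strand constructs (assumed: plain substring match)
is_reverse_construct <- grepl("MPRAREV", vcf$INFO, fixed = TRUE)

vcf[has_space_in_alt | has_rv_tag, ]
```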
VCFs generated by batch querying rsIDs on dbSNP should meet most of the formatting requirements. However, the MPRAREV tag will need to be added by the user (where appropriate) because the VCFs do not always specify which strand the relevant gene is on.
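Adding the MPRAREV tag can be scripted once you know which variants sit in reverse-strand genes. The sketch below uses base R on made-up records; `reverse_strand_ids` and `add_mprarev` are hypothetical names, and appending the tag with a semicolon is an assumption based on the usual VCF INFO layout.

```r
# Made-up records; only the ID and INFO columns matter for this step.
vcf <- data.frame(
  ID   = c("var1", "var2", "var3"),
  INFO = c(".", "RS=123", "."),
  stringsAsFactors = FALSE
)

# Hypothetical list of variants known to lie in reverse-strand genes
reverse_strand_ids <- c("var2", "var3")

# Append MPRAREV to an INFO string, replacing an empty or "." placeholder
add_mprarev <- function(info) {
  ifelse(info %in% c(".", ""), "MPRAREV", paste0(info, ";MPRAREV"))
}

on_reverse <- vcf$ID %in% reverse_strand_ids
vcf$INFO[on_reverse] <- add_mprarev(vcf$INFO[on_reverse])
vcf
```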
9/17/18 - Feature under development
Alternative barcode sets may be used by setting the `barcode_set` argument of `processVCF` to one of the following values. The first number indicates the length of the barcodes in base pairs; the second indicates the number of errors that can be corrected while still identifying the original barcode. Note that these barcodes CAN include miR seed sequences. If you want to avoid miR interference, identify the main miRs by abundance in your cell type of interest, then include their seed sequences in the `filterPatterns` argument. These barcodes are provided by the freebarcodes package, described in the publication below and available from the GitHub repository listed after it.
The original barcode set provided with mpradesigntools is available as the `twelvemers` barcode set.
barcode_set | n_barcodes |
---|---|
barcodes10-1 | 1902 |
barcodes10-2 | 30 |
barcodes11-1 | 6160 |
barcodes11-2 | 74 |
barcodes12-1 | 17213 |
barcodes12-2 | 178 |
barcodes13-1 | 56735 |
barcodes13-2 | 467 |
barcodes14-1 | 157196 |
barcodes14-2 | 1155 |
barcodes15-1 | 518508 |
barcodes15-2 | 3182 |
barcodes16-1 | 1636417 |
barcodes16-2 | 8776 |
barcodes17-2 | 23024 |
barcodes3-1 | 1 |
barcodes4-1 | 2 |
barcodes5-1 | 9 |
barcodes5-2 | 1 |
barcodes6-1 | 26 |
barcodes6-2 | 1 |
barcodes7-1 | 66 |
barcodes7-2 | 3 |
barcodes8-1 | 212 |
barcodes8-2 | 6 |
barcodes9-1 | 553 |
barcodes9-2 | 11 |
twelvemers | 1140292 |
Indel-correcting DNA barcodes for high-throughput sequencing, John A. Hawkins, Stephen K. Jones, Ilya J. Finkelstein, William H. Press, Proceedings of the National Academy of Sciences Jul 2018, 115 (27) E6217-E6226; DOI: 10.1073/pnas.1802640115
https://github.com/finkelsteinlab/freebarcodes
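When picking a barcode set, it helps to compare the number of barcodes a design needs with the set sizes in the table above. The sketch below is illustrative only: the function names are hypothetical, just four rows of the table are copied in, and it assumes each barcoded sequence needs its own unique barcode.

```r
# A few rows copied from the table above.
barcode_sets <- data.frame(
  barcode_set = c("barcodes10-1", "barcodes12-1", "barcodes14-1", "barcodes16-1"),
  n_barcodes  = c(1902, 17213, 157196, 1636417),
  stringsAsFactors = FALSE
)

# Barcodes required, assuming one unique barcode per barcoded sequence
barcodes_needed <- function(n_variants, nper, alleles_per_variant = 2) {
  n_variants * alleles_per_variant * nper
}

# Sets from the table that hold at least `needed` barcodes
large_enough <- function(needed, sets = barcode_sets) {
  sets[sets$n_barcodes >= needed, ]
}

needed <- barcodes_needed(n_variants = 1000, nper = 14)
large_enough(needed)
```

Filtering steps such as `filterPatterns` presumably remove some barcodes from a set, so it seems prudent to leave headroom rather than choose a set that only just covers the requirement.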
processVCF(vcf = '/path/to/the.vcf',
           nper = 14,
           upstreamContextRange = 55,
           downstreamContextRange = 55,
           outPath = '/path/to/the/output.tsv',
           fwprimer = 'ACTGGCCGCTTCACTG',
           revprimer = 'AGATCGGAAGAGCGTCG',
           alter_aberrant = TRUE,
           extra_elements = FALSE,
           max_construct_size = 170,
           barcode_set = 'barcodes14-1',
           ensure_all_4_nuc = TRUE)
Once you've performed your MPRA and have your sequencing results, check out malacoda for QC and statistical analysis of your results!
- mm10 genomic context
- parallelization
- randomized alterations to aberrant digestion sites ✔
- bed file to Sharpr-MPRA library oligo design
- automated handling of RV SNPs ✔
- Optimized barcode pools ✔
If you are interested in a subset of these features or have other feature requests, please let us know to inform our implementation prioritization. You can do so by opening an issue on this repository or contacting the first and corresponding authors of the publication, listed above.