MPRA count analysis pipeline

MPRAmodel.R: A series of functions to assist in analysis of MPRA count tables

Quick Start:

	source("MPRAmodel.R")
	mpra_out <- MPRAmodel(<countsTable>, <attributesData>, <conditionData>, <filePrefix>, <negCtrlName>, <posCtrlName>, <projectName>)

Arguments

There are 3 files that this pipeline needs for input.

  • countsTable : Table of counts with header. Should be either the output of the MPRAcount pipeline, or in the same format.
  • attributesData : Attributes for each oligo including oligo name, SNP, chromosome, position, reference allele, alt allele, Allele (ref/alt), window, strand, project, haplotype. For Oligos named in the form chr:pos:ref:alt:allele:window(:haplotype) the scripts here and here can be used to generate the attributes table.
  • conditionData : 2 columns w/no header column 1 is replicates as found in the count table, column 2 indicates cell type. This can be done in the program of your choice, or taken from the output of MPRAcount.

Other arguments needed

  • exclList : ARRAY; List of cell types, in string format, that should be excluded from the analysis. This array is empty by default.
  • filePrefix : STRING; All written files will have this string included in the file name, if running the pipeline multiple times with different settings this is a good place to differentiate them
  • plotSave : LOGICAL; Default TRUE, indicator of whether or not to save non-normalization QC plots
  • altRef : LOGICAL; Default TRUE, indicator of how to sort alleles for cellSpecificTtest
  • method : STRING; Default "ss", indicator of method to be used to normalize the data
    • 'ss' : Summit Shift - shifts the l2fc density peak to line up with 0
    • 'ssn' : Summit Shift (Negative Controls Only) - shifts the peak of negative controls to 0
    • 'ro' : Remove outliers - remove oligos that don't have a p-value or have a p-value > 0.001 *'nc' : Negative Controls - normalize only the negative controls
  • negCtrlName : STRING; Default "negCtrl", how negative controls are indicated in the project column of the attributesData file
  • posCtrlName : STRING; Default "expCtrl", how positive controls are indicated in the project column of the attributesData file
  • projectName : STRING; Default "MPRA_PROJ", a generalized name for the overall project
  • tTest : LOGICAL; Default TRUE, perform cell type specific tTest to determine emVARs
  • DEase : LOGICAL; Default TRUE, use Differential Expression methods to detect Allele Specific Expression (this method can be found here)
  • correction : STRING; Default "BH", indicator of whether to use Benjamini Hochberg ("BH") or Bonferroni ("BF") for p-value correction
  • cutoff : INT; Default 0.01, significance cutoff for including alleles for skew calculation (tTest only)
  • upDisp : LOGICAL; Default TRUE, update dispersions with cell type specific calculations
  • prior : LOGICAL; Default FALSE, use betaPrior=F when calculating the cell type specific dispersions

Functions (in alphabetical order)

  • addHaplo - Add haplotype column to attribute data and resolve oligos with multiple projects, if oligos with multiple projects are listed separately, they will be collapsed into a single row, if oligos are identical but the SNP is different for them, an error will be thrown
    use :
    addHaplo(attributesData, negCtrlName, posCtrlName, projectName)
    
  • cellSpecificTtest - Function to perform TTest on individual cell types, determine the allelic skew, and identify emVARs
    use :
    cellSpecificTtest(attributesData, counts_norm, dups_output, ctrl_mean, exp_mean, ctrl_cols, exp_cols, altRef, correction, cutoff)
    
  • conditionStandard - Standardize condition data
    use :
    conditionStandard(conditionData)
    
  • dataOut - Retrieve output data for future functions - if only looking for results and not the plots this is the only function that needs to be called. Most functions should only need an output from here
    use :
    dataOut(countsData, attributesData, conditionData, exclList, altRef, file_prefix, method, negCtrlName, tTest, Dease, correction, cutoff, upDisp, prior)
    
  • DESkew - Function to determine the allelic skew of oligos and identify emVARs using the DESeq method to determine allelic skew
    use :
    DESkew(conditionData, counts_norm, attributesData, celltype, dups_output, prior)
    
  • expandDups - Expand IDs that denote duplicate oligos in count/DESeq results, or any dataframe with rownames that are separated by a ";"
    use :
    expandDups(output)
    
  • fileDate - Generate the date in YYYYMMDD format for separation of runs in the project
    use :
    fileDate()
    
  • MPRAmodel - Overall function which will run everything once called
    use :
    MPRAmodel(countsData, attributesData, conditionData, exclList, filePrefix, plotSave, altRef, method, negCtrlName, posCtrlName, projectName, tTest, Dease, correction, cutoff, upDisp, prior, …)
    
  • mpraScatter - Function to produce scatter plot of counts data for visualization of correlation between two samples
    use :
    mpraScatter(conditionData, countsOut, sampleX, sampleY, xmax, ymax, plotSave)
    
  • oligoIsolate - Remove Error, CIGAR, MD and position columns if necessary; aggregate count data with relation to the oligo, writes a copy of the collapsed raw counts
    use :
    oligoIsolate(countsData, file_prefix)
    
  • panel.cor - Internal function to produce correlation scatter plots
    use :
    panel.cor(x, y, digits, prefix, cex.cor, …)
    
  • panel.lm - Internal function to produce correlation scatter plots
    use :
    panel.lm(x, y, pch, col.lm, …)
    
  • panel.nlm - Internal function to produce correlation scatter plots
    use :
    panel.nlm(x, y, pch, col.lm, …)
    
  • plor_logFC - Function to produce plots showing the expression fold change vs. normalized tag counts
    use :
    plot_logFC(full_output, sample, negCtrlName, posCtrlName)
    
  • processAnalysis - Initial processing of files via DESeq analysis
    use :
    processAnalysis(countsData, conditionData, exclList) <br>
    
  • tagNorm - Normalize DESeq results and plot normalized densities for each cell type
    use :
    tagNorm(countsData, conditionData, attributesData, exclList, method, negCtrlName, upDisp, prior) <br>
    
  • tagSig - Replace dispersions of normalized dds with cell type specific dispersions, called within tagNorm does not need to be called by itself
    use :
    tagSig(dds_results, dds_rna, cond_data, exclList, prior)