/SEM

Size-based Expectation Maximum (SEM) Algorithm

Primary LanguageJavaMIT LicenseMIT

Size-based Expectation Maximum (SEM)

Introduction

SEM is a nucleosome calling package, which focuses on characterizing nucleosome types genome-wide. The following nucleosome metrics would be produced in the output:

  • Dyad location
  • Occupancy
  • Fuzziness
  • Nucleosome subtype mixture probability

SEM distinguishes nucleosome type according to the DNA fragment length protected by the nucleosome. For example, a canonical nucleosome protects ~147bp DNA under extensive MNase digestion, while a hexamer should protect relatively shorter DNA. SEM assumes each type of nucleosome has its own distinct fragment length distribution, it deconvolves the fragment length profile of all DNA fragments to infer each distribution's parameters.

Installation

SEM is available on bioconda. To install, simply run:

$ conda install -c bioconda sem

Also you can download the latest runnable JAR file from SEM releases.

Quick Start

Run SEM on a single MNase-seq experiment:

# clone the repo and run in the root directory
$ sem --expt data/test.bam --geninfo data/mm10.fa.fai --out test/ --numClusters 3 --threads 1

The outputs will be in the test directory in this example.

Usage

Required:
	--out <output file prefix>
	--geninfo <genome info file>
	--expt <file name> AND --format <SAM/BED/SCIDX>
	OR
	--design <experiment design file name to use instead of --expt and --ctrl>
General:
	--threads <number of threads to use (default=1)>
	--verbose [flag to print intermediate files and extra output]
	--config <config file: all options here can be specified in a name<space>value text file, over-ridden by command-line args>
Genome:
	--providedPotenialRegions <bed file to restrict nucleosome detection regions> 
Loading Data:
	--nonunique [flag to use non-unique reads]
	--mappability <fraction of the genome that is mappable for these experiments (default=0.8)>
	--nocache [flag to turn off caching of the entire set of experiments (i.e. run slower with less memory)]
Detecting nucleosome type:
	--numClusters <number of nucleosome types> 
	--providedBindingSubtypes <custom binding subtypes (format: mean variance weight, sum of weights = 1)> 
	--onlyGMM <only Run GMM without the following nucleosome calling steps, use it to optimize nucleosome subtype parameters>
Running SEM:
	--r <max. model update rounds, default=3>
	--alphascale <alpha scaling factor(default=1.0>
	--fixedalpha <minimum number of fragments a nucleosome should have (default=1, must >= 1)>
	--exclude <file of regions to ignore, usually blacklist regions>
	--consensusExclusion <consensus exclusion zone>

Detecting nucleosome subtypes

In SEM, Gaussian Mixture Model (GMM) is used on MNase-seq fragment size distribution to find the potential nucleosome subtypes, each nucleosome subtype is represented by a Normal Distribution with parameters mean and variance. When --numClusters is set as a positive integer, a finite GMM will be used to find out the parameters of each nucleosome subtype.

It's recommended to use Picard CollectInsertSizeMetrics first to check the distribution of fragment size distribution to decide the number of clusters. When there is no prior knowledge on the number of nucleosome subtypes, --numClusters can also be set as -1 to let SEM decide it by a Dirichlet Process Mixture Model (DPMM).

You can use --onlyGMM option to let SEM only run the nucleosome subtype characterization, the distribution of each nucleosome subtype would be plotted in intermediate-results directory in the results folder.

Users can also provide their own nucleosome subtype information instead of using SEM built-in functions by providing a file through --providedBindingSubtypes, file should be in the below format with tab delimited:

mean_1	var_1	weight_1
mean_2	var_2	weight_2
...

The sum of weights should be equal to 1

Running SEM

--r controls how many rounds of EM will be used, the default 3 rounds is able to return precise enough nucleosome predictions. This number can be increased to get more precise predictions but at the risk of overfitting.

--fixedalpha decides the threshold for nucleosome occupancy, during EM, all nucleosomes below this threshold will be terminated, which ensures all the remaining nucleosomes have occupancy >= fixedalpha

--consensusExclusion decides the exclusion zone between nucleosomes, the spacing between nucleosomes will be >= this threshold, deafult exclusion zone is 127bp.

Restrict nucleosome calling regions

Since nucleosomes are everywhere on the genome, it's both computational intensive and time consuming to do EM on all the nucleosomes, besides, not all nucleosomes are of interest sometimes. --providedPotenialRegions can accept a bed file to restrict the regions where SEM will do nucleosome finding in, an example of potential regions could be candidate cis-regulatory regions (ccREs) from ENCODE SCREEN project.

A potential issue when running SEM in this way is that large number of small pieces of regions could lead to imprecise nucleosome finding at the edge of regions. So it's recommended to expand each region then merge the regions close to each other, below is the comparison between original ccRE regions and merged ccRE regions by bedtools slop -b 500 and bedtools merge -d 500

example region of user provided regions

Note: even provided with a potential region file, SEM will still use all MNase-seq fragments on the genome to decide nucleosome subtypes, if you want a different behavior, please refer to Detecting nucleosome subtypes for how to provide customized nucleosome subtypes information.

Feel free to open new issues or contact the author Jianyu Yang (jmy5455@psu.edu) on usage or feature requests!