/PeakSegPipeline

Supervised machine learning pipeline for peak calling in ChIP-seq data

Primary LanguageR

https://travis-ci.org/tdhock/PeakSegPipeline.png?branch=master

PeakSegPipeline: an R package for genome-wide supervised ChIP-seq peak prediction, for a single experiment type (e.g. broad H3K36me3 or sharp H3K4me3 data), jointly using multiple samples and cell types.

  • Labeling. The first step of any supervised analysis is to label a few genomic regions with and without peaks in your data. The PeakSegPipeline::create_track_hub function creates a track hub from bigWig files, which makes it easy to visualize data on the UCSC genome browser and then create labels (specific genomic regions where you have observed presence or absence of peaks in specific samples). For more details about labeling see Input Files -> Label Data below.
  • Single-sample peak calling using PeakSegFPOP. This repository includes a limited memory implementation of PeakSegFPOP, a PeakSeg functional pruning optimal partitioning algorithm (arXiv:1703.03352), which is used to predict preliminary peaks for each sample independently.
  • Multiple-sample peak calling using PeakSegJoint. After single-sample analysis, peaks on different samples are clustered into genomic regions with overlapping peaks. In each cluster we run PeakSegJoint to find the most likely common peak positions across multiple samples – this includes a prediction of which samples are up and down for each genomic region.

There are two major differences between PeakSegPipeline and all of the other peak detection algorithms for ChIP-seq data analysis:

  • Supervised machine learning: PeakSegFPOP and PeakSegJoint are trained by providing labels that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct the models, and they learn and get more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
  • Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegPipeline outputs a binary matrix, samples x peaks). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. MACS), and other multi-sample peak detection methods are limited to one (e.g. JAMM) or two (e.g. PePr) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).

Installation and testing

Install UCSC command line programs, as explained on our FAQ. These are necessary because the coverage data must be stored in bigWig files for efficient indexed access.

Install a C++ compiler and a BerkeleyDB STL development library, as explained in our FAQ. BerkeleyDB STL is used by PeakSegPipeline::PeakSegFPOP_disk to write a large temporary file to disk.

Finally, install the PeakSegPipeline R package.

if(!require(devtools))install.packages("devtools")
devtools::install_github("tdhock/PeakSegPipeline")

Once everything has been installed, you can test your installation by executing tests/testthat/test-pipeline-demo.R It will first download some bigWigs and labels to the test/demo directory, then run the PeakSegPipeline on them. If everything worked, you can view the results by opening test/demo/index.html in a web browser, and it should be the same as the results shown on http://cbio.mines-paristech.fr/~thocking/hubs/test/demo/

Input Files

PeakSegPipeline uses PeakSegFPOP + PeakSegJoint to predict common and different peaks in multiple samples. It requires three kinds of input data:

  • coverage data under project/samples,
  • labels in project/labels,
  • genomic segmentation problems in project/problems.bed.

To give a concrete example, let us consider the data set that is used when you run tests/testthat/test-demo.R

Coverage data

Each coverage data file should contain counts of aligned sequence reads at every genomic position, for one sample. These files must be in bigWig format, since it is indexed for fast lookup of coverage in arbitrary genomic regions. For example this test downloads 8 files:

test/demo/samples/bcell/MS026601/coverage.bigWig
test/demo/samples/bcell_/MS010302/coverage.bigWig
test/demo/samples/Input/MS002201/coverage.bigWig
test/demo/samples/Input/MS026601/coverage.bigWig
test/demo/samples/Input_/MS002202/coverage.bigWig
test/demo/samples/Input_/MS010302/coverage.bigWig
test/demo/samples/kidney/MS002201/coverage.bigWig
test/demo/samples/kidney_/MS002202/coverage.bigWig

In the example above we have the test/demo directory which will contain all data sets, labels, and peak calls for this particular project. The samples directory contains a sub-directory for each sample group (experimental conditions or cell types, e.g. bcell or kidney). Each sample group directory should contain a sub-directory for each sample (e.g. MS002201 or MS010302). Each sample sub-directory should contain a coverage.bigWig file with counts of aligned sequence reads (non-negative integers).

Note that in this demonstration project, the groups with underscores are un-labeled samples (e.g. bcell_), and the groups without underscores are labeled samples (e.g. bcell). In real projects typically you would combine those two groups into a single labeled group, but in this project we keep them separate in order to demonstrate the prediction accuracy of the learning algorithm.

Label Data

The project/labels/*.txt files contain genomic regions with or without peaks. These labels will be used to train the peak prediction models (automatically select model parameters that yield optimal peak prediction accuracy). A quick and easy way to create labels is by visual inspection as in the McGill ChIP-seq peak detection benchmark (for details please read Hocking et al, Bioinformatics 2016).

To visually label your data first create a project directory on a webserver with project/samples/groupID/sampleID/coverage.bigWig files, then create a track hub using a command such as

R -e 'PeakSegPipeline::create_track_hub("project_dir", "http://your.server.com/~user/path-", "hg19", "email@domain.com")'

The arguments of the create_track_hub function are as follows:

  • The first argument project is the data directory.
  • The second argument http://your.server.com/~user/path- is the URL prefix (appended before the first argument to obtain URLs for the trackDb.txt file).
  • The third argument hg19 is the UCSC genome ID for the genomes.txt file.
  • The fourth argument email@domain.com is the email address for the hub.txt file.

If that command worked, then you should see a message Created http://your.server.com/~user/path-project/hub.txt and then you can paste that URL into My Data -> Track Hubs -> My Hubs then click Add Hub to tell the UCSC genome browser to display your data. Navigate around the genome until you have found some peaks, then add positive and negative labels in project/labels/*.txt files.

For example the test data set contains only one labels file,

test/demo/labels/some_labels.txt

which contains lines such as the following

chr10:33,061,897-33,162,814 noPeaks
chr10:33,456,000-33,484,755 peakStart kidney
chr10:33,597,317-33,635,209 peakEnd kidney
chr10:33,662,034-33,974,942 noPeaks

chr10:35,182,820-35,261,001 noPeaks
chr10:35,261,418-35,314,654 peakStart bcell kidney

A chunk is a group of nearby labels. In the example above there are two chunks (far apart genomic regions, separated by an empty line). The first chunk has two regions with noPeaks labels in all samples, and two regions with positive labels in kidney samples and noPeaks labels in bcell samples. The second chunk has one region with noPeaks in bcell and kidney samples, and one region with a peakStart label in bcell and kidney samples.

In general, the labels file is divided into separate chunks by empty lines. Each chunk should contain lines for several nearby genomic regions, the corresponding label (noPeaks, peakStart, peakEnd, peaks), and the sample groups to which that label should be assigned (all other groups mentioned in the labels file will receive the noPeaks label). Ideally, each chunk should contain

  • At least one label with a peak in all samples.
  • At least one label with no peaks in any samples.
  • At least one label with a peak in some samples but not others (these labels are crucial for the model to be able to learn what is a significant difference between up and down).

Visualizing labels. After having added some labels in project/labels/*.txt files, run Rscript convert_labels.R project to create project/all_labels.bed. Then when you re-run Rscript create_track_hub.R ... it will create a new hub with a track “Manually labeled regions with and without peaks” that displays the labels you have created.

Genomic segmentation problems

The last input file that you need to provide is a list of separate segmentation problems for your reference genome (regions without gaps). This file should be in BED format (e.g. hg19_problems.bed).

If you don’t use hg19, but you do use another standard genome that is hosted on UCSC, then you can use PeakSegPipeline::downloadProblems.

Rscript -e 'PeakSegPipeline::downloadProblems("hg38", "hg38_problems.bed")'

If your reference genome does not exist on UCSC, you can use PeakSegPipeline::gap2problems to make a problems.bed file.

Rscript -e 'PeakSegPipeline::gap2problems("yourGenome_gap.bed", "yourGenome_chromInfo.txt", "yourGenome_problems.bed")'

where the chromInfo file contains one line for every chromosome, and the gap file contains one line for every gap in the reference (unknown / NNN sequence). If there are no gaps in your genome, then you can use yourGenome_chromInfo.txt as a problems.bed file.

Running steps of the pipeline in parallel

The first step is to convert label text files to bed files:

Rscript -e 'PeakSegPipeline::convert_labels("test/demo")'

Since the human genome is so large, we recommend to do model training and peak prediction in parallel. To use a PBS/qsub cluster such as Compute Canada’s guillimin, call PeakSegPipeline::create_problems_all with a PBS.header argument that reflects your cluster configuration:

Rscript -e 'PeakSegPipeline::create_problems_all("test/demo")'

That will create problem sub-directories in test/demo/samples/*/*/problems/*. Begin model training by computing target.tsv files:

for lbed in test/demo/samples/*/*/problems/*/labels.bed;do qsub $(echo $lbed|sed 's/labels.bed/target.tsv.sh/');done

The target is the largest interval of log(penalty) values for which PeakSegFPOP returns peak models that have the minimum number of incorrect labels. The target.tsv files are used for training a machine learning model that can predict optimal penalty values, even for un-labeled samples and genome subsets. To train a model, use

Rscript -e 'PeakSegPipeline::problem.train("test/demo")'

which trains a model using test/demo/samples/*/*/problems/*/target.tsv files, and saves it to test/demo/model.RData. To compute peak predictions independently for each sample and genomic segmentation problem,

for sh in test/demo/problems/*/jointProblems.bed.sh;do qsub $sh;done

which will launch one job for each genomic segmentation problem. Each job will make peak predictions in all samples, then write test/demo/problems/*/jointProblems/* directories with target.tsv.sh and peaks.bed.sh scripts. One directory and joint segmentation problem will be created for each genomic region which has at least one sample with a predicted peak. To train a joint peak calling model, run

qsub test/demo/joint.model.RData.sh

which will compute test/demo/joint.model.RData and test/demo/jobs/*/jobProblems.bed files. To make joint peak predictions, run

for sh in test/demo/jobs/*/jobPeaks.sh;do qsub $sh;done

To gather all the peak predictions in a summary on test/demo/index.html, run

qsub test/demo/peaks_matrix.tsv.sh

Finally, you can create test/demo/hub.txt which can be used as a track hub on the UCSC genome browser:

Rscript test/demo/hub.sh

The script will create test/demo/samples/*/*/coverage.bigWig and test/demo/samples/*/*/joint_peaks.bigWig files that will be shown together on the track hub in a multiWig container (for each sample, a colored coverage profile with superimposed peak calls as horizontal black line segments).

Output Files

The PeakSegPipeline::plot_all function creates

  • index.html a web page which summarizes the results,
  • peaks_matrix.tsv a binary matrix (peaks x samples) in which 1 means peak and 0 means no peak.
  • peaks_summary.tsv is a table with a row for each genomic region that has a peak in at least one sample. The columns are
    • chrom, peakStart, peakEnd genomic region of peak.
    • specificity if you have labeled peaks in Input samples, the model labels each peak as either specific (few Input samples up), or non-specific (many Input samples up). If you want to filter non-specific Input peaks yourself, you can use the n.Input column, which is the number of Input samples with a peak in this region.
    • loss.diff the likelihood of the peak (larger values mean taller and wider peaks in more samples).
    • chisq.pvalue, fisher.pvalue P-Values from Chi-Squared (chisq.test) and Fisher’s exact test (fisher.test) for whether or not this peak is group-specific (lower values mean strong correlation between peak calls and groups).

Related work

  • PeakSegOptimal::PeakSegFPOP provides a O(n log n) memory (and no disk usage) implementation of the PeakSegFPOP algorithm for separately calling peaks for every sample and genomic problem. In contrast PeakSegPipeline::PeakSegFPOP_disk implements the same algorithm using O(log n) memory and O(n log n) disk space (which is highly unlikely to memory swap, but a bit slower on large data sets). The PeakSegFPOP command line program is another on-disk implementation which can be used outside of R.
  • The PeakSegJoint package is used by PeakSegPipeline, for its algorithms for joint peak calling across any number of samples and cell types.
  • The penaltyLearning package is used by PeakSegPipeline, for its supervised learning algorithms (interval regression) which are used to predict model complexity (log penalty = number of peaks).
  • The PeakError package is used by PeakSegPipeline, to compute the number of incorrect labels for each peak model.