A Spike-in Free ChIP-Seq Normalization Approach for Detecting Global Changes in Histone Modifications
Traditional reads per million (RPM) normalization method is inappropriate for the evaluation of ChIP-seq data when the treatment or mutation has the global effect. Changes in global levels of histone modifications can be detected by using exogenous reference spike-in controls. However, most of the ChIP-seq studies have overlooked the normalization problem that have to be corrected with spike-in. A method that retrospectively renormalize data sets without spike-in is lacking.
We observed that some highly enriched regions were retained despite global changes by oncogenic mutations or drug treatment and that the proportion of reads within these regions was inversely associated with total histone mark levels. Therefore, we developped ChIPseqSpikeInFree
, a novel ChIP-seq normalization method to effectively determine scaling factors for samples across various conditions and treatments, which does not rely on exogenous spike-in chromatin or peak detection to reveal global changes in histone modification occupancy. This method is capable of revealing the similar magnitude of global changes as the spike-in method.
In summary, ChIPseqSpikeInFree
can estimate scaling factors for ChIP-seq samples without exogenous spike-in or without input. When ChIP-seq is done with spike-in protocol but high variation of Spike-In reads between samples are observed, ChIPseqSpikeInFree can help you determine a more reliable scaling factor than ChIP-Rx method. It's not recommended to run ChIPseqSpikeInFree blindly without any biological evidences like Western Blotting to prove the global change at protein level between your control and treatment samples.
To use the tool, you will need to create a DNAnexus account at https://platform.dnanexus.com/register?client_id=sjcloudplatform. After logging in DNAnexus, you can create a project , upload your data to your project folder and choose the ChIPseqSpikeInFree app (Tools --> library --> search ChIPseqSpikeInFree) to run. Or you can run ChIPseqSpikeInFree [v1.2.3] at https://platform.dnanexus.com/app/ChIPseqSpikeInFree and get results in an hour.
ChIPseqSpikeInFree
depends on Rsamtools
, GenomicRanges
, and GenomicAlignments
to count reads from bam files.
To install these packages, start R
(version "3.5")or above, enter:
> if (!requireNamespace("BiocManager", quietly = TRUE))
> install.packages("BiocManager")
> BiocManager::install("Rsamtools")
> BiocManager::install("GenomicRanges")
> BiocManager::install("GenomicAlignments")
Using R
, enter:
# Install this package from GitHub
> install.packages("devtools")
> library(devtools)
> install_github("stjude/ChIPseqSpikeInFree")
> packageVersion('ChIPseqSpikeInFree')
#[1] '1.2.4'
Or using command lines
$ git clone https://github.com/stjude/ChIPseqSpikeInFree.git
$ R CMD build ChIPseqSpikeInFree
$ R CMD INSTALL ChIPseqSpikeInFree_1.2.4.tar.gz
A simple workflow in R
environment.
> library("ChIPseqSpikeInFree")
Save as /your/path/sample_meta.txt
(Example)
ID | ANTIBODY | GROUP |
---|---|---|
WT-H3K27me3.rmdupq1.bam | H3K27me3 | WT |
K27M-H3K27me3.rmdupq1.bam | H3K27me3 | K27M |
WT-INPUT.rmdupq1.bam | INPUT | WT |
K27M-INPUT.rmdupq1.bam | INPUT | K27M |
Note: INPUT samples are not required at all for a valid metadata file.
> metaFile <- "/your/path/sample_meta.txt"
> bams <- c("WT-H3K27me3.rmdupq1.bam","K27M-H3K27me3.rmdupq1.bam")
> ChIPseqSpikeInFree(bamFiles = bams, chromFile = "hg19", metaFile = metaFile, prefix = "test")
4. Run ChIPseqSpikeInFree
pipeline with custom settings for ChIP-seq with unideal enrichment or many very broad enriched regions like H3K9me3
> ChIPseqSpikeInFree(bamFiles = bams, chromFile = "hg19", metaFile = metaFile, prefix = "test", cutoff_QC = 1, maxLastTurn=0.97)
In the simple usage scenario, the user should have ChIP-seq bam files ready. Sample information can be specified in a metadata file (metaFile
) and the user should choose a correct reference genome corresponding to the bams.
User should follow ChIP-seq guidelines suggested by ENCODE consortium(Landt, et al., 2012)
and check the data quality first. Some steps to check quality and prepare ChIP-seq bam files for ChIPseqSpikeInFree
normalization:
- run SPP tool (Marinov, et al., 2014) to do ChIP-seq data QC and use samples with Qtag >= 1
- remove spike-in reads from your bam files
- only use good-quality or uniquely-mapped reads your bam files
java -jar picard.jar MarkDuplicates \ I=input.bam \ O=rmdup.bam \ M=rmdup_metrics.txt\ CREATE_INDEX=true \ ASSUME_SORTED=false \ REMOVE_DUPLICATES=true" samtools view -hb -q 1 rmdup.bam > rmdupq1.bam
- bam files must contain a header section and an alignment section
samtools view -H rmdupq1.bam
hg19
, mm9
, mm10
, hg38
are included in the package.
For other genomes, you can either:
-
Use
fetchChromSizes
to get it from UCSC, but not all genomes are available.http://hgdownload.soe.ucsc.edu/goldenPath/${DB}/bigZips/${DB}.chrom.sizes
Replace${DB}
with reference genome -
Generate directly from
fasta
file (Linux)$ samtools faidx genome.fa $ cut -f1,2 genome.fa.fai > genome.chrom.sizes
3. metaFile: A tab-delimited text file having three columns with a header line: ID
, ANTIBODY
, and GROUP
.
ID
is the bam file name of ChIP-seq sample that will be included for analysis.ANTIBODY
represents antibody used for ChIP.GROUP
describes the biological treatment or condition of this sample.
After you successfully run following ChIPseqSpikeInFree
pipeline:
> ChIPseqSpikeInFree(bamFiles = bams, chromFile = "hg19", metaFile = metaFile, prefix = "test")
Output will include: (in case that you set prefix ="test"
)
test_SF.txt
- text result (see Interpretation section or text file)- tab-delimited text format, a table of calculated scaling factors by pipeline
test_distribution.pdf
- graphical result (see Figure 1.A,B or PDF file)- view of proportion of reads below the given CPMW based on
test_parsedMatrix.txt
- view of proportion of reads below the given CPMW based on
test_boxplot.pdf
- graphical result (see Figure 1.C or PDF file)- view of scaling factors as boxplot based on
test_SF.txt
- view of scaling factors as boxplot based on
test_rawCounts.txt
- intermediate file- tab-delimited text format, a table of raw read counts for each 1kb bin across genome
test_parsedMatrix.txt
- intermediate file- tab-delimited text format, a table of proportion of reads below given cutoffs (CPMW)
ID | GROUP | ANTIBODY | COLOR | QC | SF | TURNS |
---|---|---|---|---|---|---|
WT-H3K27me3.rmdupq1.bam | WT | H3K27me3 | grey | pass | 1 | 0.25,0.3817,6.1,0.9523 |
K27M-H3K27me3.rmdupq1.bam | K27M | H3K27me3 | green | pass | 5.47 | 0.2,0.3429,34.7,0.9585 |
WT-INPUT.rmdupq1.bam | WT | INPUT | black | fail: complete loss, input or poor enrichment | NA | 0.2,0.1860,0.7,0.9396 |
K27M-INPUT.rmdupq1.bam | K27M | INPUT | grey | fail: complete loss, input or poor enrichment | NA | 0.15,0.0675,0.35,0.8476 |
- COLOR: defines color for each sample in cumulative distribution curve
*_distribution.pdf
- QC: quality control testing. QC failure indicates poor or no enrichment.
- SF: scaling factor. Only sample that passes QC will be given a SF and NA indicates sample with poor enrichment. Input sample is not required for SF calculation. Larger scaling factor indicates a lower level of total histone mark and vice versa. For example, H3K27me3 is globally decreased in H3.3 K27M cells compared to WT (SF, 5.47 vs 1).
- TURNS: the coordinates of two points [Xa, Ya, Xb, Yb] detected in cumulative distribution plot (proportion of reads below CPMW cutoffs) for slope-based SF calculation.
We can use SF to adjust original library size for differential analysis and generating bigwig files.
- for differential analysis with EdgeR
...
dat <- read.table("sample_SF.txt", sep="\t",header=TRUE,fill=TRUE,stringsAsFactors = FALSE, quote="",check.names=F)
SF <- dat$SF
dge <- DGEList(counts = counts, group = GROUP, norm.factors = SF)
...
- for differential analysis with DESeq2
...
dat <- read.table("sample_SF.txt", sep="\t",header=TRUE,fill=TRUE,stringsAsFactors = FALSE, quote="",check.names=F)
SF <- dat$SF
countData <- matrix(1:100,ncol=4)
condition <- factor(c("A","A","B","B"))
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
dds <- estimateSizeFactors(dds)
coldata<- colData(dds)
coldata$sizeFactor <- coldata$sizeFactor *SF
colData(dds) <- coldata
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
...
- pseudo-code for generation of bigwig files from a bed file
libSize=`cat sample1.bed|wc -l`
scale=15000000/($libSize*$SF)
genomeCoverageBed -bg -scale $scale -i sample1.bed -g mm9.chromSizes > sample1.bedGraph
bedGraphToBigWig sample1.bedGraph mm9.chromSizes sample1.bw
In some scenario, a histone mark can be absent in a cell type at a certain developmental stage or in a model that the histone writer gene has been knocked-out. In this case, you will see a "NA" in SF column and "fail: complete loss, input or poor enrichment" in QC column in the output file ${prefix}_SF.txt. It's unreasonable to calculate SF for this kind of samples alone. However, if you restore the histone mark by drug treatment or ectopic over-xpression of histone writer, you can have a valid SF. Now you may want to compare "NA" with non-"NA" SF for a histone mark to show a global change in bigwig file or differential analysis. Here is the pseudocode to transform SF:
dat<- read.table("sample_SF.txt", sep="\t",header=TRUE,fill=TRUE,stringsAsFactors = FALSE, quote="",check.names=F)
SF <- dat$SF
SF[is.na(SF)] <- 1
SF[!is.na(SF)] <- 1/SF[!is.na(SF)]
dat$SF <- SF
write.table(dat, "sample_SF_completeLoss.txt", sep="\t",quote=F,row.names=F, col.names=T)
This repository contains the following:
- source code
- documentation [PDF file]
- chromFile of human and mouse reference genome (
hg19
,mm9
,mm10
, andhg38
) - an example of
sample_meta.txt
Users will still need to source and provide the following:
- bam files
sample_meta.txt
- chromFile except
hg19
,mm9
,mm10
, andhg38
H Jin, LH Kasper, JD Larson, G Wu, SJ Baker, J Zhang, Y Fan*. (2019) ChIPseqSpikeInFree: A ChIP-seq normalization approach to reveal global changes in histone modifications without spike-in. Bioinformatics, https://doi.org/10.1093/bioinformatics/btz720
ChIPseqSpikeInFree is developed by Drs. Hongjian Jin and Yiping Fan at the St Jude Children's Research Hospital. For collaborations or any other matters, please contact yiping.fanATstjude.org.
>sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS: /home/R/install/3.6.1/lib64/R/lib/libRblas.so
LAPACK: /home/R/install/3.6.1/lib64/R/lib/libRlapack.so
locale:
[1] C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ChIPseqSpikeInFree_1.2.4 GenomicAlignments_1.20.1
[3] Rsamtools_2.0.3 Biostrings_2.52.0
[5] XVector_0.24.0 SummarizedExperiment_1.14.1
[7] DelayedArray_0.10.0 BiocParallel_1.18.1
[9] matrixStats_0.55.0 Biobase_2.44.0
[11] GenomicRanges_1.36.1 GenomeInfoDb_1.20.0
[13] IRanges_2.18.3 S4Vectors_0.22.1
[15] BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] lattice_0.20-38 bitops_1.0-6 grid_3.6.1
[4] zlibbioc_1.30.0 Matrix_1.2-17 tools_3.6.1
[7] RCurl_1.95-4.12 compiler_3.6.1 GenomeInfoDbData_1.2.1