Cell Heterogeneity Accounted cLonal Methylation (CHALM)

Unlike the traditional approach of calculating the mean methylation level of a predefined region (e.g., a promoter CpG island), CHALM takes the aligned sequencing reads directly as input and quantifies the cell-heterogeneity-accounted clonal methylation level of the region, which is more powerful in predicting gene expression. CHALM can also calculate the cell-heterogeneity-accounted clonal methylation ratio for single CpG sites, which is often required for DMR/UMR detection. Furthermore, to illustrate the importance of clonal information, CHALM provides a CNN deep learning framework that predicts expression directly from the aligned sequencing reads produced by high-throughput methylation profiling technologies such as whole-genome bisulfite sequencing (WGBS).
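
To make the difference between the two quantities concrete, here is a minimal sketch on a toy region. It assumes that a read counts as methylated when it carries at least one methylated CpG; the function names, the `min_methylated` parameter and the 0/1 read encoding are illustrative and are not taken from CHALM.py.

```python
from __future__ import print_function


def mean_methylation(reads):
    """Traditional level: methylated CpGs / all CpGs covered by the reads."""
    total = sum(len(r) for r in reads)
    methylated = sum(sum(r) for r in reads)
    return methylated / float(total)


def clonal_methylation(reads, min_methylated=1):
    """Fraction of reads carrying at least `min_methylated` methylated CpGs.

    The threshold is an assumption made for this sketch.
    """
    hits = sum(1 for r in reads if sum(r) >= min_methylated)
    return hits / float(len(reads))


# Toy region: each read is a list of CpG states (1 = methylated, 0 = not).
reads_a = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
reads_b = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

for name, reads in (("A", reads_a), ("B", reads_b)):
    print(name, mean_methylation(reads), clonal_methylation(reads))
```

Both toy regions have the same mean methylation (0.25), but in region B every read carries a methylated CpG, so the read-level fraction is 1.0 instead of 0.25. This is the kind of information that a per-region mean cannot capture.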

Authors

Availability

All files can be downloaded from GitHub: https://github.com/JR0202/CHALM

Dependencies

  • samtools/0.1.19
  • anaconda/2.5.0

note: CHALM depends on the packages above, but some tools included in CHALM may have different dependencies.

Installation

No installation is needed. Simply run the Python scripts as: python <python scripts>

CHALM is tested on Python 2.7.

Code Example (download the example data using the Synapse ID syn20549956)

1. Calculate the methylation level of CpG sites

a. The traditional mean methylation level of CpG sites

python ../src/CHALM.py trad -d hg19.fa -x CG -i no-action -p -r -m 4 -o output_examples/CD3_primary_chr13_trad_CpG_methratio.txt CD3_primary_CGI_chr13.sam
# time cost: ~15 min

b. The cell-heterogeneity-accounted clonal methylation ratio (CHALM) of CpG sites

python ../src/CHALM.py trad -d hg19.fa -x CG -i no-action -p -r -m 4 -l 1 -o output_examples/CD3_primary_chr13_CHALM_CpG_methratio.txt CD3_primary_CGI_chr13.sam
# time cost: ~15 min

2. Calculate the methylation level of predefined regions (e.g., promoter CGIs)

a. The traditional mean methylation level of promoter CGIs

python ../src/RegionMeth.py RegionMeth CGI_Gene_match_human_sorted_header.txt output_examples/CD3_primary_chr13_trad_CpG_methratio.txt -o CD3_primary_chr13_trad_meth_mean_promoter_CGI.txt
# time cost: ~1 min

b. The CHALM of promoter CGIs

python ../src/RegionMeth.py RegionMeth CGI_Gene_match_human_sorted_header.txt output_examples/CD3_primary_chr13_CHALM_CpG_methratio.txt -o CD3_primary_chr13_CHALM_promoter_CGI.txt
# time cost: ~1 min
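
Steps 1 and 2 can be chained from a small driver script. The sketch below is only a convenience wrapper: the helper names (`site_level_cmd`, `region_level_cmd`, `build_pipeline`, `run`) are hypothetical, while the scripts, flags and file names are copied from the commands above. By default it performs a dry run and only prints the commands.

```python
from __future__ import print_function
import subprocess

SAM = "CD3_primary_CGI_chr13.sam"
REGIONS = "CGI_Gene_match_human_sorted_header.txt"
OUT = "output_examples/CD3_primary_chr13_"


def site_level_cmd(out_path, clonal=False):
    """Step 1 command; the two commands in the text differ only by '-l 1'."""
    cmd = ["python", "../src/CHALM.py", "trad", "-d", "hg19.fa", "-x", "CG",
           "-i", "no-action", "-p", "-r", "-m", "4"]
    if clonal:
        cmd += ["-l", "1"]
    return cmd + ["-o", out_path, SAM]


def region_level_cmd(site_file, out_path):
    """Step 2 command; summarises a Step 1 output over the promoter CGIs."""
    return ["python", "../src/RegionMeth.py", "RegionMeth", REGIONS,
            site_file, "-o", out_path]


def build_pipeline():
    trad_sites = OUT + "trad_CpG_methratio.txt"
    chalm_sites = OUT + "CHALM_CpG_methratio.txt"
    return [
        site_level_cmd(trad_sites),
        site_level_cmd(chalm_sites, clonal=True),
        region_level_cmd(trad_sites,
                         "CD3_primary_chr13_trad_meth_mean_promoter_CGI.txt"),
        region_level_cmd(chalm_sites,
                         "CD3_primary_chr13_CHALM_promoter_CGI.txt"),
    ]


def run(commands, dry_run=True):
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)  # stops at the first failing step


run(build_pipeline())  # dry run: only prints the four commands
```

Call `run(build_pipeline(), dry_run=False)` from the example data directory to actually execute the four steps in order.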

3. Alternatively, calculate the CHALM of promoter CGIs directly from the SAM file without generating the CpG-level methylation file (skipping Step 1)

python ../src/CHALM.py CHALM -d hg19.fa -x CG -R Human_CGI_bedfile.txt -L 99 -l 1 -p -r -o output_examples/CD3_primary_chr13_CHALM.txt CD3_primary_CGI_chr13.sam
# time cost: ~12 min

4. Calculate differential CHALM methylation level

python ../src/CHALM_dif.py -f1 output_examples/CD3_primary_chr13_CHALM.txt -f2 output_examples/CD14_primary_chr13_CHALM.txt -o output_examples/CD3_CD14_chr13_CHALM_dif.txt
# time cost: ~10 sec

note: for replicates, separate the files by "," (e.g., Condition_1_replicate1.txt,Condition_1_replicate2.txt,Condition_1_replicate3.txt)
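
When there are several replicates, the comma-separated arguments can be assembled programmatically. This is a small sketch: `dif_cmd` is a hypothetical helper, and the `Condition_2_*` file names and the output name are placeholders following the naming pattern in the note above.

```python
from __future__ import print_function


def dif_cmd(condition_1, condition_2, out_path):
    """Build a CHALM_dif.py command; replicates are joined with ','."""
    return ["python", "../src/CHALM_dif.py",
            "-f1", ",".join(condition_1),
            "-f2", ",".join(condition_2),
            "-o", out_path]


# Placeholder file names, three replicates per condition.
cond_1 = ["Condition_1_replicate%d.txt" % i for i in (1, 2, 3)]
cond_2 = ["Condition_2_replicate%d.txt" % i for i in (1, 2, 3)]

cmd = dif_cmd(cond_1, cond_2, "Condition_1_vs_2_CHALM_dif.txt")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.check_call` unchanged, which avoids shell quoting problems with the commas.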

5. Imputation to extend the sequencing read length

a. Dependencies

  • anaconda2/4.3.1
  • R/3.3.0
  • samtools/0.1.19

b. Command example

python ../src/CHALM_SVD_imputation.py -d hg19.fa -e 100 -x CG -R Human_CGI_bedfile.txt -L 99 -l 1 -p -r -o output_examples/CD3_primary_chr13_CHALM_extend_100.txt CD3_primary_CGI_chr13.sam
# time cost: ~11 min

6. Deep learning prediction of expression by CHALM

a. Dependencies

b. Command example

(1) Process aligned reads for deep learning

python ../src/Deep_learning_read_process.py -d hg19.fa -x CG -p -r -o output_examples -n CD3_primary --region Gene_CGI_match_TSS_sorted.txt --depth_cut 50 --read_bins 200 CD3_primary_CGI.sam
# time cost: ~15min

As a control, add '-S' to disrupt the clonal information

python ../src/Deep_learning_read_process.py -d hg19.fa -x CG -p -r -S -o output_examples -n CD3_primary --region Gene_CGI_match_TSS_sorted.txt --depth_cut 50 --read_bins 200 CD3_primary_CGI.sam
# time cost: ~18min
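
This README does not describe how '-S' disrupts the clonal information. One plausible reading, sketched here purely as an assumption, is to shuffle the methylation states of each CpG position across reads: the per-CpG mean methylation is unchanged, while the read-level co-occurrence pattern is destroyed. The function name and the rectangular 0/1 matrix are illustrative only and do not reflect the internals of Deep_learning_read_process.py.

```python
from __future__ import print_function
import random


def shuffle_within_cpg(reads, seed=0):
    """Shuffle each CpG column independently across reads.

    `reads` is a rectangular list of lists (rows = reads, columns = CpGs,
    1 = methylated, 0 = unmethylated). Assumption-based sketch only.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*reads)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def per_cpg_counts(reads):
    """Number of methylated calls at each CpG position."""
    return [sum(col) for col in zip(*reads)]


# Two fully methylated and two fully unmethylated reads (a clonal pattern).
reads = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
control = shuffle_within_cpg(reads)

print(per_cpg_counts(reads), per_cpg_counts(control))
```

The per-CpG counts are identical before and after shuffling, so any difference in downstream prediction between the original and the control data would come from the read-level pattern rather than from the mean methylation.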

(2) Train deep learning model and do expression prediction

python ../src/Deep_learning_prediction.py -f1 output_examples/CD3_primary_meth_2D_code.txt -f2 output_examples/CD3_primary_distance_2_TSS.txt -m output_examples/CD3_primary_trad_meth_mean_promoter_CGI.txt -e CD3_primary_RSEM.genes.results -s CD3_primary -d -o output_examples/
# time cost: ~5min

Train the control data (with disrupted clonal information)

python ../src/Deep_learning_prediction.py -f1 output_examples/CD3_primary_meth_2D_code_control.txt -f2 output_examples/CD3_primary_distance_2_TSS_control.txt -m output_examples/CD3_primary_trad_meth_mean_promoter_CGI.txt -e CD3_primary_RSEM.genes.results -s CD3_primary_control -d -o output_examples/
# time cost: ~5min

note: CD3_primary_RSEM.genes.results contains the expression level calculated by RSEM (rsem-calculate-expression)
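
For orientation, an RSEM `genes.results` file is a tab-separated table with the header `gene_id`, `transcript_id(s)`, `length`, `effective_length`, `expected_count`, `TPM`, `FPKM`. The sketch below reads the TPM column into a dictionary. Which column Deep_learning_prediction.py actually uses is not stated here, so TPM is chosen only for illustration; the function name and the two example rows are made up.

```python
from __future__ import print_function


def read_rsem_tpm(lines):
    """Map gene_id -> TPM from the lines of an RSEM genes.results file."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    gene_col = header.index("gene_id")
    tpm_col = header.index("TPM")
    tpm = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        tpm[fields[gene_col]] = float(fields[tpm_col])
    return tpm


# Made-up rows in the genes.results layout (not real data).
example = [
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n",
    "GENE_A\tTX_A1,TX_A2\t1500.00\t1350.00\t120.00\t12.50\t9.80\n",
    "GENE_B\tTX_B1\t900.00\t750.00\t0.00\t0.00\t0.00\n",
]

tpm = read_rsem_tpm(example)
print(tpm["GENE_A"], tpm["GENE_B"])
```

For a real file, pass an open file handle instead of the list, e.g. `read_rsem_tpm(open("CD3_primary_RSEM.genes.results"))`.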

7. Deep learning prediction of expression by CHALM (pre-trained model)

a. Command example

(1) Expression prediction of CD3 primary cell by pre-trained model

python ../src/Deep_learning_prediction_pretrained.py -f1 output_examples/CD3_primary_meth_2D_code.txt -f2 output_examples/CD3_primary_distance_2_TSS.txt -m output_examples/CD3_primary_trad_meth_mean_promoter_CGI.txt -e CD3_primary_RSEM.genes.results -s CD3_primary_pretrained --model pretrained_model.pt -d -o output_examples/
# time cost: ~5min

Expression prediction for the control data (with disrupted clonal information) by pre-trained model

python ../src/Deep_learning_prediction_pretrained.py -f1 output_examples/CD3_primary_meth_2D_code_control.txt -f2 output_examples/CD3_primary_distance_2_TSS_control.txt -m output_examples/CD3_primary_trad_meth_mean_promoter_CGI.txt -e CD3_primary_RSEM.genes.results -s CD3_primary_pretrained_control --model pretrained_model.pt -d -o output_examples/
# time cost: ~5min