HOME (histogram of methylation) is a python package for differential methylation region (DMR) identification. The method uses histogram of methylation features and the linear Support Vector Machine (SVM) to identify DMRs from whole genome bisulfite sequencing (WGBS) data. HOME can identify both pairwise and time series DMRs with or without replicates. #Installation HOME is written for python 2.7 and tested on Linux system. It is recommended to set up virtual environment for python 2.7 first before installing HOME package.
Step 1: Create a virtual environment for HOME
virtualenv -p <path_to_python2.7> <env_name>
NOTE : the above command assumes that the virtualenv tool is already installed on your system.
Step 2: Activate the virtual environment
source <env_name>/bin/activate
The name of the activated environment will appear on the left of the prompt.
Step 3: Install the HOME package outside virtual environment but make sure that the virtual environment is active
pip install git+https://github.com/ListerLab/HOME.git
or
git clone https://github.com/ListerLab/HOME.git
cd ./HOME
pip install -r requirements.txt
python setup.py install
*For conda users
git clone https://github.com/ListerLab/HOME.git
cd ./HOME
conda env create *assuming the conda environment is activated and R is already installed in it*
source activate HOMEenv
python setup.py install
HOME can be run in pairwise mode for two group comparisons and time series mode for more than two group comparisons. It can also be used for mutiple pairwise comparisions with large number of input samples.
BSSeeker2 CGmap file can be provided as input file directly or Input file format as mentioned below needs to provided
Chromosome number, position, strand, type (CG/CHG/CHH) where H is anything but G, methylated reads and total number of reads. For a sample, this information are saved in a single tab separated text file without header, which can be compressed or uncompressed. Below shows an example of such file:
chr1 15814 + CG 12 14
chr1 15815 - CG 15 21
chr1 15816 - CHG 1 9
chr1 15821 - CHH 7 22
chr1 15823 - CHH 0 2
chr1 15825 - CHH 11 19
DMR detection for two group comparison: Pairwise
HOME-pairwise -t [CG/CHG/CHH/CHN/CNN] -i [sample_file_fullpath] -o [output_directorypath]
Note: Please check the number of cores to use and set them by npp parameter (default is 8). Also for non-CG DMR prediction for huge genomes like mammalian genome use parameter -sin.
Example:
HOME-pairwise -t CG -i ./testcase/sample_file_CG.tsv -o ./outputpath
or (if CGmap files from BSSeeker2)
HOME-pairwise -t CG -i ./testcase/sample_file_CG.tsv -o ./outputpath --BSSeeker2
Required arguments:
-t --type Type of DMRs (CG /CHH/CHG/CHN/CNN)
-i --samplefilepath Sample file containing sample names and sample paths for each replicate (TAB sep); each sample info should be in different rows
-o --outputpath Path to the output directory
The default parameters for HOME are relatively permissive. To run HOME with more stringent setting please change the defaults parameters as below or higher:
HOME-pairwise -t CG -i ./testcase/sample_file_CG.tsv -o ./outputpath --delta 0.2 --minc 5
Optional arguments:
Parameter | Default | Description |
---|---|---|
-sc --scorecutoff | 0.1 | the score from the classifier for each C position |
-p --pruncutoff | 0.1 | the SVM score checked for consecutive Cs from both ends to refine the boundaries |
-npp -–numprocess | 8 | number of cores to be used |
-ml --minlength | 50 | minimum length of DMRs required to be reported |
-ncb --numcb | 5 | minimum number of Cs present between DMRs to keep them seperate |
-md -–mergedist | 500 | maximum distance allowed between DMRs to merge |
-prn --prunningC | 3 | number of consecutives Cs to be considered for pruning for boundary refinement2 |
-ns --numsamples | all | no.of samples to use for DMR calling; default takes all sample in the file |
-sp --startposition | 1st position | start position of sample in the sample file to use for timeseries DMR calling |
-BSSeeker2 --BSSeeker2 | False | input CGmap file from BSSeeker2 |
-mc --minc | 3 | minimum number of Cs in a DMR |
-sin --singlechrom | False | parallel code for single chromosome; npp will be used for parallel run for each chr |
-d --delta | 0.1 | minimum average difference in methylation required in a DMR |
-wrt --withrespectto | all | samples to use for DMR calling for pairwise comparisions with respect to specific samples |
-Keepall --Keepall | False | Keep all cytosine positions present in atleast one of the replicate |
Parameter –sc
HOME assigns a score from the classifier to each C position based on the histogram features. This score is then used to cluster the C’s and call the DMRs. The user can set a lower score cutoff if the DMRs seems to be missing or DMR boundaries seems to be missing the differential methylated cytosine near the boundaries. Similarly, the user can increase the cutoff if the boundaries of DMRs seems to be extended. The score ranges from 0-1 and the default is set to 0.1.
NOTE: the default score is set after rigorous testing on various data sets and need not to be varied in most of the cases. The default should only be changed after proper visualization of the DMRs on the browser.
Parameter –p
This parameter controls boundary accuracy of the DMRs. After the DMRs are called they are then pruned based on the scores for the consecutive C’s on both ends. The user can increase the cutoff if the boundaries seems to be extending too far. Similarly the user can lower the cutoff if the DMR boundaries seems to be missing the differential methylated C’s. The range is from 0 to 0.5 and the default is 0.1. Please note that similar to parameter –sc this parameter need not to be altered in most of the cases and should only be changed after proper visual inspection of the DMRs.
Parameter –ml
This parameters sets minimum length (in terms of base pairs) for a DMR to be reported. The DMRs below this length will be skipped and not reported in the filtered DMR output file. The default is 50bp.
Parameter –ncb
This parameter controls when the smaller DMRs should be merged into one. It controls minimum number of C’s required between DMRs to keep them as separate DMRs. It works in relation with parameter –md (described below). The default is 5 C’s. So, if the number of C’s are less than 5 and the distance is less than 500bp (default for --md) between two consecutive DMRs it will be merged into one single DMR.
Parameter –md
This parameter allows the user to set the merge distance between two consecutive DMRs. The defaults is 500bp.
Parameter –npp
This parameter allows the user to set the number of parallel process to run at a time. The default is 8.
Parameter –mc
This parameter allows the user to set minimum number of C’s present in a DMRs to be reported. Any DMR with less than the set value will not be reported in the filtered DMR file. The default is 5.
Parameter –d
This parameter sets minimum average methylation difference present for a DMR to be reported. The DMRs with less than the set value will not be reported in the filtered DMR file. The default is 0.1.
Parameter –prn
This parameter is used in relation to parameter –p (described above). This controls the number of consecutive C’s to be considered from both ends for boundary refinement. The default is 3. Alteration of this parameter should only be done after proper visual inspection of the DMRs.
Parameter –sin
This parameter is used if you want to parallel the code by single chromosome. The default is False, so the code will be parallel for all chromosomes. It should be used with huge chromosomes for example in case of non-CG DMR prediction for mammalian genome. If the genome size is small it is adviced not to use it. To turn it on just say -sin in the command line.
Parameter –ns
This parameter is used if you want to use selected number of samples from your sample input file. The default is False, so the code will use all the samples in the sample input file. It allows you to have as many samples as you want in your input file but control the number of samples to use for DMR calling.
Parameter –sp
This parameter is used if you want to select the samples from anywhere in your sample input file. This parameter is used along with ns parameter. The default is False, so the code will start from the 1st sample in the sample input file. It allows you to have as many samples as you want in your input file but control the samples to use for DMR calling.
Parameter –BSSeeker2
This parameter is used if you want to provide CGmap file directly. The default is False, so the code will require the files in the input format mentioned above. If the user have methyaltion output files from BSSeeker2, it can be provided directly. To turn it on just say -BSSeeker2 in the command line.
Parameter –Keepall
This parameter is used if you want keep all the positions present in any of the replicates. The default is False, so it will keep only those position that are present in all the replicates of a sample. This parameter is especially useful if the dataset contains more than 1 replicate and contains missing positions which are coverged in atleast one of the replicate. To turn it on just say -Keepall in the command line.
Parameter –wrt
This parameter is used if user want to run multiple parawise comparsions with recpect to only specific samples, instead of for all pairwise combinations. For example if user want to compare control (cnt) with all other samples, he/she can use below command:
HOME-pairwise -t [CG/CHG/CHH/CHN/CNN] -i [sample_file_fullpath] -wrt cnt -o [output_directorypath]
NOTE: more than one sample name can also be provided by -wrt, seperated by space
Output format
The output format is:
chr start end status numC mean_meth1 mean_meth2 delta Avg_coverage1 Avg_coverage2 len
Here, status refers to state of DMR (hyper/hypo). Mean_meth1 and mean_meth2 refers to mean methylation level for sample1 and 2 respectively. Delta refers to the difference in mean methylation level for two samples. Avg_coverage1 and avg_covereage2 gives the mean coverage for both samples.
DMR detection for more than two groups: time series
HOME-timeseries -t [CG/CHG/CHH/CHN/CNN] -i [sample_file_fullpath] -o [output_directorypath]
Example:
HOME-timeseries -t CG -i ./testcase/sample_file_CG.tsv –o /outputpath
Required arguments:
-t --type type of DMRs (CG /CHH/CHG/CHN/CNN)
-i --samplefilepath Sample file containing sample names and sample paths for each replicate (TAB sep); each sample info should be in different rows
-o –-outputpath path to the output directory
Optional arguments:
Parameter | Default | Description |
---|---|---|
-sc --scorecutoff | 0.5 | the score from the classifier for each C position |
-npp -–numprocess | 5 | number of cores to be used |
-ml --minlength | 50 | minimum length of DMRs required to be reported |
-ns --numsamples | all | no.of samples to use for DMR calling; default takes all sample in the file |
-sp --startposition | 1st position | start position of sample in the sample file to use for timeseries DMR calling |
-BSSeeker2 --BSSeeker2 | False | input CGmap file from BSSeeker2 |
-mc --minc | 4 | minimum number of Cs in a DMR |
-d --delta | 0.1 | minimum average difference in methylation required in a DMR |
-sin --singlechrom | False | parallel code for single chromosome; npp will be used for parallel run for each chr |
-Keepall --Keepall | False | Keep all cytosine positions present in atleast one of the replicate |
Output format
chr start end numC len max_delta confidence_scores Comb1-n
Here, Max_delta is the maximum average methylation difference among the compared samples. Confidence score takes into account the length, number of C’s and SVM score. The higher value denotes more confident DMR. Comb1-n denotes the pairwise comparisons for each combination of samples. It reports start: end: state: delta for each pairwise comparison.
To stop HOME execution in middle:
cntrl+c and then
cntrl+z
Note: always delete the directories created by HOME run if stopped in middle
Error Exception: File...training_data file does not exist
Please remember to CD into HOME first before starting the run
Error while installing dependencies from requirement file: for example "Compilation failed : pyconfig.h: No such file or directory"
sudo apt-get install python-dev
If you use this software in your work, please cite our paper HOME