gifford-lab/CpGenie

Question about dataset preprocessing

Closed this issue · 3 comments

h5li commented

Hi, I am also working on predicting DNA methylations from DNA sequences.

I have a question about your CNN and the dataset it is trained on. Are the dataset labels binary, i.e., 0 or 1, or is each label an array of two values giving the probabilities of being unmethylated and methylated, like [0.2, 0.8]? Also, how did you preprocess sequences with multiple reads/experiments?

I appreciate your time to answer this question.

All the best,
Han

Hi Han, thank you for your interest in our work. We binarized the labels. The details of this, and of how we merged different experiments, can both be found in the paper.

h5li commented

Thanks for replying!
Just one more question!

Could you specify which part of the paper describes how you merged different experiments? I only found a brief description in the "High-throughput DNA methylation data" section.
You said that "We merged multiple replicates for the same experiments, and where a CpG exists in all replicates we merged the counts of methylated and unmethylated reads and re-calculated the percentage of methylation."

Does this mean that with 2 methylated reads and 8 unmethylated reads, giving a methylation percentage of 20%, you binarized the label to 0?

All the best,
Han.

@h5li Yes, in this case the methylation is 20% (not 80%), and we binarized it as 0. The 50% binarization cutoff was chosen based on the highly bimodal distribution of methylation percentages across the CpG sites in the genome (with peaks at 0 and 1).

Haoyang
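
For readers who want to reproduce the labeling discussed above, here is a minimal Python sketch of the two steps described in this thread: pooling methylated and unmethylated read counts over CpGs present in all replicates, then binarizing the recalculated fraction at 50%. It is not taken from the CpGenie code. The function names, the `(chromosome, position)` site key, the sample counts, and the handling of a fraction exactly at the cutoff are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]    # hypothetical key: (chromosome, position)
Counts = Tuple[int, int]  # (methylated reads, unmethylated reads)


def merge_replicates(replicates: List[Dict[Site, Counts]]) -> Dict[Site, float]:
    """Pool counts for CpGs found in every replicate and return the
    recalculated methylation fraction per site."""
    if not replicates:
        return {}
    # Keep only CpG sites that exist in all replicates.
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)
    merged: Dict[Site, float] = {}
    for site in shared:
        meth = sum(rep[site][0] for rep in replicates)
        unmeth = sum(rep[site][1] for rep in replicates)
        total = meth + unmeth
        if total > 0:  # skip sites with no coverage to avoid dividing by zero
            merged[site] = meth / total
    return merged


def binarize(fraction: float, cutoff: float = 0.5) -> int:
    """Map a methylation fraction to a 0/1 label.
    Assumption: a fraction exactly at the cutoff is labeled 1;
    the thread does not say how ties are handled."""
    return 1 if fraction >= cutoff else 0


# Made-up counts for two replicates of one experiment.
rep1 = {("chr1", 100): (1, 5), ("chr1", 200): (9, 1), ("chr1", 300): (4, 4)}
rep2 = {("chr1", 100): (1, 3), ("chr1", 200): (8, 2)}

fractions = merge_replicates([rep1, rep2])
labels = {site: binarize(frac) for site, frac in fractions.items()}
```

In this made-up example, the site at `("chr1", 100)` pools to 2 methylated and 8 unmethylated reads, so its fraction is 0.2 and its label is 0, matching the case asked about above. The site at `("chr1", 300)` is dropped because it is missing from the second replicate.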