Obtain processed methylation data
[CGfinal files] Columns: chromosome, dinucleotide, position, methylated counts, total counts, sample ID
We should filter out sites that have fewer than 15 total reads. A site should then be kept only if it meets this 15-read cutoff in every sample.
Another possibility is merging sites into genomic regions, but we can think about this later.
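A minimal sketch of the coverage filter described above, in plain Python. It assumes each record is a tuple in the CGfinal column order and that a site is identified by (chromosome, position). The function name `filter_sites` and the tuple layout are illustrative, not part of the existing pipeline.

```python
from collections import defaultdict

MIN_READS = 15  # cutoff proposed in this thread


def filter_sites(rows, min_reads=MIN_READS):
    """Keep only rows for sites with >= min_reads total reads in every sample.

    rows: iterable of (chrom, dinucleotide, pos, methylated, total, sample)
    tuples (assumed layout, mirroring the CGfinal columns).
    """
    rows = list(rows)
    all_samples = {r[5] for r in rows}

    # For each site, collect the samples in which it meets the cutoff.
    passing = defaultdict(set)
    for chrom, _dinuc, pos, _meth, total, sample in rows:
        if total >= min_reads:
            passing[(chrom, pos)].add(sample)

    # A site survives only if it passed in every sample seen in the input.
    keep = {site for site, samples in passing.items() if samples == all_samples}
    return [r for r in rows if (r[0], r[2]) in keep]
```

Note that a site missing entirely from one sample is also dropped under this reading of "across all samples".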
Should I remove reads with the following chromosome names?
11_gl000202_random
17_ctg5_hap1
17_gl000203_random
17_gl000204_random
17_gl000205_random
17_gl000206_random
19_gl000208_random
19_gl000209_random
1_gl000191_random
1_gl000192_random
21_gl000210_random
4_ctg9_hap1
4_gl000193_random
4_gl000194_random
6_apd_hap1
6_cox_hap2
6_dbb_hap3
6_mann_hap4
6_mcf_hap5
6_qbl_hap6
6_ssto_hap7
7_gl000195_random
8_gl000196_random
9_gl000198_random
9_gl000199_random
9_gl000200_random
9_gl000201_random
M
I assume the answer is to delete them, but I wanted to ask because they seem to be shared across most or all patients and may have a more identifiable location.
Also, when I generate the ratio of reads, should I round the final value? If so, to how many decimal places?
Why would you need to round?
I think just to make the file smaller.
You should keep it as counts in the file, or, if you do store the ratio, round to at least three decimal places.
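If a ratio column is wanted anyway, something like the following could work. It assumes "ratio of reads" means methylated counts divided by total counts, and the helper name `methylation_ratio` is made up for illustration.

```python
def methylation_ratio(methylated, total, ndigits=3):
    """Return methylated / total rounded to ndigits places (assumed definition).

    Returns None when total is 0, since the ratio is undefined there.
    """
    if total == 0:
        return None
    return round(methylated / total, ndigits)
```

Keeping the raw counts alongside the ratio means nothing is lost to rounding.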
Also, for the above chromosomes, should I remove those?
Yes.
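A rough sketch of how those chromosomes could be dropped, assuming the names appear exactly as listed above (no "chr" prefix). The pattern below matches every name in that list, but it is an assumption that no wanted chromosome also matches it. An explicit set of the listed names would be the stricter alternative.

```python
def is_unwanted_chrom(name):
    """True for unplaced (*_random), alternate haplotype (*_hap*), and M contigs."""
    return name == "M" or name.endswith("_random") or "_hap" in name


def drop_unwanted(rows):
    """Remove rows whose first field (chromosome) is an unwanted contig.

    rows: iterable of tuples with the chromosome name first (assumed layout).
    """
    return [r for r in rows if not is_unwanted_chrom(r[0])]
```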