meyer-lab/tfac-ccle

Obtain processed methylation data

Closed this issue · 11 comments


[CGfinal files] Columns: chromosome, dinucleotide, position, methylated counts, total counts, sample ID

You can download the data here. DO NOT COMMIT THIS TO THE REPOSITORY.

We should filter out sites that have fewer than 15 total reads. Then, sites should only be included if they meet this 15-read cutoff across all samples.
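A minimal sketch of this filter, in case it helps. It assumes the records have already been parsed into tuples in the CGfinal column order; the function names and the toy data are hypothetical, and a site that is absent from a sample is treated as failing the cutoff for that sample.

```python
from collections import defaultdict

MIN_READS = 15  # cutoff proposed above


def sites_passing_cutoff(records, min_reads=MIN_READS):
    """Return the (chromosome, position) sites with enough reads in every sample.

    `records` holds tuples in the assumed CGfinal column order:
    (chromosome, dinucleotide, position, methylated, total, sample_id).
    """
    records = list(records)
    all_samples = {rec[5] for rec in records}
    samples_ok = defaultdict(set)  # site -> samples where it meets the cutoff
    for chrom, _dinuc, pos, _meth, total, sample in records:
        if total >= min_reads:
            samples_ok[(chrom, pos)].add(sample)
    # Keep a site only if every sample meets the cutoff at that site.
    return {site for site, ok in samples_ok.items() if ok == all_samples}


def filter_records(records, min_reads=MIN_READS):
    """Drop every record whose site fails the cutoff in any sample."""
    records = list(records)
    keep = sites_passing_cutoff(records, min_reads)
    return [rec for rec in records if (rec[0], rec[2]) in keep]


# Toy data, made up for illustration only.
example = [
    ("1", "CG", 100, 10, 20, "A"),
    ("1", "CG", 100, 12, 30, "B"),
    ("1", "CG", 200, 3, 14, "A"),   # below the cutoff in sample A
    ("1", "CG", 200, 9, 40, "B"),
    ("2", "CG", 50, 8, 16, "A"),    # not present in sample B
]
filtered = filter_records(example)
```

With the toy data, only the site at chromosome 1, position 100 survives, because it is the only one with at least 15 reads in both samples.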

Another possibility is merging sites into genomic regions, but we can think about this later.

https://github.com/NuttyLogic/METSIM_HMG_Code

Should I remove reads with the following chromosome names?
11_gl000202_random
17_ctg5_hap1
17_gl000203_random
17_gl000204_random
17_gl000205_random
17_gl000206_random
19_gl000208_random
19_gl000209_random
1_gl000191_random
1_gl000192_random
21_gl000210_random
4_ctg9_hap1
4_gl000193_random
4_gl000194_random
6_apd_hap1
6_cox_hap2
6_dbb_hap3
6_mann_hap4
6_mcf_hap5
6_qbl_hap6
6_ssto_hap7
7_gl000195_random
8_gl000196_random
9_gl000198_random
9_gl000199_random
9_gl000200_random
9_gl000201_random
M

I assume the answer is to remove them, but I wanted to ask because they seem to be shared across most or all patients and might map to a more identifiable location.

Also, when I generate the ratio of reads, should I round the final value, and if so, to how many decimal places?

Why would you need to round?

I think just to make the file smaller.

You should keep it as counts in the file, or round to at least three decimal places.
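For the rounding option, something like the sketch below would do. The function name is hypothetical, and it assumes the ratio in question is methylated counts divided by total counts; three decimal places is used as the default since that is the minimum suggested here.

```python
def methylation_ratio(methylated, total, ndigits=3):
    """Return methylated / total rounded to `ndigits` decimal places."""
    if total <= 0:
        raise ValueError("total read count must be positive")
    return round(methylated / total, ndigits)


ratio = methylation_ratio(1, 3)
```

Keeping the two count columns in the file instead loses nothing, since the ratio can always be recomputed from them.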

Also, should I remove the chromosomes listed above?

Yes.
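One possible way to drop them, as a sketch. It relies on an observation about the list above rather than on any official naming rule: every name to remove either contains an underscore or is exactly `M`. The function name is hypothetical.

```python
def is_kept_chromosome(name):
    """Return False for the random/haplotype contigs and for M listed above."""
    return "_" not in name and name != "M"


names = ["1", "6_cox_hap2", "17_gl000203_random", "X", "M", "22"]
kept = [n for n in names if is_kept_chromosome(n)]
```

If the files contain other chromosome names that should be dropped but do not follow this pattern, an explicit set of names to exclude would be safer.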