1709 CCWGG motifs in E. coli stationary phase cells
PengNi opened this issue · 18 comments
Hi,
Does anyone know where to find the positions of the 1,709 motifs used in the signalAlign paper in Nature Methods?
I just cannot figure out how to parse these motifs from the raw data.
Thank you very much!
Peng
I would greatly appreciate it if you could tell me. @mitenjain , @ArtRand
Hi Peng,
Have you checked this paper: https://www.nature.com/articles/ncomms1878
This was the source of our motif positions. The call file is this, one that includes calls for both strands: https://github.com/ArtRand/CytosineMethylationAnalysis/blob/master/data/ecoli/test_sites.tsv
Hope this helps.
-Miten
Thanks Miten,
This helps a lot.
I checked the ncomms1878 paper. However, I'm still not sure whether I can parse the same 1709 motifs from this paper as you did.
So does test_sites.tsv contain all the positions of the 1709 motifs mentioned in the signalAlign paper? To me it looks like test_sites.tsv is only for testing. If so, could you give me the remaining motifs used for training?
Many thanks.
Peng
Hi Peng,
Yes, I think that file should include everything. "Test sites" here refers to the sites looked at during both training and testing. There is hopefully more information in Supp. Note 2.9, "Dividing E. coli methylation motifs into training and test groups":
From the results of bisulfite sequencing performed on stationary phase E. coli K12 MG1655 we parsed 1,709 motifs, A, with the sequence CCWGG where the innermost cytosine is methylated [6] (W refers to either A or T). Analysis of the genome showed that 456 different 6-mers occur at the motifs centering around the second cytosine, let this set be denoted as K. A training group of motifs, T, was generated by randomly drawing motifs from A until all 6-mers in K were observed. Additional motifs from A were added at random until T contained roughly half of the total motifs. The remaining motifs were assigned to the test group, R, such that T ∩ R = Ø. The same groupings were used in experiments with 5-mers.
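The procedure in the note can be sketched roughly as follows. This is only an illustration under assumptions, not the code used for the paper: `split_motifs`, the `motif_kmers` mapping (motif identifier to the set of 6-mers covering its methylated cytosine), and the `seed` parameter are hypothetical names introduced here.

```python
import random


def split_motifs(motif_kmers, seed=0):
    """Split motifs into a training group T and a test group R.

    motif_kmers: dict mapping a motif id to the set of 6-mers that
    cover its methylated cytosine (hypothetical input format).
    """
    rng = random.Random(seed)
    order = sorted(motif_kmers)
    rng.shuffle(order)  # walking a shuffled list = random draws without replacement

    # K: every 6-mer observed at any motif
    all_kmers = set()
    for kmers in motif_kmers.values():
        all_kmers |= kmers

    # Phase 1: draw motifs until all 6-mers in K have been observed
    seen = set()
    n_train = 0
    while seen != all_kmers:
        seen |= motif_kmers[order[n_train]]
        n_train += 1

    # Phase 2: keep adding random motifs until T holds roughly half of A
    n_train = max(n_train, len(order) // 2)

    # The remainder is the test group, so T and R are disjoint by construction
    return order[:n_train], order[n_train:]
```

Because the split depends on the random draws, a different seed (or a different random number generator) will give different groups than the ones used in the paper.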
Let me know if you have any questions.
Best regards,
Miten
Hi Miten,
Thanks very much for giving me the information with the details.
I read the Supp. Note 2.9 before, and I've done my own test. However, I still have some questions:
(1) The paper mentioned 1709 CCWGG motifs, but I only get 1708 positions (854 on each strand) from test_sites.tsv.
Also, in your paper you said "We evenly divided 3,418 constitutively methylated cytosines into a training and testing set." I don't quite understand what the 3418 methylated cytosines refer to. Does this mean that there are 1709 cytosines in each of the two samples (pcrDNA, gDNA)?
(2) Supp. Note 2.9 says that there are 456 6-mers. However, I only got 451 different 6-mers.
To get the 6-mers, I first extracted all 11-mers in which the inner cytosine of each of the 1708 motifs is in the middle. Then I pooled the six 6-mers from every 11-mer. Is this the way you did it? Or is the difference because I used a different E. coli reference genome? I got a K12 MG1655 genome from ftp://ftp.ensemblgenomes.org/pub/release-29/bacteria//fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.2.29.dna.genome.fa.gz.
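For reference, the extraction I describe looks roughly like this. It is a minimal sketch: the function names are just for illustration, positions are assumed to be 0-based on the forward strand, and the genome is treated as linear (reverse-strand sites and the circular chromosome are not handled).

```python
def kmers_covering_center(genome, center, k=6):
    """Return every k-mer of `genome` that overlaps 0-based position `center`.

    For k=6 this yields the six 6-mers contained in the 11-mer
    centered on the methylated cytosine.
    """
    kmers = []
    for start in range(center - k + 1, center + 1):
        # skip windows that would run off either end of the sequence
        if start >= 0 and start + k <= len(genome):
            kmers.append(genome[start:start + k])
    return kmers


def distinct_kmers(genome, centers, k=6):
    """Pool the k-mers over all motif positions and de-duplicate them."""
    pooled = set()
    for center in centers:
        pooled.update(kmers_covering_center(genome, center, k))
    return pooled
```

`len(distinct_kmers(genome, centers))` is the number I compare against the 456 reported in the note.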
Thanks!
Best,
Peng
Hi Peng,
I am sending a file that does have 1709 motifs. It may be that the test sites file was missing one due to a leave-one-out test (I will check this with @ArtRand). Hope this is useful.
ecoli_all_both.txt
This should hopefully also help with the 6mers.
-Miten
Hi Miten,
With the new file, the numbers of motifs and 6-mers match the paper.
Thank you very much.
Best,
Peng
Hi @PengNi ,
In bisulfite data you also get the number of cytosine calls (and the number of 5mC calls) at each position. If I recall correctly, that is what we used to get the high-confidence motifs, i.e. cytosines at which every call was methylated (100% methylation, with a minimum coverage of more than 5 or 10 bisulfite calls). For processing with Bismark and Bowtie, we used the standard parameters and made sure to use the versions recommended in the ncomms1878 paper.
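A minimal sketch of that kind of filter, with assumptions stated up front: the input tuple layout `(position, strand, n_methylated, n_total)`, the function name, and the default threshold of 10 are illustrative only, and this is not the original script or the Bismark output format.

```python
def high_confidence_sites(calls, min_calls=10):
    """Keep cytosines where every bisulfite call is methylated.

    calls: iterable of (position, strand, n_methylated, n_total) tuples
           (hypothetical layout; adapt to your own call file).
    min_calls: minimum number of calls required at a site (assumed value).
    """
    kept = []
    for position, strand, n_methylated, n_total in calls:
        # require enough coverage AND 100% of calls methylated
        if n_total >= min_calls and n_methylated == n_total:
            kept.append((position, strand))
    return kept
```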
Hope this is useful.
-Miten
Thanks for telling me that, Miten. It is really helpful.
Best,
Peng
Hi @PengNi ,
Unfortunately, a lot of the files and documentation that we had parsed in 2014-15 for these data were lost during a server move last month (to conserve space). We will have to re-parse some of our own data in order to figure out what's going on. We will be in touch as we get to re-processing the bisulfite data.
I will also try to find the correspondence that I had with the authors of the NComms E. coli paper and send you any details I find if they may be useful.
Best regards,
Miten
PS: I saw your DeepSignal bioRxiv, very cool work. Congratulations and good luck :)
Hi Miten ,
That would be wonderful! I truly appreciate that.
Best,
Peng
Hi @PengNi ,
I haven't yet had a chance to parse the data, but if you send me an email (miten@soe.ucsc.edu) I can forward you processed data from the NComms paper.
Best regards,
Miten
Dear @mitenjain,
I am working on implementing and training my own model for methylation calling.
Thanks a lot for the above explanations and providing the datasets. I have read through the above issue and have some questions.
I obtained a dataset from the following link :
https://sra-pub-src-1.s3.amazonaws.com/SRR5219626/puc_minion.tar.gz.1
which contains many fast5 files. How exactly should I use these fast5 files, the file called 'ecoli_all_both.txt', and a reference genome to create a dataset for training and testing? I assume I need the raw nanopore signals, the corresponding nucleotide sequences, the motif positions, and the methylation labels as inputs. Is that correct?
If I understand correctly, this involves quite a bit of manual data wrangling, which makes it very laborious to replicate the experiments described in the paper. Would it be possible for you to provide a ready-to-use dataset, i.e. the raw nanopore signals, methylation labels, positions of the methylation, etc.?
Thanks a lot in advance
Arda
Dear @ardakdemir,
Sorry for the hassle. It is worth noting that all of the data from the paper are now outdated from a chemistry, basecalling, and informatics standpoint. It may be worth working from the human genome dataset (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md) and training your models using the new (and currently being updated) signalAlign repository (https://github.com/UCSC-nanopore-cgl/signalAlign). If you want the older raw data (and information), I am happy to share them and help where I can.
Best regards,
Miten
Dear @ardakdemir ,
There are some public bisulfite sequencing data. For human samples, they would be the place to start. We are generating and analyzing more data that could be useful in this regard. Those should be released in a few months.
You could also collect some publicly available data (like GM12878) and basecall using Guppy with the CpG methylation model. This will generate some methylation labels that you could correlate with bisulfite data to create a set of training sites for your own model.
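One possible way to do that correlation step is sketched below. Everything in it is an assumption for illustration: the per-site dictionaries of methylation fractions, the function name, and the 0.9 / 0.1 cutoffs are hypothetical, and parsing the Guppy and bisulfite outputs into that form is left out.

```python
def concordant_training_sites(nanopore, bisulfite, high=0.9, low=0.1):
    """Label sites where nanopore calls and bisulfite data agree.

    nanopore, bisulfite: dicts mapping a site key, e.g. (chrom, pos),
    to the fraction of reads called methylated (hypothetical format).
    Returns a dict site -> 1 (methylated) or 0 (unmethylated).
    """
    labels = {}
    # only sites present in both datasets can be compared
    for site in nanopore.keys() & bisulfite.keys():
        a, b = nanopore[site], bisulfite[site]
        if a >= high and b >= high:
            labels[site] = 1  # both methods agree: methylated
        elif a <= low and b <= low:
            labels[site] = 0  # both methods agree: unmethylated
        # intermediate or discordant sites are dropped
    return labels
```

Dropping the intermediate sites keeps the training labels cleaner at the cost of fewer sites; the cutoffs would need tuning for real data.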
Does that make sense?
Best regards,
Miten