Train my own data
Hello remora team:
I have read the README and see that Remora can be trained on one's own data. How should I process my sequencing data if I want to train a 5mC methylation model? Do I need to methylate the C positions? Do I need to train the remora model at the same time? Thank you very much.
The test data included with Remora is intended for simple testing purposes and not for actually training high quality modified base calling models.
For training your own modified base models, the key information required from the data is a known modification status at reference positions (or motifs). Could you provide some details on the data you intend to use for model training?
For your final question, could you re-phrase this? I do not understand what is meant by "train the remora model at the same time".
Thank you very much for your reply.
I now have two human samples with the same sequence. One is fully demethylated, and the other is untreated. I can sequence them with MinION. What should I do next?
The sentence "train the remora model at the same time" in my question was originally meant for the issues of the Dorado GitHub repository. Please ignore it, sorry.
In order to train a Remora model, the modified base status of each training chunk must be known. The remora dataset prepare command takes a --focus-reference-positions argument, which specifies which positions within your reference should be used. Note that all reads covering these positions should be modified or canonical, depending on whether you are creating the modified or canonical dataset (see the --mod-base and --mod-base-control options in remora dataset prepare -h).
If you do not have a ground truth for the modified base content of your training reads then you will not be able to train a Remora model from this data. I hope this helps and please respond with any specific further questions you might have.
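As a rough illustration of preparing input for --focus-reference-positions, the following minimal Python sketch lists the cytosines in CG contexts of a reference sequence as BED lines. It assumes the option accepts a BED file with a strand column (please confirm with remora dataset prepare -h), and cpg_bed_lines is a hypothetical helper written for this example, not part of Remora. It only enumerates candidate positions; it does not establish the ground-truth modification status that training requires.

```python
def cpg_bed_lines(contig, sequence):
    """Return BED lines (0-based, half-open) for each C in a CG context.

    Hypothetical helper for illustration only. Assumes a six-column BED
    layout: contig, start, end, name, score, strand.
    """
    seq = sequence.upper()
    lines = []
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            # Forward-strand cytosine sits at position i.
            lines.append(f"{contig}\t{i}\t{i + 1}\tCG\t0\t+")
            # Reverse-strand cytosine pairs with the G at position i + 1.
            lines.append(f"{contig}\t{i + 1}\t{i + 2}\tCG\t0\t-")
    return lines


if __name__ == "__main__":
    # Toy reference; a real run would read contigs from a FASTA file.
    for line in cpg_bed_lines("chr_demo", "ACGTCG"):
        print(line)
```

The same position file could then be used for both the canonical and the modified dataset, provided every read covering those positions in each sample has the stated status.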