Workflow to extract chunks for a randomer dataset

Question

Hi Marcus,

We have finally managed to get a dataset with good coverage of randomers, with and without 8-oxoG at the center, surrounded by 4 random bases on either side. I would like to train a model on this dataset by extracting 5-mer chunks, and I would like your help with extracting these chunks from my dataset.

Do I start off by trimming the reads so that I isolate the randomer by itself, and then segment/extract 5-mer chunks out of each 9-mer? Or is there a better way to use Remora to do this?

Thanks,
Mohith

Answer 1 · 2024-05-01T04:38:31.000Z

Remora does not directly support randomer processing. Randomer processing is quite a bit more involved and thus has been stored in the Betta repository. I would recommend contacting technical/customer support in order to apply for access to Betta.

At a high level though, 5-mers are not likely to be a large enough random context to train a robust model. Remora does not extract chunks of fixed sequence length, but instead extracts chunks of fixed signal length. These chunks thus contain variable widths of sequence, and the constant sequence outside of your randomer would then be included in many chunks. Applying such a model to a new chunk of data without the same context may have unexpected results. We would recommend at least 20, and ideally more than 40, random bases around the focus base of the randomer.
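
To make the fixed-signal-length point concrete, here is a minimal toy sketch in plain Python. It does not use the Remora API; the function names, the per-base dwell values, and the chunk length of 200 samples are all hypothetical and chosen only for illustration. It counts how many constant (non-random) bases end up inside a signal chunk centered on the focus base.

```python
from itertools import accumulate


def bases_in_chunk(dwells, focus_idx, chunk_len):
    """Return (first, last) base indices overlapped by a fixed-length
    signal chunk centered on the middle sample of the focus base.

    dwells: number of signal samples assigned to each base (toy values).
    """
    # Signal start offset of each base, with the total length appended.
    starts = [0] + list(accumulate(dwells))
    center = starts[focus_idx] + dwells[focus_idx] // 2
    chunk_start = max(0, center - chunk_len // 2)
    chunk_end = min(starts[-1], chunk_start + chunk_len)
    # A base is covered if its sample interval intersects the chunk.
    covered = [
        i
        for i in range(len(dwells))
        if starts[i] < chunk_end and starts[i + 1] > chunk_start
    ]
    return covered[0], covered[-1]


def flank_bases_in_chunk(dwells, focus_idx, chunk_len, n_random_each_side):
    """Count covered bases that lie outside the random region."""
    first, last = bases_in_chunk(dwells, focus_idx, chunk_len)
    rand_first = focus_idx - n_random_each_side
    rand_last = focus_idx + n_random_each_side
    left = max(0, rand_first - first)
    right = max(0, last - rand_last)
    return left + right


# Toy read: 30 bases, 10 samples per base, focus base at index 15.
dwells = [10] * 30
print(flank_bases_in_chunk(dwells, 15, 200, 4))   # 12 constant bases in the chunk
print(flank_bases_in_chunk(dwells, 15, 200, 20))  # 0 constant bases in the chunk
```

In this toy setup, the 200-sample chunk spans 21 bases, so a randomer with 4 random bases per side leaves 12 constant bases inside the chunk, while 20 random bases per side leaves none. Real dwell times vary from base to base, so the number of bases covered will differ from read to read.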

I hope this helps a bit, and would be happy to help further if you are able to gain access to Betta.