yachielab/Interstellar

Is it possible to interpret this read structure I have ?


primers

I want to deduplicate my long-read ONT data using the following read structure. The closest tool I can find handles 10x libraries on long-read platforms, but in that case there is a defined structure at one end of the read to look for. In my case, I want to use the 5' and 3' UMIs in combination.

kijiy commented

Hi, a regular expression pattern like
^(?PCTACT...)(?P.{25})(?PGTGG...)(?P.+)(?PGCATC)(?P.{15})(?PGGCT...).*$
should be able to extract your sequence segments. This website (https://regex101.com/) is useful for testing the regular expression pattern (specify 'python'). In addition to the default python regular expression, you can use 'fuzzy matching' regular expression patterns to allow constant sequences to have mismatches/indels.

Note that it might take time to process long-read sequences.

Sorry, I am not familiar with this kind of pattern matching.

I tried the pattern you recommended on the website and it gives errors:


"^(?PCTACT...)(?P.{25})(?PGTGG...)(?P.+)(?PGCATC)(?P.{15})(?PGGCT...).*$"gm

All the errors detected are listed below, from left to right, as they appear in the pattern.
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure
(? Incomplete group structure
) Incomplete group structure

kijiy commented

I'm sorry, it seems the group names disappeared from the pattern in my previous comment.

Here is the correct regex pattern. It won't work verbatim, because I didn't write out the full fwd/linker/rev sequences:
^(?P<fwd>CTACT[rest of the fwd sequence])(?P<umi1>.{25})(?P<linker5>GTGG[rest of the linker sequence])(?P<DNA>.+)(?P<constant>GCATC)(?P<umi2>.{15})(?P<rev>GGCT[rest of the rev sequence]).*$

Replace each [rest of ...] block with the correct sequence. I would recommend you first play around with regular expressions on the website I shared above! The regex library manual is also helpful: https://pypi.org/project/regex/.
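For anyone who wants to try the named-group pattern outside regex101, here is a minimal sketch in plain Python. It is not Interstellar's own code. The tails of the constant sequences (`FWD`, `LINKER5`, `REV`) are invented placeholders, because the thread only gives their first few bases, and the helpers `extract_segments` and `deduplicate` are hypothetical names. It uses the standard `re` module with exact matching only; allowing mismatches/indels would need the fuzzy syntax of the third-party `regex` library instead.

```python
import re

# ASSUMPTION: placeholder constant sequences. Only the leading bases
# (CTACT, GTGG, GCATC, GGCT) come from the thread; the rest is made up.
FWD = "CTACTAAAA"
LINKER5 = "GTGGCCCC"
CONSTANT = "GCATC"
REV = "GGCTTTTT"

# Same layout as the corrected pattern: fwd, 25-nt UMI, linker, insert,
# constant, 15-nt UMI, rev, then anything up to the end of the read.
READ_PATTERN = re.compile(
    rf"^(?P<fwd>{FWD})"
    r"(?P<umi1>.{25})"
    rf"(?P<linker5>{LINKER5})"
    r"(?P<DNA>.+)"
    rf"(?P<constant>{CONSTANT})"
    r"(?P<umi2>.{15})"
    rf"(?P<rev>{REV}).*$"
)


def extract_segments(read):
    """Return a dict of named segments, or None if the read does not match."""
    match = READ_PATTERN.match(read)
    return match.groupdict() if match else None


def deduplicate(reads):
    """Keep the first read seen for each (umi1, umi2) combination."""
    first_seen = {}
    for read in reads:
        segments = extract_segments(read)
        if segments is None:
            continue  # read does not follow the expected structure
        key = (segments["umi1"], segments["umi2"])
        first_seen.setdefault(key, read)
    return list(first_seen.values())


if __name__ == "__main__":
    # Synthetic read built from the placeholder pieces above.
    umi1 = "ACGTACGTACGTACGTACGTACGTA"  # 25 nt
    umi2 = "AACCGGTTAACCGGT"            # 15 nt
    read = FWD + umi1 + LINKER5 + "TTGACCATTGACCA" + CONSTANT + umi2 + REV
    print(extract_segments(read))
```

Keying on the `(umi1, umi2)` pair is only the simplest way to combine both UMIs. It treats UMIs that differ by a single sequencing error as distinct molecules, which matters on ONT data, so some error-tolerant grouping would probably be needed in practice.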

I hope it helps.

Thank you very much for the help.