fhcrc/mutgen

Mutually exclusive overlapping matches are mutated

cmccoy opened this issue · 5 comments

It looks like overlapping motifs can introduce mutations which should be mutually exclusive. For example, the spec (which I believe makes AAA mutate to AGA with very high probability):

|--------+-----------+-------|
|  motif | mut_index | G     |
|--------+-----------+-------|
|  AAA   | 1         | 0.99  |
|--------+-----------+-------|

Applied to:

>sequence
GAAAAG

Mutates the sequence to "GGGAAG".

I'm not sure why the first two A's are mutated - is mut_index 1-based?
If not, I'd expect the outcome to be either:

GAGAAG or GAAGAG, since a mutation in either of the two overlapping AAA 3mers breaks the motif for the other.
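
To make the expectation concrete, here is a minimal Python sketch. It is not mutgen code: `find_matches` and `mutate_at` are hypothetical helpers, and `mut_index` is assumed to be zero-based. It lists the overlapping AAA matches in the example and the sequence each one would produce on its own.

```python
def find_matches(seq, motif):
    """Start positions of every motif occurrence, overlapping ones included."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)]


def mutate_at(seq, pos, base):
    """Return seq with the character at pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


seq, motif, mut_index, base = "GAAAAG", "AAA", 1, "G"

starts = find_matches(seq, motif)
outcomes = [mutate_at(seq, s + mut_index, base) for s in starts]
print(starts, outcomes)  # [1, 2] ['GAGAAG', 'GAAGAG']

# Either mutation removes every AAA, so the other match no longer exists.
print([find_matches(o, motif) for o in outcomes])  # [[], []]
```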

Hmm... the index is supposed to be zero-based, so that shouldn't be
happening. And the tests I've done should catch any bad indexing in the
position that is supposed to be mutated. So my guess is that the indexing
involved in building up the mutated sequence is what's to blame here. I
haven't done any testing on that yet.

Currently, the program only looks at the original unmutated sequence for
where it can induce mutations. I would like this to remain possible for
some of the simulations I'm doing, but I think we could add an option for
matching against the sequence that's being generated as you mutate. Erick
and I have also been chatting about using exponential clocks, and I think
that approach would avoid these issues, though admittedly with a bunch of
"extra stuff" going on.
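
A rough sketch of the two matching modes, for illustration only. This is not mutgen's implementation; the `mutate` function and its `rematch` flag are hypothetical names, and `mut_index` is assumed to be zero-based. With `rematch=False` motifs are located in the original sequence, so both overlapping matches can fire. With `rematch=True` each position is checked against the sequence as mutated so far.

```python
import random


def mutate(seq, motif, mut_index, base, prob, rematch=False, rng=random):
    """One left-to-right pass over seq, mutating motif matches with probability prob."""
    original = seq
    out = list(seq)
    for i in range(len(seq) - len(motif) + 1):
        # Choose which sequence the motif is matched against.
        template = "".join(out) if rematch else original
        if template.startswith(motif, i) and rng.random() < prob:
            out[i + mut_index] = base
    return "".join(out)


# prob=1.0 makes the pass deterministic, since random() is always < 1.0.
print(mutate("GAAAAG", "AAA", 1, "G", 1.0))                # GAGGAG
print(mutate("GAAAAG", "AAA", 1, "G", 1.0, rematch=True))  # GAGAAG
```

Note that matching against the original gives GAGGAG here, not the GGGAAG from the report, so the reported output cannot be explained by the matching mode alone.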

Feel free to hack on this, but if you don't, I'll take a look first thing
on Monday.

(Sent in reply to Connor McCoy's notification email of Sat, May 17, 2014 at 11:18 AM.)

Christopher Small
Systems Analyst Programmer - Matsen Group
Fred Hutchinson Cancer Research Center
csmall@fhcrc.org

> I would like this to remain possible for
> some of the simulations I'm doing, but I think we could add an option for
> matching against the sequence that's being generated as you mutate

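I'm not sure that's desirable if you're processing the sequence left-to-right. Given a motif such as AA -> AG, the outcomes AAA -> AGA and AAA -> AAG should be equally likely, but a left-to-right pass that rematches against the partially mutated sequence favors the leftmost match.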
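
To put numbers on that, here is a small sketch that computes the exact outcome distribution of a single left-to-right pass with rematching. `left_to_right_outcomes` is a hypothetical helper rather than anything in mutgen, and the per-match mutation probability `p` is an assumed parameter.

```python
from fractions import Fraction


def left_to_right_outcomes(seq, motif, mut_index, base, p):
    """Exact distribution over results of one left-to-right pass with rematching."""
    dist = {seq: Fraction(1)}
    for i in range(len(seq) - len(motif) + 1):
        nxt = {}
        for s, w in dist.items():
            if s.startswith(motif, i):
                pos = i + mut_index
                hit = s[:pos] + base + s[pos + 1:]
                nxt[hit] = nxt.get(hit, 0) + w * p      # match mutated
                nxt[s] = nxt.get(s, 0) + w * (1 - p)    # match left alone
            else:
                nxt[s] = nxt.get(s, 0) + w              # no match at i
        dist = nxt
    return dist


dist = left_to_right_outcomes("AAA", "AA", 1, "G", Fraction(1, 2))
for s, w in sorted(dist.items()):
    print(s, w)
# AAA 1/4
# AAG 1/4
# AGA 1/2
```

With p = 1/2, AGA comes out twice as often as AAG, because the second match is only considered when the first one was left unmutated.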

But if you're interested in modelling a single round of APOBEC mutation,
this is actually more accurate, because APOBEC "scans" along the sequence
as it mutates. I don't know how AID works though.

Still you make a good point, and I guess the bottom line is that this is
just further motivation for the exponential clock work Erick and I are
talking about.
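
For reference, one way an exponential-clock scheme could look, sketched in Gillespie style. This is an assumption about the approach, not the design Erick and I have settled on: every current motif match carries an independent clock with the same `rate`, the first clock to ring triggers its mutation, and matches are recomputed after each event. `clock_mutate` and all of its parameters are hypothetical.

```python
import random


def clock_mutate(seq, motif, mut_index, base, rate, t_end, rng):
    """Mutate seq until time t_end, with one exponential clock per current match."""
    t = 0.0
    while True:
        starts = [i for i in range(len(seq) - len(motif) + 1)
                  if seq.startswith(motif, i)]
        if not starts:
            return seq
        # Waiting time until the first of len(starts) independent clocks rings.
        t += rng.expovariate(rate * len(starts))
        if t > t_end:
            return seq
        # Equal rates, so each current match is equally likely to ring first.
        pos = rng.choice(starts) + mut_index
        seq = seq[:pos] + base + seq[pos + 1:]


results = {clock_mutate("GAAAAG", "AAA", 1, "G", 1.0, 1e9, random.Random(seed))
           for seed in range(200)}
print(sorted(results))  # ['GAAGAG', 'GAGAAG']
```

Because matches are recomputed after every event, the two overlapping AAA matches stay mutually exclusive, and neither is favored by its position.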

There was an indexing bug with nonzero mut_index causing the incorrect output you observed. This has been fixed now (72694ae), so closing.

Let's loop @ematsen in on the other things (either IRL or another issue).

Thanks! ccing @matsen.