lenaschimmel/sc2rf

Q: Why show all donors not just the relevant ones?

corneliusroemer opened this issue · 6 comments

I'm analyzing one sequence and am wondering why you output all potential donors/parents, not just the two that seem most relevant here: BA.1/21J?

image

Are my arguments wrong? When I reduce parents to 0-5, I get no output, which is weird. I don't quite understand what's going on here.

Why do you count 2 intermissions longer than 2 here? All intermissions I see are <2? Or do I misunderstand intermission?

I can't see more than 2 red dots in sequence in green area, or more than 2 green dots in red area.

So I don't know how you count 5 breakpoints there.

image

The first question is easy to answer: you used --unique 0, which basically means: Show me all donors which are probably responsible for at least 0 mutations, or even shorter: Show them all. I did not think about that possibility, and I just realized that --unique 0 might be equivalent to --force-all-parents, which makes that option somewhat redundant.
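To illustrate why a threshold of 0 lets every donor through, here is a minimal sketch. It assumes the filter is a plain "at least N mutations" comparison; the function name `filter_donors` and the counts are hypothetical and are not taken from sc2rf's code.

```python
def filter_donors(unique_counts, min_unique):
    """Keep donors credited with at least `min_unique` mutations (hypothetical helper)."""
    return [name for name, count in unique_counts.items() if count >= min_unique]


# Made-up counts, for illustration only.
counts = {"BA.1": 12, "21J": 9, "BA.2": 0, "21I": 0}

# A count can never be below 0, so a threshold of 0 keeps every donor.
print(filter_donors(counts, 0))
# A threshold of 1 drops donors that explain no mutations at all.
print(filter_donors(counts, 1))
```

With a threshold of 1, only the donors that actually contribute something remain in the output.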

The second one… can I just say that my concept of "intermissions" is very weird and should probably be scrapped anyway?

But there's also an actual answer to your question: There are 10 intermissions detected, and if --max-intermission-count is not explicitly set, it defaults to 8. This means that 8 of 10 are exempt from the breakpoint calculation, the remaining 2 still create 2 breakpoints each, plus the one "actual" breakpoint makes 5.
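The arithmetic above can be written out as a small sketch. This only mirrors the calculation as described in this comment; `count_breakpoints` is a hypothetical name, and the real implementation in sc2rf may be structured differently.

```python
def count_breakpoints(actual_breakpoints, intermissions, max_intermission_count=8):
    """Reproduce the breakpoint total as described in this thread (hypothetical helper)."""
    # Up to `max_intermission_count` intermissions are exempt from the count.
    exempt = min(intermissions, max_intermission_count)
    # Every intermission beyond that limit adds two breakpoints (entering and leaving it).
    counted = intermissions - exempt
    return actual_breakpoints + 2 * counted


# 10 intermissions, default limit of 8, one actual breakpoint: 1 + 2 * (10 - 8) = 5
print(count_breakpoints(1, 10))
```

As long as the number of intermissions stays at or below the limit, only the actual breakpoints are counted.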

Honestly, I'm not sure if this concept is broken, or if it is actually useful and just needs better documentation and a more intuitive output.

I see, thanks for explaining. I took that query from your README, where it says use this to see more potential recombinants.

Max intermission count now also makes sense - now I understand how this came about. In this case I'd want different settings for both parameters :)

> took that query from your README, where it says use this to see more potential recombinants.

Oops 😅 I will update the query in the README to --unique 1.

> In this case I'd want different settings for both parameters

Which two parameters do you mean? Currently there's --breakpoints, --max-intermission-length and --max-intermission-count. What's missing?

The idea of a threshold of intermissions above which you count things as breakpoints is weird. I guess it's necessary to prevent messy sequences from having apparently low numbers of breakpoints...

It's just confusing that these things start counting as breakpoints. You could keep them as a penalty as if they were breakpoints - but not call them as such? Do you see what I mean?
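One way to picture this suggestion is a sketch that reports the two numbers separately and only combines them when filtering. All names here (`BreakpointReport`, `make_report`, `score`) are hypothetical, and the default limit of 8 and the two-per-intermission penalty are taken from the explanation earlier in this thread.

```python
from dataclasses import dataclass


@dataclass
class BreakpointReport:
    breakpoints: int           # actual switches between donors
    intermission_penalty: int  # penalty for intermissions beyond the limit

    @property
    def score(self):
        # The value a filter would compare against; it is not shown as "breakpoints".
        return self.breakpoints + self.intermission_penalty


def make_report(actual_breakpoints, intermissions, max_intermission_count=8):
    """Build a report that keeps the penalty separate from the breakpoint count."""
    excess = max(0, intermissions - max_intermission_count)
    return BreakpointReport(actual_breakpoints, 2 * excess)


report = make_report(1, 10)
print(f"breakpoints={report.breakpoints}, "
      f"penalty={report.intermission_penalty}, score={report.score}")
```

The filtering behaviour would stay the same, but the output would show 1 breakpoint plus a penalty of 4 instead of 5 breakpoints.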

I think I know what you mean - but I'm not yet convinced that it will be less confusing. I believe there must be a better solution.