qiime2/q2-feature-classifier

`extract-reads`: add read orientation parameter

Closed this issue · 11 comments

Improvement Description
extract-reads should have a parameter to specify the orientation of reads (forward | reverse | both) or just autodetect the orientation by attempting to match primers to the reverse complement of reads that do not match on first pass.

Current Behavior
Assumes all reads are forward orientation. If databases are in mixed orientation, this can cause trouble (see forum xref below)

Proposed Behavior
Add parameter to specify read orientation. If read orientation is both, reads that do not match primers will be reverse complemented to attempt a second extraction.

This will have the added benefit of providing a way to homogenize mixed-orientation reads, which currently cause problems with classify-sklearn.

References
forum xref

Hello @nbokulich, has this issue already been resolved? I see that there is now an auto option for read_orientation based on confidence estimates for first 100 reads.

no, the read_orientation parameter is in classify_sklearn, this issue is to add one to extract_reads

@nbokulich should this look something like qiime2/q2-cutadapt#33

actually, it looks like checking both orientations has already been added, but not exposed as an option.

We could expose an orientation option, it could speed things up if someone knows their sequences are in a specific orientation. But we could also just close this issue down since the problem is solved (or maybe never existed and that forum user was just wrong?). On these lines:

amp = _exact_match(sequence, f_primer, r_primer)
if not amp:
amp = _exact_match(sequence.reverse_complement(), f_primer, r_primer)
if not amp:
amp = _approx_match(sequence, f_primer, r_primer, identity)
if not amp:
amp = _approx_match(
sequence.reverse_complement(), f_primer, r_primer, identity)

we could complicate the if statements, something like:

    if orientation in ['forward', 'both']:
        amp = _exact_match(sequence, f_primer, r_primer)
    if not amp and orientation in ['reverse', 'both']:
        amp = _exact_match(sequence.reverse_complement(), f_primer, r_primer)
    if not amp and orientation in ['forward', 'both']:
        amp = _approx_match(sequence, f_primer, r_primer, identity)
    if not amp and orientation in ['reverse', 'both']:
        amp = _approx_match(
            sequence.reverse_complement(), f_primer, r_primer, identity)

any thoughts?

If I recall correctly, the primary reason @thermokarst had me control the behavior in qiime2/q2-cutadapt#33 with a parameter instead of just making it happen automatically was performance. I don't know how the runtime of this method compares to the runtime of the affected method in qiime2/q2-cutadapt#33, but if you think it could benefit from some amount of speedup we could add an orientation parameter that would presumably default to 'both' to maintain current behavior and allow the user to choose 'forward' or 'reverse' if they do happen to know for certain they have a particular orientation.

Hey there @Oddant1! The q2-cutadapt feature you have linked to is a little bit different --- it is for working with SampleData[PairedEndSequencesWithQuality] that contain mixed orientation reads. That feature doesn't allow users to specify the orientation as being forward or reverse, it just allows for the two-pass removal of mixed-orientation. I think that is a bit different than what is being discussed here - is that the case, @nbokulich ?

you are correct @thermokarst , the need here is different.

if you think it could benefit from some amount of speedup we could add an orientation parameter that would presumably default to 'both' to maintain current behavior and allow the user to choose 'forward' or 'reverse'

yes it would probably really increase runtime here. extract-reads can be painfully slow, so specifying the orientation could potentially cut runtime in half.

It would be really easy to implement — I think the example I gave above is all that needs to be done, other than exposing the option — so maybe just implement and then run a tiny benchmark before we merge?

(on another topic, extract-reads could probably be turbo-charged if we use a different method for primer alignment; maybe something like vsearch, which is written in C++ and already has an orientation parameter, but that is only tangentially related to this issue)

Ok, sorry everyone. I missed this issue when it first appeared. I should have jumped on it then. Sorry especially to @Oddant1.

Do we really want to implement this feature? Most of the reference databases I have seen have (intentionally or otherwise) mixed orientation reads in them. Giving users the option of doing something that is fast but wrong could be dangerous.

With regard @nbokulich‘s apparently deliberately off-topic regarding switching aligners, my personal belief is that we should split read extraction into its own plugin, perhaps wrapping one of the existing PCR simulation packages that do this much better than we do (I did?). My current favourite is ipcress. I think it’s got the wrong sort of license but my goodness it is fast, and feature-rich.

pending @Oddant1 's benchmark, #149 has the potential to cut runtimes significantly (when a user knows the read orientation). So yes, I think this is a worthy endeavor even if we eventually replace with a faster aligner.

Do we really want to implement this feature? Most of the reference databases I have seen have (intentionally or otherwise) mixed orientation reads in them. Giving users the option of doing something that is fast but wrong could be dangerous.

The default should certainly be "both" to retain the current behavior (as @Oddant1 has done in #149 ). But yes, I think we want to give users the option if they know the orientation of their reads. We could raise a warning if you are really concerned.

Perhaps we should put the inverse warning in. Collect stats if the user uses both, and if we don’t find any reverse-oriented sequences, warn the users that they are wasting precious clock cycles :)

"both" keeps things running as they currently are — and the RC is only checked if the primer does not match the read in its original ("F") orientation. So collecting stats and warning would do little help.

if orientation!='both' and the primers do not hit the first N reads (100?), extract-reads could exit early and raise an error Primers do not match the first 100 reads in {orientation} orientation. Are you sure your reads are in the correct orientation? Please confirm the read orientation, set the "check_orientation" parameter to False or use "orientation=both" to proceed.