BioJulia/BioSequences.jl

`ispalindromic` with respect to `reverse` or `reverse_complement`?

Closed this issue · 6 comments

I've been trying to fuzz the parsers of BioSequences a bit using Supposition.jl, making use of the very nice interfaces you have (I haven't found any actual bugs so far, and performance is absolutely stellar even for humongous inputs, good job everyone! :D) and came across this:

julia> ispalindromic(LongDNA{4}("AA"))
false

julia> ispalindromic(LongDNA{4}("AT"))
true

Now, I'm not a Bioinformatics guy so this may just be some insider knowledge I don't have, and the docstring of ispalindromic doesn't say, but which of reverse and reverse_complement is ispalindromic referring to when it says that the sequence is palindromic? The current implementation is consistent with reverse_complement, but not with reverse.

If this is just missing from the docs, something like this should clear it up:

diff --git a/src/biosequence/predicates.jl b/src/biosequence/predicates.jl
index 37c2e9d..672680b 100644
--- a/src/biosequence/predicates.jl
+++ b/src/biosequence/predicates.jl
@@ -61,7 +61,18 @@ end
 """
     ispalindromic(seq::BioSequence)
 
-Return `true` if `seq` is a palindromic sequence; otherwise return `false`.
+Return `true` if `seq` is a palindromic sequence with respect to `reverse_complement`;
+otherwise return `false`.
+
+```
+# with respect to `reverse`
+julia> ispalindromic(LongDNA{4}("CGGC"))
+false
+
+# with respect to `reverse_complement`
+julia> ispalindromic(LongDNA{4}("CGCG"))
+true
+```
 """
 function ispalindromic(seq::BioSequence{<:NucleicAcidAlphabet})
     for i in 1:cld(length(seq), 2)

This was found using the following fuzzing setup:

julia> using Supposition, BioSequences

julia> dna_string = Data.Text(Data.SampledFrom(('A','C','T','G')));

julia> palindromic_dna = map(dna_string) do s
           b = LongDNA{4}(s)
           b*reverse(b)
       end;

julia> @check ispalindromic(palindromic_dna);
┌ Error: Property doesn't hold!
│   Description = "ispalindromic"
│   Example = (AA,)
└ @ Supposition ~/.julia/packages/Supposition/KpGkN/src/testset.jl:292
Test Summary: | Fail  Total  Time
ispalindromic |    1      1  0.0s

the complementary construction with reverse_complement passes successfully (it tries 10_000 sequences by default, so that's why it takes almost a second to run through):

julia> palindromic_compl_dna = map(dna_string) do s
           b = LongDNA{4}(s)
           b*reverse_complement(b)
       end;

julia> @check ispalindromic(palindromic_compl_dna);
Test Summary: | Pass  Total  Time
ispalindromic |    1      1  0.8s

I've noticed that the fuzzer I've been using above is a bit too friendly, only producing inputs with even length. Adjusted for that, ispalindromic claims that inputs with odd length aren't palindromic either, no matter with respect to which reverse they are constructed:

julia> using Supposition, BioSequences

julia> dna_string = Data.Text(Data.SampledFrom(('A','C','T','G')));

julia> @check (s=dna_string, m=Data.Just("") | Data.SampledFrom(('A','C','T','G'))) -> begin
           b = LongDNA{4}(s)
           c = b*LongDNA{4}(m)*reverse_complement(b)
           event!("Complete sequence", c)
           ispalindromic(c)
       end;
Events occured: 1
    Complete sequence
        A
┌ Error: Property doesn't hold!
│   Description = "##SuppositionAnon#560"
│   Example = (s = "", m = 'A')
└ @ Supposition ~/.julia/packages/Supposition/KpGkN/src/testset.jl:292
Test Summary:         | Fail  Total  Time
##SuppositionAnon#560 |    1      1  0.0s

I'm not a Bioinformatician so I don't really know what the exact semantics of ispalindromic are and considering the current implementation is consistent with reverse_complement, this may be intentional. If I'm understanding this correctly, there are no self-complementary nucleotides, so with respect to reverse_complement, no DNA sequence of odd length can ever be considered palindromic, right? If that's true, it would probably be good to mention that in the docstring too. Something like this maybe:

!!! note "Odd length sequences"
     Since `ispalindromic` is defined with respect to `reverse_complement` and there are no self-complementary
     nucleotides, sequences with odd `length` are never palindromic.

Could also be an opportunity for some optimization, since the length is cached as far as I can tell.

You're right - palindromic sequences are defined in terms of reverse-complementation since this is more relevant biologically (nucleotide reversion doesn't occur in nature, but reverse-comeplementation happens all the time). It's also true that odd-length sequences can never be palindromic. I recall vaguely that this is leveraged in genome assemblers, that only use odd-length kmers, which simplifies the implementation of strand-specific assembly de Bruijn graphs.

A documentation update is in order. The current docstring is not very.. explanatory.

Ah, the tests has a counterexample with a palindromic odd-length sequence: ANT. This is because N is a degenerate base meaning "unknown" base.

Ah, the tests has a counterexample with a palindromic odd-length sequence: ANT. This is because N is a degenerate base meaning "unknown" base.

A case that my very naive fuzzing attempt with "ACTG" obviously doesn't hit, makes sense! Glad my naive fuzzing can help clarify that for either other newbies to Bioinformatics or forgetful people :D

good job everyone!

Let's be honest - good job @jakobnissen and @bicycle1885 :-P

Yes indeed - thanks too for the quick turnaround here!