BioJulia/BioSequences.jl

CenteredSequence{A} type for comparing short reads padding with AA_Gap

gszep opened this issue · 8 comments

gszep commented

Sometimes it makes sense to begin indexing a sequence from its center. For example with hyper-variable loop regions for T cell / B cell receptors or peptides presented by antigen presenting cells. These are short amino acid sequences (lengths<20) whose start and end sequences are structured and center sequences are highly variable. Therefore I propose a CenteredSequence{A} type

Expected Behavior

  • Centered indexing mimicking behavior of OffsetArrays.jl
  • Return AA_Gap instead of out of bounds error, making it easier to align sequences of variable lengths
x = CenteredSequence{AminoAcidAlphabet}("CASSLASGSTGELFF")
x[0] # AA_G
x[-3:3] # LASGSTG
x[0:9] # GSTGELFF--

Possible Implementation

struct CenteredSequence{A <: Alphabet} <: BioSequence{A}
    data::Vector{UInt64}
    len::UInt

    function CenteredSequence{A}(data::Vector{UInt64}, len::UInt) where {A <: Alphabet}
        new{A}(data, len)
    end
end

function CenteredSequence{A}(it) where {A <: Alphabet}
    x = LongSequence{A}(it)
    CenteredSequence{A}(x.data, x.len % UInt)
end

function CenteredSequence{A}(::UndefInitializer, len::Integer) where {A<:Alphabet}
    if len < 0
        throw(ArgumentError("len must be non-negative"))
    end
    return CenteredSequence{A}('-'^len)
end

Base.length(x::CenteredSequence) = x.len % Int
BioSequences.encoded_data_eltype(::Type{<:CenteredSequence}) = UInt64
Base.copy(x::CenteredSequence) = typeof(x)(copy(x.data), x.len)

Base.eachindex(x::CenteredSequence) = range(firstindex(x),lastindex(x))
Base.firstindex(x::CenteredSequence) = 1 - trunc(Int,ceil(length(x)/2))
Base.lastindex(x::CenteredSequence) = trunc(Int,length(x)/2)

@inline function BioSequences.extract_encoded_element(x::CenteredSequence, i::Integer)
    bi = BioSequences.bitindex(x, i % UInt)
    BioSequences.extract_encoded_element(bi, x.data)
end

@inline function BioSequences.bitindex(x::CenteredSequence, i::Integer)
    N = BioSequences.BitsPerSymbol(Alphabet(typeof(x)))
    BioSequences.bitindex(N, BioSequences.encoded_data_eltype(typeof(x)), i)
end

Base.@propagate_inbounds function Base.getindex(x::CenteredSequence, i::Integer)
    firstindex(x)  i  lastindex(x) || return gap(eltype(x))
    data = BioSequences.extract_encoded_element(x, i+1-firstindex(x) )
    return BioSequences.decode(Alphabet(x), data)
end

Base.@propagate_inbounds function Base.getindex(x::CenteredSequence, idx::AbstractVector{<:Integer})
    isempty(idx) && return empty(x)
    v = map(i->x[i],idx)
    return typeof(x)(v)
end

Base.@propagate_inbounds function Base.getindex(x::CenteredSequence, bools::AbstractVector{Bool})
    @boundscheck checkbounds(x, bools)
    v = map(i->x[i], eachindex(x)[bools])
    return typeof(x)(v)
end

@assert BioSequences.has_interface(BioSequence, CenteredSequence{AminoAcidAlphabet}, [alphabet(AminoAcid)...], false)

However there seem to be issues with this implementation
Edit: fixed in BioSequence v3.0.2

x = rand(["WQERWER","PIOUIOP","CASSAKL"],12)
length(unique(x)) # 3 as expected

x = map(LongSequence{AminoAcidAlphabet},x)
length(unique(x)) # 3 as expected

x = map(CenteredSequence{AminoAcidAlphabet},x)
length(unique(x)) # unexpected behvaiour

Dear @gszep

Interesting idea. I'm sure CenteredSequence would be practical in some scenarios like the one you describe. However, from a practical point of view, I am nearly certain implementing CenteredSequence would be an utter nightmare, and an unending source of bugs. This could be justified if CenteredSequence was absolutely critical for BioSequences.jl, but it's not. Therefore, I am against the proposal.

I don't like shooting down reasonable and well thought out proposals from users, so let me justify why I think CenteredSequence would be hard to get right.

The issue is twofold. Most importantly, it violates the very first point of the BioSequence interface, namely:

Its [i.e. BioSequences] subtypes are characterized by:

    •  Being a linear container type with random access and indices Base.OneTo(length(x)).

When we wrote this interface, we did think about whether it was a reasonable assumption. In particular, we thought of circular sequences like plasmids. However, we ended up accepting this limitation, because assuming fixed linear indices makes everything much simpler.

More fundamentally, when implementing any kind of abstract type, there must be a concept of an interface, which makes a set of assumptions that are not violated. Julia has historically been very bad at defining and enforcing these interfaces, which is part of the reason the language is so unstable. We need to do better than what we do already. You mention OffsetArrays, which has recently gained infamy due to its buggy behaviour - see https://yuri.is/not-julia/ , precisely because most AbstractArray code assumes its indices are 1:length(x). It'd be even worse for CenteredSequence, since this assumption is written down in the docstring of BioSequence!

The second reason is, even if 1-based indexing was not part of the BioSequence interface, it's assumptions like that that enables generic code in the first place. Right now, we can implement most methods on BioSequence, because its behaviour is reasonably well defined. If we could not assume sequences length, or that the length is even finite, or how boundschecking works, so much generic code goes out the window.

Now, sometimes, generic code can be re-introduced by introducing new abstractions (similar to Base's nextind, prevind etc), which themselves make assumptions about behaviour. And sometimes, these new abstractions are worth it! BioSequences uses lots and lots of these internally. But every abstraction has the risk of being leaky, of being inefficient, or of not covering edge cases. For example, how do you get an iterator from the 10th element to the 5th to last element in an array where you don't know the indexing scheme? If you know it uses conventional indexing, it's trivial: 10:lastindex(x)-5.

And I think it would be particularly hard to come up with good abstractions for a BioSequence type which also includes CenteredSequence.

Again, I think your proposal is eminently reasonable, and I get its advantage when working with peptides presented by MHC and such. I just don't think we can actually implement it in a fashion that doesn't cause a lot of grief and pain and bugs down the road.

I'll keep this open in case some other devs want to chime in.

By the way, the unexpected behaviour of unique is due to a bug I discovered a few days ago, #243 . It's fixed in the newly-released version 3.0.2.
Edit: Ah no, sorry, that bug is simply due to not having a fallback of hash for BioSequence until v3.0.2.

gszep commented

Thank you for your thorough answer! You've pointed me to issues on correctness in the Julia ecosystem that I did not know existed 😨 I completely agree with you about the can of worms CenteredSequence type might open

Ah, a fellow immunologist! Hello 👋 (I don't actually do much immunology anymore, but it's close to my heart).

One thing that @jakobnissen did not mention is that one of the things that's great about julia is that you can easily make a CenteredSequences.jl package that implements your proposed interface. You can even subtype off of BioSequence{A} as you did, but I think it's important to be well aware of the caveats that Jakob mentions. And I think I agree that it doesn't make sense to include it in the main BioSequences.jl packages because of how difficult it would be to maintain. But if this is useful for your research, I'd encourage you to do it!

I think the correctness issues are not really as bad as that post makes out for the ecosystem as a whole (see this thoughtful but novel-length discussion), though when you're pushing the boundaries of functionality, it's important to keep in mind. Extensive testing is key!

I have a current need for this functionally.

@jakobnissen, @kescobo, and @SabrinaJaye, I wonder whether it is more reasonable to flip the responsibility of @gszep proposal and simply allow something like the following to work?

using BioSequences
using OffsetArrays

offset = -4000:4000

seq = randdnaseq(length(offset))
oa_seq = OffsetVector(seq, offset)

Currently, with BioSequences v3.1 the code errors with the following.

ERROR: MethodError: no method matching (OffsetVector{T} where T)(::LongSequence{DNAAlphabet{4}}, ::UnitRange{Int64})
Closest candidates are:
  (OffsetVector{T} where T)(::AbstractArray, ::Any...; kw...) at ~/.julia/packages/OffsetArrays/80Lkc/src/OffsetArrays.jl:207
Stacktrace:
 [1] top-level scope
   @ REPL[7]:1

So, for this functionality to work, BioSequence would need to subtype AbstractArray. But, I last recall there was some discussion as to whether BioSequence should subtype from AbstractString or AbstractArray. Now that there appears to be a case to subtype from AbstractArray, what are people's thoughts?

#173 is where that other discussion is. @jakobnissen was the primary detractor, though it seems like @SabrinaJaye concurred. I don't have a good sense of the maintenance burden, but I still don't see the benefit over making an OffsetBioSequences.jl or something. I'm quite wary of bringing in the added maintenance burden, when a separate package (that can be independently developed or bitrot as needed) that can add the OffsetVector(seq...) method. It's maybe a bit of type piracy, but 🤷

And there's probably a better way that doesn't do type piracy but can freely convert back and forth. Anyway, I say I'm concerned about maintenance burden, but I'm not the primary maintainer here, so please hold my objection lightly.

Right, so I still stand by my arguments in #173 - subtyping AbstractVector won't do us any favors in the long run. And to be honest, I think that OffsetArrays is an excellent warning against what happens if you create a subtype which violates fundamental assumptions about its supertype.

To implement CenteredSequence, the way I recommend would be:

  • Make a new package for it, which depend on BioSequences
  • Don't subtype from anything. Instead, just patch any internal and external functions from BioSequences which throws / is buggy on CenteredSequence.

There are some design issues that you might want to consider. For example, should eachindex be supported? If so, what should it return, if there are infinite valid indices? Same with firstindex, lastindex. What should iterating over the sequence return? It can't return the elements in order since they "begin" with an infinite series of gaps.

Oh darn, I misspoke. I don't need the gap padding, just the offset - so no violation of assumptions as suggested.

As an aside, with the responsibility inverted, the CenteredSequence functionality might be composed in the following way.

using BioSequences
using BioSymbols

using OffsetArrays

"Unoptimsed PaddedSequence"
struct PaddedSequence
	seq
end

"Unoptimsed getindex method for PaddedSequence"
function Base.getindex(padded::PaddedSequence, i::Int)
	if firstindex(padded.seq)  i  lastindex(padded.seq)
		return getindex(padded.seq, i)
	end
	return gap(eltype(padded.seq))
end

"Unoptimsed getindex method for PaddedSequence"
function Base.getindex(padded::PaddedSequence, r::UnitRange{Int})
	f = Base.Fix1(getindex, padded)
	return f.(r)
end

Checking the padded sequence.

julia> seq = dna"GATC" |> collect
4-element Vector{DNA}:
 DNA_G
 DNA_A
 DNA_T
 DNA_C

julia> pseq = PaddedSequence(seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])

julia> pseq[-1:6]
8-element Vector{DNA}:
 DNA_Gap
 DNA_Gap
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_Gap
 DNA_Gap

Now using Vector{DNA} as a standing for BioSequences subtyped from AbstractArray.

seq = collect(seq)
4-element Vector{DNA}:
 DNA_G
 DNA_A
 DNA_T
 DNA_C

julia> offset = -1:2
-1:2

julia> oa_seq = OffsetVector(seq, offset)
4-element OffsetArray(::Vector{DNA}, -1:2) with eltype DNA with indices -1:2:
 DNA_G
 DNA_A
 DNA_T
 DNA_C

julia> oa_pseq = PaddedSequence(oa_seq)
PaddedSequence(DNA[DNA_G, DNA_A, DNA_T, DNA_C])

julia> oa_pseq[-4:4]
9-element Vector{DNA}:
 DNA_Gap
 DNA_Gap
 DNA_Gap
 DNA_G
 DNA_A
 DNA_T
 DNA_C
 DNA_Gap
 DNA_Gap

I don't think eachindex, firstindex, or lastindex make much sense for a PaddedSequence.