BioJulia/BioSequences.jl

Feature: concatenate bases with sequences

BioTurboNick opened this issue · 4 comments

dna"GTAC" * DNA_A
MethodError: no method matching *(::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, ::BioSymbols.DNA)

This operation works with strings and chars in Julia, so it seems the analogous operation should be possible for sequences and bases.
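For reference, a minimal sketch of the Base behavior being compared against. It uses only Base Julia, and the variable names are illustrative:

```julia
# Base Julia lets `*` concatenate a String with a Char.
s = "GTAC"
c = 'A'

joined = s * c          # String * Char
println(joined)

# The reverse order works too.
println(c * s)
```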

You can use push!(dna"GTAC", DNA_A) for this.

More broadly, Base Julia tends to conflate containers of elements with elements:

julia> iterate(5) # scalars are iterable
(5, nothing)

julia> hcat([1], 1) # scalars are equivalent to 0-dimensional tensors
1×2 Matrix{Int64}:
 1  1

julia> eltype('a') # Chars apparently contain chars???
Char

I think this is a design mistake, and I'm skeptical of bringing it into BioJulia. That's why we have:

julia> iterate(DNA_A)
ERROR: MethodError: no method matching iterate(::DNA)

julia> append!(dna"TAG", DNA_A)
ERROR: MethodError: no method matching append!(::LongSequence{DNAAlphabet{4}}, ::DNA)

julia> dna"TAG" * DNA_A
ERROR: MethodError: no method matching *(::LongSequence{DNAAlphabet{4}}, ::DNA)

I agree that it's a design mistake, though I'm slightly less skeptical of replicating it. I think I'm on @jakobnissen's side, but could be persuaded. I do think there's some utility in matching the semantics of Base, even when they're probably not great.

That said, I think the desire to use append!() over push!(), at least for me, is just a holdover from Python. I definitely used append!() with scalars for ages, thinking it was correct. So if this is an opportunity to educate users about the right functions to use, maybe that's a good thing.
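To make the push!/append! distinction concrete, here is a small sketch using plain Base vectors rather than BioSequences types. The last call only works because Base treats numbers as iterable, which is the conflation discussed above:

```julia
# push! adds a single element; append! adds every element of a collection.
v = [1, 2]

push!(v, 3)           # one element
append!(v, [4, 5])    # all elements of another collection

# Base accepts a scalar here only because numbers are iterable.
append!(v, 6)

println(v)
```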

Maybe we could have an operator that mimics the behavior. Currently it is possible to concatenate two sequences:

LongDNA{2}("ACGT") * LongDNA{2}("TGCA")
8nt DNA Sequence:
ACGTTGCA

What if we had an operator that converts DNA_X to LongDNA{T}([DNA_X]) and thereby enables concatenating symbols onto sequences? Is this also a bad design?

You can certainly convert a biosymbol to a sequence; you just have to be explicit about it:

julia> dna"S" * LongDNA{4}([DNA_W])
2nt DNA Sequence:
SW
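If you need this often in your own code, the explicit conversion can be wrapped in a small helper. This is only a sketch: `concat_symbol` is a hypothetical name, not part of BioSequences, and it assumes BioSequences v3, where `LongDNA{N}` is the sequence type used in the examples above. Defining it under your own function name avoids adding methods to `*` for types you don't own.

```julia
using BioSequences

# Hypothetical helper (not a BioSequences function): wrap the symbol in a
# one-element sequence of the same alphabet, then concatenate.
# Returns a new sequence; the input `seq` is not mutated.
concat_symbol(seq::LongDNA{N}, x::DNA) where {N} = seq * LongDNA{N}([x])

result = concat_symbol(dna"GTAC", DNA_A)
println(result)
```

Unlike push!, this leaves the original sequence untouched, which matters if the left-hand side is a sequence literal you reuse elsewhere.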