BioJulia/BioSequences.jl

add an `ispermitted` for Alphabets?

TransGirlCodes opened this issue · 3 comments

Can we add an ispermitted method for Alphabets?

The background is that I'm working on some of the internals for Kmers.jl, with iterators being the last thing to move over and tidy up. Recall that when kmers only supported 2-bit alphabets, the generic iteration method (where the Kmer and the sequence the kmers are generated from have differing alphabets) skipped over gaps or anything else not in the 2-bit alphabet. Constructors would throw an error instead, which makes sense: the user is asking to make a kmer from something very specific, whereas we want the iterator to just drop the weird symbols and keep going. The encode and decode methods as they are throw, but for my purpose of writing Kmers.jl internals it would be really useful to have encode and decode basically work like `encode(::A, x) = ispermitted(A(), x) ? unsafe_encode(A(), x) : throw_n_stuff`, so I can use the `ispermitted` and `unsafe_encode` methods manually and have the iterator do the right thing rather than throw.
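For illustration, here is a minimal, self-contained sketch of that split. It does not use BioSequences.jl: `ToyDNAAlphabet`, the `Char` symbols, and `encode_permitted` are hypothetical stand-ins, and only the shape of `ispermitted` / `unsafe_encode` / `encode` follows the proposal above.

```julia
# Toy stand-in for a 2-bit DNA alphabet; NOT the BioSequences.jl type.
struct ToyDNAAlphabet end

# True when `x` can be represented in the alphabet.
ispermitted(::ToyDNAAlphabet, x::Char) = x in ('A', 'C', 'G', 'T')

# Encode without validation; the caller must have checked `ispermitted` first.
function unsafe_encode(::ToyDNAAlphabet, x::Char)
    x == 'A' ? 0x00 : x == 'C' ? 0x01 : x == 'G' ? 0x02 : 0x03
end

# Checked encode, built from the two pieces above.
function encode(a::ToyDNAAlphabet, x::Char)
    ispermitted(a, x) ? unsafe_encode(a, x) :
        throw(ArgumentError("cannot encode $(repr(x))"))
end

# Hypothetical iterator-style caller: skip unpermitted symbols instead of throwing.
encode_permitted(a, s) = [unsafe_encode(a, c) for c in s if ispermitted(a, c)]
```

With this shape a constructor would call `encode` and get the error, while an iterator would call `ispermitted` and `unsafe_encode` directly and carry on past anything it cannot represent.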

I accept that there could be weird symbols. However, I'm not sure I understand how a weird symbol would be encoded and decoded, what constraints are needed for valid symbols to keep working, or what the kmer is constructed from in your scenario. I think there are two construction scenarios: one from a structure with bit packing and one from a structure without (a Vector).

I think there are more constraints to consider when there is bit packing. For example, would unsafe_encode be constrained to encode with the same number of bits? I assume there are not necessarily enough bits to represent every distinct weird symbol that may be encountered. That pushes us towards having something in BioSequences that represents the presence of an invalid symbol, which would drop information. What do you think?

In terms of the skipping over that you spoke of, does that mean that the next valid symbol gets included in the kmer?
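To make the question concrete, here is a sketch of one possible reading, in which an invalid symbol discards the partial kmer and the window restarts after it. This is an assumption about the intended behaviour, not something confirmed in this thread; `kmers_skipping` is a hypothetical helper working on plain strings rather than BioSequences types.

```julia
# Hypothetical helper: collect k-mers (as strings) from `seq`, restarting the
# window whenever a symbol outside `permitted` is met.
function kmers_skipping(seq::AbstractString, k::Integer; permitted = "ACGT")
    out = String[]
    window = Char[]
    for c in seq
        if c in permitted
            push!(window, c)
            # Keep at most k symbols in the window.
            length(window) > k && popfirst!(window)
            length(window) == k && push!(out, String(window))
        else
            # Drop the partial k-mer; the next one starts after `c`.
            empty!(window)
        end
    end
    return out
end
```

Under this reading no kmer ever spans an invalid symbol. The other reading, where the next valid symbol is simply appended to the current window, would amount to removing the `empty!(window)` branch.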

So if I understand correctly, you want a `tryencode(::A, x)::Union{eltype(A), Nothing}` where `A <: Alphabet`? That sounds reasonable.
Should the title of this issue be changed?
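A rough sketch of what a `tryencode`-style function could look like, again on a toy type rather than the real BioSequences.jl alphabets. `TwoBitToy`, the `Char` symbols, and the `UInt8` return type are assumptions made for the example.

```julia
# Toy 2-bit alphabet; NOT a BioSequences.jl type.
struct TwoBitToy end

# Returns the encoded value, or `nothing` when `x` is not in the alphabet.
function tryencode(::TwoBitToy, x::Char)::Union{UInt8, Nothing}
    i = findfirst(==(x), "ACGT")
    return i === nothing ? nothing : UInt8(i - 1)
end

# A throwing `encode` can then be layered on top of `tryencode`.
function encode(a::TwoBitToy, x::Char)
    y = tryencode(a, x)
    y === nothing && throw(ArgumentError("cannot encode $(repr(x))"))
    return y
end
```

An iterator would call `tryencode` and skip on `nothing`, which covers the `ispermitted` plus `unsafe_encode` use case with a single lookup.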

If what you are literally asking for is a way to check whether some symbol is allowed in an Alphabet, this can already be done with `x in symbols(A)`. This is not optimised, though, so a dedicated method could be added if tryencode is not sufficient (although it really should be!)