BioJulia/BioSequences.jl

New Feature: Molecular weight calculations for BioSequences

benjaminlozanow opened this issue · 8 comments

Description: I propose adding a new function molecular_weight() to the BioSequences package that calculates the average or monoisotopic molecular weight of a protein or nucleic acid sequence. The function should take several optional parameters, such as double_stranded, circular, and monoisotopic, that allow users to customize the calculation.

Problem: Currently, neither BioSequences nor any other BioJulia package provides a built-in way to calculate the molecular weight of a sequence. This is an important feature for many bioinformatics applications, and including it in the package would be a valuable addition.

Code: I have written the following code to implement the function based on how the implementation was done in BioPython. The code includes support for calculating the molecular weight of RNA, DNA, and amino acid sequences. It also includes tables of weights for the different types of nucleotides and amino acids. I was thinking of adding the feature into longsequences/calculations.jl.

Let me know if this is worth opening a pull request for.

Regards,
Ben

@BioJulia/members, I wonder if it would make sense to put the weights that directly correspond to symbols in the BioSymbols package and the backbone and summation bits here in BioSequences?

Is there some advantage to including this in this package, as opposed to making a small purpose-built package? Might be cool to take advantage of Unitful.jl, but I don't know that we'd want to take on that dependency if this is the only functionality where it would be worth it.

To be clear, I think such functionality absolutely makes sense in BioJulia, just wondering about maintainability and flexibility. On the one hand, this functionality could be pretty straightforward with just a lookup table and multiplying monomer masses by counts, but I can also imagine a lot of complexity (e.g. post-translational modifications, DNA methylation, etc.) that might be desired. Having a separate package might make it a bit easier to tinker, and we could certainly put some downstream tests here to make sure that API changes aren't breaking (or are addressed if they are).
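As a rough illustration of the "lookup table times counts" approach, here is a minimal sketch in plain Julia. It does not use BioSequences types, and it is not the code from the original post. The function name `dna_weight`, the constants, and the rounded average masses are assumptions made for this example. It also assumes each linear strand carries a 5' monophosphate, that one water is lost per phosphodiester bond, and that there are no modifications or ambiguous symbols. The `double_stranded` and `circular` keywords mirror the options named in the issue description.

```julia
# Approximate average masses (Da) of deoxynucleoside monophosphates.
# Rounded, illustrative values; a real implementation should use a vetted table.
const DNMP_MASS = Dict('A' => 331.22, 'C' => 307.20, 'G' => 347.22, 'T' => 322.21)
const H2O_MASS = 18.02
const DNA_COMPLEMENT = Dict('A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A')

# Hypothetical helper: average mass of a DNA sequence given as a string.
function dna_weight(seq::AbstractString; double_stranded::Bool=false, circular::Bool=false)
    isempty(seq) && return 0.0

    # Count each monomer once, rejecting anything outside the table.
    counts = Dict{Char,Int}()
    for nt in seq
        haskey(DNMP_MASS, nt) || throw(ArgumentError("unsupported symbol '$nt'"))
        counts[nt] = get(counts, nt, 0) + 1
    end

    # A linear strand of n monomers has n - 1 bonds; a circular one has n.
    n = length(seq)
    bonds = circular ? n : n - 1

    # Monomer mass times count, minus one water per bond.
    strand = sum(DNMP_MASS[nt] * c for (nt, c) in counts) - bonds * H2O_MASS
    double_stranded || return strand

    # The complementary strand has the same length and the complemented counts.
    comp = sum(DNMP_MASS[DNA_COMPLEMENT[nt]] * c for (nt, c) in counts) - bonds * H2O_MASS
    return strand + comp
end
```

Calling `dna_weight("ATGC")` gives the single-stranded linear mass, and `dna_weight("ATGC"; double_stranded=true, circular=true)` adds the complementary strand and closes both strands. Modifications such as methylation would need extra table entries or a separate correction term, which is the kind of complexity the comment above has in mind.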

Thanks for your comments!

I think what would be worth implementing in BioJulia/BioSequences are utilities for working with sequences, such as RNA translation (already implemented in BioSequences), molecular weight, melting point, isoelectric point, GC content, charge at a given pH, etc. These are useful when working with sequences and are found in Python and R packages, so an implementation in Julia seems natural.

For this, as Kevin pointed out, a separate package might be better for maintainability and keeping things organised. Although it might be confusing because it would be a package of functions applied to BioSequence but not inside BioSequences.

Let me know what you think, either way I'll be happy to contribute to this project.

I wouldn't be against including all these things in BioSequences, if there is a single "correct" way of implementing them. But I suspect properties like melting point are not so straightforward, and may be better suited to their own package - we could just make BioSequencePhysChem.jl or whatever. @benjaminlozanow how straightforward are these computations?

For molecular weight, it certainly is straightforward. We'd need to make some simplifying assumptions, such as a standard molecular weight for each biological symbol (i.e. no weird isotopes, no modifications), but otherwise it seems good.

Perhaps we could implement molecular weight in BioSymbols, and then use those weights in BioSequences? If I'm not mistaken, the sequence weight is just the sum of the individual weights, minus two phosphates for nucleotides, and one water molecule for amino acids.
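For the amino acid case, the "sum of symbol weights minus a correction" idea can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: `AA_MASS`, `WATER_AVG` and `peptide_weight` are hypothetical names, not BioSymbols or BioSequences API. The masses are rounded average masses of the free amino acids, and one water is subtracted per peptide bond, so a chain of n residues loses n - 1 waters.

```julia
# Approximate average masses (Da) of the 20 standard free amino acids.
# Rounded, illustrative values; no isotope variants and no modifications.
const AA_MASS = Dict{Char,Float64}(
    'A' => 89.09,  'R' => 174.20, 'N' => 132.12, 'D' => 133.10, 'C' => 121.16,
    'E' => 147.13, 'Q' => 146.14, 'G' => 75.07,  'H' => 155.15, 'I' => 131.17,
    'L' => 131.17, 'K' => 146.19, 'M' => 149.21, 'F' => 165.19, 'P' => 115.13,
    'S' => 105.09, 'T' => 119.12, 'W' => 204.23, 'Y' => 181.19, 'V' => 117.15,
)
const WATER_AVG = 18.02

# Hypothetical helper: average mass of a linear peptide given as a string.
function peptide_weight(seq::AbstractString)
    isempty(seq) && return 0.0
    total = 0.0
    for aa in seq
        haskey(AA_MASS, aa) || throw(ArgumentError("no mass for symbol '$aa'"))
        total += AA_MASS[aa]
    end
    # One water is released for each peptide bond formed.
    return total - (length(seq) - 1) * WATER_AVG
end
```

With these rounded values, `peptide_weight("GA")` evaluates to roughly 146.14 Da (75.07 + 89.09 - 18.02). In the split suggested above, a table like `AA_MASS` would be the part that could live next to the symbol definitions, while the summation and the water correction would belong with the sequence type.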

I think the best way forward is to make a proof-of-concept package without registering it in the General Registry. Then we can judge better whether it should be its own package, or migrated to BioSymbols/Sequences

> I think what would be worth implementing in BioJulia/BioSequences are utilities when working with sequences such as RNA translation (already implemented in BioSequences), molecular weight, melting point, isoelectric point, GC content, charge given a pH calculations

Hmm, some of these I think make sense in BioSequences, and others not. I'm not totally sure what principle I'm using, though - for example, GC content and translation seem like a yes, the others less so. Maybe it's something about analyzing the sequences as data vs analyzing them as physical molecules?

> Although it might be confusing because it would be a package of functions applied to BioSequence but not inside BioSequences.

This is definitely a worthy consideration - it would technically be what we call "type piracy" in julia-land. But given that we could keep it in the BioJulia org, do downstream tests, and mention it in the BioSequences docs, I think that this is fine.

But I don't feel super strongly, and if @jakobnissen is alright with it, I can get on board. I do think the idea of writing up a package and then pulling stuff in later is sensible, and makes either approach easy.

This feature sounds very cool.

I've noticed that there are some Bio packages scattered around GH that are pretty nice, here is a package that might intersect on some of the points of the discussion related to physicochemical properties implementations and the use of Unitful.jl: ViennaRNA.jl

Perfect!

Thanks for your comments. What I gathered is the following "hypothetical" implementation:

BioSymbols/BioSequence --> molecular weight
BioSequence --> GC content
New package --> Physical-Chemical properties

I could start working on these if it's ok with you and then we see how to merge it.

Hi all, was the molecular_weight() feature ever implemented?