rust-bio/rust-bio

Restructure rust-bio to speedup compilation and reduce dependencies

natir opened this issue · 24 comments

natir commented

Hi everyone.

Rust-bio currently has 28 dependencies, and these dependencies have dependencies of their own. All of this increases the compilation time.

In my own usage, and I think in that of many rust-bio users, not all rust-bio features are needed.

By using cargo features we can reduce the number of dependencies and the compilation time by building only what the user needs.

By default, all features will be activated.

To start, I suggest creating one feature for each important module:

  • io
  • alignment
  • pattern_matching
  • seq_analysis

But we can probably find a cleverer split of rust-bio features.

I agree. The large dependency tree/compile time of rust-bio is the only reason I have released a couple of crates as standalone libraries instead of contributing them to rust-bio. Even if it probably limits exposure and thus usage of these libs.

Dividing into features is probably easiest. There are some cross usages in the modules you propose though (e.g. pattern_matching uses alignment). So it would require some refactoring.
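For illustration, a minimal sketch of what the proposed per-module features could look like. The feature names follow the list above; the module bodies and the function names (`score`, `find`, `enabled_modules`) are hypothetical placeholders, not rust-bio's actual code. The cross usage mentioned here (pattern_matching uses alignment) can be expressed by letting one feature enable another, which may reduce the amount of refactoring needed.

```rust
// Cargo.toml side of the sketch:
//
// [features]
// default = ["io", "alignment", "pattern_matching", "seq_analysis"]
// io = []
// alignment = []
// pattern_matching = ["alignment"]  # enabling this also enables `alignment`
// seq_analysis = []

#[cfg(feature = "io")]
pub mod io {}

#[cfg(feature = "alignment")]
pub mod alignment {
    /// Hypothetical placeholder for the real alignment code.
    pub fn score(a: &[u8], b: &[u8]) -> usize {
        a.iter().zip(b).filter(|(x, y)| x == y).count()
    }
}

#[cfg(feature = "pattern_matching")]
pub mod pattern_matching {
    // Safe to import: the `pattern_matching` feature implies `alignment`.
    use crate::alignment::score;

    /// Hypothetical placeholder that relies on the alignment module.
    pub fn find(pattern: &[u8], text: &[u8]) -> bool {
        score(pattern, text) == pattern.len()
    }
}

#[cfg(feature = "seq_analysis")]
pub mod seq_analysis {}

/// Lists which of the four modules were compiled into this build.
pub fn enabled_modules() -> Vec<&'static str> {
    let mut out = Vec::new();
    if cfg!(feature = "io") {
        out.push("io");
    }
    if cfg!(feature = "alignment") {
        out.push("alignment");
    }
    if cfg!(feature = "pattern_matching") {
        out.push("pattern_matching");
    }
    if cfg!(feature = "seq_analysis") {
        out.push("seq_analysis");
    }
    out
}
```

A downstream user who only wants the io module would then depend on the crate with `default-features = false` and `features = ["io"]`.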

More complicated would be a workspace with several libraries for the components. These would need to be separately published to cargo so an alternative would be to just split them into separate repositories under the rust-bio organization. Benefits will be improved compile times since the crate is the unit of parallelization.

But as we see interest from many directions, from single cell to PDB, keeping a single unit and forcing people to compile a fancy FM-index when all they want is to read a simple FASTA file seems like a bad idea.

Looking through some of the current issues I realize this issue is probably a duplicate of #239.

Hi all, as heavy users of bio we've also felt the impact of the large amount of code & dependencies. I have been considering proposing splitting bio into smaller crates for a while. The alternative of making a series of features should also be considered. We need a healthy discussion on this (and maybe some prototyping). @johanneskoester

Here's some pros/cons that people have mentioned here & in other forums:

Features:

  • smaller amount of change compared to current state 👍
  • easier upgrade path. no new crate names to switch to 👍
  • rogue crate can accidentally switch all features on 👎
  • doesn't parallelize build 👎
  • doc infrastructure for features isn't well developed 👎

Separate Crates:

  • docs & discoverability (separate crates.io page for smaller bites of functionality) 👍
  • long term evolution (e.g. PDB, single cell, more fields of biofx) 👍
  • parallel build 👍
  • churn to upgrade downstream crates 👎
  • interoperability of different versions? harder to maintain this 👎

My initial feeling is that splitting into smaller crates is the right approach, mainly because I think that's what we will need long-term if we want to continue adding a broad set of features.

Thanks for the detailed summary of pros and cons. Looking at it, I also think that splitting rust-bio up into several crates under the rust-bio organization is probably the right way forward, eventually. In any case, splitting into features or crates will require some refactoring, e.g. the io module will probably have to be split across different topics or domains, so that reading and writing certain formats in your downstream code does not require compiling the reading and writing functionality for all other available formats. But in other cases (e.g. pattern_matching, alignment, seq_analysis), the current module setup can provide good indicators for splitting.

natir commented

Indeed, my features split isn't the most optimal.

I hadn't thought about the benefits of separating into smaller crates, and indeed it seems more interesting for the end user. I am a bit afraid, though, that we'll end up at a level of granularity with one crate per function, which would make things more complicated.

I would like to summarize the situation this way. The use of features is a hotfix to reduce the dependency tree and compilation time. Separating into smaller crates requires more time and preparation but is certainly more efficient in the long run.

I don't think the compilation time problem is urgent, so I think splitting into smaller crates is a good idea, even if we need to establish a long-term plan!

vsoch commented

I'm in support of more modular development, for the reasons already stated and better ownership of managing issues, requests for changes, etc. I'm relatively green with rust, so please keep me in mind if there is an effort that needs to do work that a more experienced rustacean might consider arduous (e.g., splitting existing code into separate libraries) because I think I would learn a lot!

Hi guys,
my main concern is maintenance burden. Maintaining multiple crates will only work if this is all fully automated, and it is ensured that stuff stays in sync. Features seem like a reasonable fix to this issue, if they are easy enough to handle this.

But the subcrates don't have to be released synchronously. It seems that quite a few parts have not seen many changes lately and it might be possible to 1.0 these, guaranteeing a stable API. Introducing a breaking change might be more difficult because you probably want to avoid requiring different versions in different subcrates. But on the other hand, if stuff is continuously being added to rust-bio, the library will never reach a stable version, which might also scare potential users (although 10X doesn't seem bothered 😉).

Actually looking at the cross usage of the currently 'defined' modules I created this list:

$ rg "use crate::" |perl -ne '/^(.*?)\/.*use crate::(\w+)::/; print "$1\t$2\n" if $1 ne $2' |sort |uniq
alignment	data_structures
alignment	scores
alignment	utils
data_structures	alphabets
data_structures	utils
io	utils
pattern_matching	alignment
pattern_matching	utils
stats	utils

So it seems utils is used ubiquitously which makes sense. Other usage also looks like mainly helper stuff.
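To make the list above a bit more actionable, here is a small sketch that groups the modules into layers, where each layer only uses modules from the layers below it. The pairs are copied from the `rg` output; the function name `layers` and the layering idea itself are just an illustration and are only as accurate as that snapshot of the code.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// (module, module it uses) pairs copied from the `rg` output above.
pub const USES: &[(&str, &str)] = &[
    ("alignment", "data_structures"),
    ("alignment", "scores"),
    ("alignment", "utils"),
    ("data_structures", "alphabets"),
    ("data_structures", "utils"),
    ("io", "utils"),
    ("pattern_matching", "alignment"),
    ("pattern_matching", "utils"),
    ("stats", "utils"),
];

/// Groups modules into layers: layer 0 uses no other listed module,
/// layer n only uses modules that appear in earlier layers.
pub fn layers(uses: &[(&str, &str)]) -> Vec<Vec<String>> {
    // Map every module to the set of modules it uses.
    let mut deps: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for &(user, used) in uses {
        deps.entry(user).or_default().insert(used);
        deps.entry(used).or_default();
    }

    let mut placed: BTreeSet<&str> = BTreeSet::new();
    let mut result = Vec::new();
    while placed.len() < deps.len() {
        // A module is ready once everything it uses has been placed.
        let ready: Vec<&str> = deps
            .iter()
            .filter(|(m, d)| !placed.contains(*m) && d.iter().all(|x| placed.contains(x)))
            .map(|(m, _)| *m)
            .collect();
        if ready.is_empty() {
            break; // a cycle: the remaining modules cannot be layered
        }
        placed.extend(ready.iter().copied());
        result.push(ready.iter().map(|m| m.to_string()).collect());
    }
    result
}
```

For the pairs listed, this puts alphabets, scores and utils at the bottom, then data_structures, io and stats, then alignment, and finally pattern_matching. If the crate were split, that would be one possible order to extract subcrates in.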

Another idea for improving the compilation time is to avoid procedural macros. They're nice, but sometimes not essential. See salsa-rs/salsa#201 for an example.

Unfortunately, there might not be any easy wins here: I assume that regex and serde are essential, and neither snafu nor derive-new seem to be on a critical path in the build graph:

[image: build graph]

EDIT: regex actually isn't essential.

But the subcrates don't have to be released synchronously. It seems that quite a few parts have not seen many changes lately and it might be possible to 1.0 these, guaranteeing a stable API. Introducing a breaking change might be more difficult because you probably want to avoid requiring different versions in different subcrates. But on the other hand, if stuff is continuously being added to rust-bio, the library will never reach a stable version, which might also scare potential users (although 10X doesn't seem bothered 😉).

This is a very important point and is one of the main reasons I think subcrates is a more scalable/preferable solution.

To be honest I now prefer not to use rust-bio where possible because I generally only want small pieces and it bloats my compile time.

I appreciate @johanneskoester's point on maintenance burden, but I think the community has grown enough now that subcrates could be delegated to certain users who have contributed a lot or would like to continue to maintain certain functionality. We just need to make sure the maintenance guidelines are consistent across all rust-bio repos.

More structure in the rust-bio project sounds good to me, some quick thoughts:

  • central maintenance and contribution guidelines (consolidation and extension of what we have) with clear guidance of what belongs where (e.g. general types into rust-bio-types)
  • more automation wherever possible (release process?)
  • one or more people per subcrate that are responsible for maintenance and releases
  • keep global teams for review and merging, to keep the maintenance burden on individuals as low as possible

Also, @natir: Should we maybe rename this issue to something more general? Maybe "Restructure rust-bio to speedup compilation and reduce dependencies" or something similar?

natir commented

done

We're trying to organize a call about the future structure of the rust-bio organization / project with everybody who's interested over at Discord:
https://discord.com/channels/715534131760595015/715534131760595019/857323527769161759

So unless you have already seen it on Discord, please check it out. We'll definitely document the results here, but it would be great to have as many of you present for that discussion, as possible!

To get the ball rolling I split the submodules into features in #460, which can be the first step to splitting into subcrates in the future. Comments welcome =]

It looks like everyone is in agreement on splitting the crate up.

I don't think it's a good idea to try implementing features as a "middle step" (#460).

I do think it's important to have more discussion on how the smaller crates will look, e.g.

  • what still belongs in the bio crate
  • the proper scope of the smaller crates
  • naming
  • features on those smaller crates
  • documentation

This way, we can have better compilation times for our use cases and still have a library of bioinformatics tools :)

I like the idea of using features as a first step. The reason is the following: even after splitting up, I would like to keep the bio crate itself as a meta-crate that collects all of the other ones. This collection could then still be parameterized by the same features. The feature approach seems the most convenient path of transition for users. Those that don't care won't even notice, those that do care already now get the ability to have smaller build times. And then we can gradually start splitting stuff up into smaller crates if there really is a benefit of that.

For the latter, please help me, if we have the features approach, why would we need individual crates at all (not that I am against it, I just currently do not see the immediate benefit)?

Of course, we could offer other features than those currently suggested in PR #460.

@johanneskoester I get the goal to group under the rust-bio umbrella but IMO meta-crate is the wrong way to go.

It is more semantic to use separate crates. Separate crates also allow for separate versioning.

It will be a breaking change but worth it IMO. It is how tokio and serde do it.

@johanneskoester:

For the latter, please help me, if we have the features approach, why would we need individual crates at all (not that I am against it, I just currently do not see the immediate benefit)?

Functionality-wise it is kind of the same. Features are simpler to keep in sync (it is still only one crate/version being published), but due to how cargo does feature unification it might still end up compiling the whole of rust-bio if any other dependency also uses rust-bio and didn't subselect features.

Having separate crates allows picking only what you need, without the risk of pulling in other parts of rust-bio through feature unification.
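As a toy model of the unification point (not cargo's actual resolver), the sketch below treats the set of features a shared dependency gets built with as the union of what every dependent asks for. The function name `unify` and the example feature lists are made up for illustration.

```rust
use std::collections::BTreeSet;

/// Toy model: a shared dependency is built once, with the union of the
/// features requested by every crate in the graph that depends on it.
pub fn unify(requests: &[&[&str]]) -> BTreeSet<String> {
    requests
        .iter()
        .flat_map(|features| features.iter())
        .map(|f| f.to_string())
        .collect()
}

/// Example: one crate asks for `io` only, while another dependency in
/// the same build leaves all default features switched on.
pub fn example() -> BTreeSet<String> {
    let my_crate: &[&str] = &["io"];
    let other_dependency: &[&str] = &["io", "alignment", "pattern_matching", "seq_analysis"];
    unify(&[my_crate, other_dependency])
}
```

In this model, the careful `io`-only selection does not help: the build still ends up with all four features, which is the "rogue crate" problem mentioned earlier in the thread. Separate crates avoid this because a crate that is never requested is never compiled.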

@allan2:

It is more semantic to use separate crates. Separate crates also allow for separate versioning.

But they require more work for releasing and keeping versions in sync. I agree this is the end goal, but we need to figure out the process first (hence, start with features, then go for the end goal). What I don't want is to increase the maintenance burden of rust-bio without having a way to help maintainers keep everything running =]

It will be a breaking change but worth it IMO. It is how tokio and serde do it.

Tokio is a meta-crate with features: https://github.com/tokio-rs/tokio/blob/ee4b2ede83c661715c054d3cda170994a499c39f/tokio/Cargo.toml

@luiz10x Thanks for the correction! I was thinking of postgres and tokio_postgres, not tokio itself.

Hey, whatever it takes to get to separate crates faster. I'm happy to help out with maintenance.

@johanneskoester keen to hear your thoughts on separate crates as an end goal

Separate crates would in principle be fine for me, as long as the benefits are clear. Currently, the only benefit I see is that the build times become shorter (did I miss something in the arguments above?). The downside is that we would have to maintain multiple crates, keep their versions compatible, and so on. With keeping one repo and using release-please as now already, it seems like at least the release process stays about the same (apart from having one release PR per crate). Keeping stuff compatible between rust-bio, rust-bio-types, and rust-htslib already requires some additional work sometimes. I just fear that this will be a lot more if we end up having 20 sub-crates here or so. But if there is a fully automatic, release-please based solution that we now do not have to develop from scratch, I still think we could do it.

@allan2 if above questions end up being convincingly answered, I am happy to have rust-bio migrated. I guess an initial PR would suggest a separation into sub-crates in the same mono-repo, while keeping a main bio meta-crate. Would you be willing to create such a PR, so that we have something concrete to look at?

By separating bio into several crates you lower the threshold for new contributions. I mentioned before that I have released a couple of crates that mostly scratched my own itch as standalone on crates.io that might be useful for other rust-bio users. Others have probably done the same. These crates kinda lack exposure right now and I would be happy to move these crates under the rust-bio umbrella (if accepted). The developer gets access to better reviews, maybe some bug reports, and it increases the bus factor, which also helps adoption. The developer stays the preferred committer/maintainer of the repo, although (a group within) the organization should also have access.

rust-bio should be the organization that enables these collaborations and this way you put the responsibility into the hands of more maintainers. It should not be the playground of a few individuals that happen to have merge/commit rights.

Personally I would advise against the meta crate. I would never use it even if it was available. Same reason you don't write use std::collections::*

Also rust-bio-tools? That is an odd collection of functionality and looks a lot like a personal project. I think only libs would be a better fit for this organization. It could be moved to another org, but it should be split at the least.

@veldsla I hear you, that is indeed a very understandable reasoning, and also kind of suggests to actually separate the repos, at least at some point, in order to better see where individual issues or PRs belong. Ok, maybe the meta-crate is indeed not needed. Since in Rust people usually pin the versions of dependencies, there is not really an issue if a certain crate suddenly does not receive updates. We could just have a final release of the bio crate that empties docs and code and adds a note that the bio crate has been split up, linking to the homepage, which should ideally contain an automatically updated summary of all crates under the hood.

Regarding rust-bio-tools, no this won't change. The main purpose of having everything in rbt under one single crate is that it becomes very easy and fast to quickly add a new functionality. Since this is not really meant as a crate that should be used by other rust packages, there is no need to split it up.

So, what remains now is to decide which crates exactly we should split into. I would propose something myself, but I am too busy at the moment. So if somebody wants to make a proposal here, please go ahead. The only rule would be that each resulting crate name should start with "bio-".

Now that rust-bio is 1.0, is this plan still alive?
A large part of my build time is waiting ~10 seconds for nalgebra to build, which is pulled in via:
bio -> bio::stats::hmm -> statrs -> nalgebra.

Other slow dependencies are:

  • nalgebra: 14s (via statrs, used for bio::stats::hmm)
  • regex-syntax: 9s (via bio-types -> regex)
  • triple_accel: 9s (via bio::alignment::distance, for hamming/edit distance computations)
  • regex: 8s
    • via bio-types, in bio::annot::{contig,pos,spliced}
    • via bio::io::gff which contains one regex
  • simba: 8s (via statrs -> nalgebra)

Of these slow dependencies, my current project only depends on triple_accel.

Putting some of these behind default-enabled feature flags sounds like a relatively straightforward way to get some improvements in a backwards compatible way.

Then as a second step the implementations could be migrated to standalone crates.

Anyway, I think if we do this, an incremental approach is the easiest to take, where we start by feature-gating a few big slow dependencies one at a time.
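A rough sketch of what gating one of these slow dependencies could look like, taking the statrs -> nalgebra chain as the example. The feature name `hmm`, the version placeholder, the `BACKEND` constant and the helper `hmm_enabled` are all assumptions for illustration, not existing rust-bio names.

```rust
// Cargo.toml side of the sketch:
//
// [dependencies]
// statrs = { version = "<current version>", optional = true }
//
// [features]
// default = ["hmm"]      # existing users keep today's behaviour
// hmm = ["dep:statrs"]   # statrs, nalgebra and simba only build when this is on

pub mod stats {
    /// Only compiled when the `hmm` feature is enabled, so the optional
    /// `statrs` dependency is only needed in that case.
    #[cfg(feature = "hmm")]
    pub mod hmm {
        // The real module would `use statrs::...` here.
        pub const BACKEND: &str = "statrs";
    }

    /// Reports whether the HMM module was compiled into this build.
    pub fn hmm_enabled() -> bool {
        cfg!(feature = "hmm")
    }
}
```

Because the feature is in `default`, this stays backwards compatible. A project that only needs triple_accel could opt out with `default-features = false`, and the same pattern could then be repeated for the next slow dependency.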