Roadmap
darsnack opened this issue · 11 comments
Prompted by discussions on Discourse and Slack, I think we sorely need a roadmap issue for FluxML. This issue will serve as that roadmap; the hope is that we build it together. This roadmap is the BDFL (and bors).
Some things are technical, some things are organizational. Feel free to suggest more tasks. If you think something should not be on the list, then suggest that too.
Governance
We don't seem to have a clear governance model. Officially, we follow ColPrac, and I think if we lean into it, it can be a sustainable model for the org. The contributing guide seems like the correct place to document this.
- More comprehensive contributing guide (#1824 seems good)
- List the "core maintainers" on the README so it's clear who to ping on Slack/Discourse?
- Agree upon some way to move past blockers (e.g. a simple majority of maintainers being in favor of X)
- Actually use the bi-weekly calls to come together and work through issues (a stronger culture of communicating frequently would be nice too)
- Organize our issues and PRs so it is easier to contribute
Technical
This isn't meant to be a comprehensive list, but it should detail our top priorities for what to work on when we aren't squashing bugs.
- Make Flux AD-agnostic: See Chris's post if you need convincing, or search for "Zygote" and "bug" on any forum.
- Better CI (i.e. benchmarking Metalhead.jl, downstream package testing, JuliaFormatter): We need to be able to trust the tools in order for PRs to be merged quickly and bugs to be caught before releases. The model-zoo frequently goes stale. Maybe it should be repurposed to be a set of examples + regression tests. This doesn't need to be a fancy solution with a webpage. Just a bot that responds to GitHub PR comments would be a huge improvement.
- Documentation overhaul: We need to go through our docs from cover to cover and see if they make sense and flow. Too many users walk away because the solution to their problem is buried in a bad documentation structure.
- Multi-GPU training: There are a couple of options right now, but we need to think about how we wrap them up in a user-facing API. Probably in a separate package from Flux itself.
- Pre-trained models: This will likely require the previous point, but ONNX.jl is another promising option here.
- Make Flux explicit gradient-first: This is a broad goal to capture many sub-tasks like getting Optimisers.jl release-ready, figuring out tied weights, getting ready for Diffractor.jl, etc.
- Think really hard about RNNs
- Better benchmarking and performance improvements on baseline architectures: This is quite a large and multi-faceted task, but things like the very high memory consumption that, AFAIK, we suffer from in some basic settings are showstoppers for training large vision and NLP models.
- Gradient correctness: We should track down and fix all situations where Zygote silently yields incorrect gradients.
I explicitly started with a short list, because some of these tasks like CI need to be dealt with first before we can reliably tackle anything else.
I'd also ask that we try to be honest and constructive here. Nothing above is totally solved, so commenting on what's left is more helpful than "oh that's not an issue because X does 70% of it and the rest is easy." If the rest is easy, then open a linked issue detailing what the rest is so that someone can tackle it.
Lastly, let's limit comments to Flux maintainers for the most part. Anyone is welcome to suggest stuff that we are missing of course, but it would be good to pre-start the list with comments from folks who have been working on the packages for some time.
@DhairyaLGandhi @CarloLucibello @mcabbott @ToucheSir @lorenzoh @ChrisRackauckas @logankilpatrick
With regard to "Better CI (i.e. benchmarking Metalhead.jl, downstream package testing, JuliaFormatter)":
I think this might synergize really well with some form of ML leaderboard/benchmark. There are multiple datasets which can be trained with multiple models, and those can themselves be trained with multiple optimisers. If the interfaces are well defined, you could just import datasets, models and optimisers and mix and match without having to change any code. This would allow us to create something like this: https://paperswithcode.com/sota/image-classification-on-mnist but instead of git repos, it would link Julia packages with the models/optimisers. And you could simply install these models without having to adapt their code. At the same time, all those models would act as a smoke test for changes to Flux.jl. Simply install a bunch of model packages and see whether or not training still works.
And since those models are created by other people and only dynamically included for testing, they would not go stale like a manually maintained model zoo.
That being said, this is probably more of a long-term thing and, as you said, no fancy webpage is necessary. But we could keep it in mind as a growth target when coming up with solutions, so that they can grow in that direction.
I looked at different optimisers in my master's thesis and just started a PhD without a clear goal yet. I think I want to continue to look at optimisers. For that I will need to benchmark them, so I will probably reimplement some models from other people. Might as well do that for you if you want. I also have some experience with CI and testing, although not as much in Julia specifically. Would love to contribute something.
Yeah any help achieving these goals is greatly appreciated. Take a look at FluxBench.jl. It's probably a good starting point for this. Feel free to ping us on Slack or Zulip as needed as well.
#1866 has nothing to do with bors; there were failing tests and #1866 was trying to figure out why. It ultimately turned out to be inaccuracies in cuDNN. If anything, it seems bors inadvertently stopped us from merging hacks which would have been brittle.
Re "Simply install a bunch of model packages and see whether or not training still works": that's part of reverse CI and benchmarking.
The issue with FluxBench is not technical; see FluxML/FluxBench.jl#15, where I am trying to run with newer Zygote and CUDA. It is straightforward to add a script which runs these on a cron-job-like basis, but that clogs up the benchmark queue. I actually recently spoke with @maleadt to see if it's alright to run it at a constant cadence, and it seems like we can do it within reason. You can see the errors that upgrading Zygote is producing.
Think really hard about Transformers and ViT too.
What are the particular issues with Transformers?
Great list, all those items are very relevant. I would add a couple of points:
- Better benchmarking and performance improvements on baseline architectures. This is quite a large and multi-faceted task, but things like the very high memory consumption that, AFAIK, we suffer from in some basic settings are showstoppers for training large vision and NLP models.
- Gradient correctness. We should track down and fix all situations where Zygote silently yields incorrect gradients.
As for the docs, it would be good to include some frequent questions and workarounds from Discourse, e.g. using gradients in the loss function.
For the pre-trained models, maybe we can leverage some of the great work @dfdx is doing on ONNX.jl and get the weights and architectures from https://github.com/onnx/models. Also being able to load weights from HuggingFace for some relevant models would be great.
RE Hugging Face, I'd be remiss not to point out @logankilpatrick's tutorial and @chengchingwen's work on https://github.com/FluxML/HuggingFaceApi.jl :)
@DhairyaLGandhi Regarding JuliaBench.jl and https://speed.fluxml.ai/, I'd be happy to take on the duty of bringing the benchmarks to a Stipple.jl dashboard. I think this would help bring a more Julian aesthetic, and I would like to first focus on lightweight visibility into both speed and memory consumption for both common layers and architectures (MLP, ResNet, RNNs...).
One improvement for Flux development would be integration with a comment bot so that the benchmarks can be invoked on PRs to see performance differences. This would require being able to invoke the benchmarks with user-specified versions for Flux, Zygote, NNlib, etc.
A Stipple dashboard would be awesome! The comment bot can be invoked with @ModelZookeeper and it can go to FluxBot.jl. It's set up and works with Buildkite as well.
This issue is two years old. Can someone give an update about the current state?
The action items are broad enough that none of them is complete. It's mostly incremental progress using the (very limited) time we have as maintainers. For example:
- Make Flux AD-agnostic: Any layer that doesn't have differing behaviour between train/test is AD agnostic now. For those that do, manually toggling modes still works. We even have some degree of support for Enzyme.jl now, which is nice.
- Better CI (i.e. benchmarking Metalhead.jl, downstream package testing, JuliaFormatter): Downstream testing is pretty pervasive but could be added to more of the smaller packages under the org (along with CI config harmonization in general). https://github.com/FluxML/FluxMLBenchmarks.jl was a pretty successful GSoC project but needs a bit more attention to get it over the finish line (contributions welcome). Those who've seen this year's GSoC page may have noticed that one of the projects touches on the point about Metalhead, so please spread the word. Getting JuliaFormatter across all repos may be the easiest win, but someone needs to do the work of testing it and finding a config that works for most maintainers. Not hard, but requires a bit of perseverance.
- Documentation overhaul: This has been mostly piecemeal and some structural changes were made for the better. Ideas on better handling docs are very welcome, with the caveat that I'm expecting complete PRs to come with them.
- Multi-GPU training: Data parallel is finally back on the table now that https://github.com/juliagpu/nccl.jl is back, and of course there's model/pipeline parallelism. This is mostly an "if you have this problem, help us fix it" kind of deal, because most contributors do not appear to need or want multi-GPU support right now.
- Pre-trained models: Primarily relevant for Metalhead. We're mostly concerned with ensuring existing models with weights stay working, so extra help would be required to add weights for new ones. Thankfully the process is documented now, so have a look at that if you're interested.
- Think really hard about RNNs: We thought really hard, then came up with #2258, FluxML/Fluxperimental.jl#7 and #2316. At this point I think RNNs just need a champion who can push the implementation the rest of the way. With the design mostly figured out, what remains is to ship something we can kick the tires on.
- Better benchmarking and performance improvements on baseline architectures: The first part was covered by FluxMLBenchmarks.jl and I'll just point to what I said above about that. The second part is an open invitation for anyone who has an idea about perf improvements at the high or low level. For example, the CPU conv routines in NNlib could be optimized quite a bit more than they are now.
- Gradient correctness: Turned out to be a bit of a boil-the-ocean project, but what has been happening is incremental improvement of the Zygote test suite to improve coverage of the functions it claims to work with. Offloading rules to ChainRules and other packages has helped quite a bit here too.
The overarching theme among these updates is limited capacity. I can think of many improvements made over the past couple of years that would not have happened but for the effort of one, maybe two people outside of the core team. The reality of not being a funded project (small corp, big corp, academic or otherwise) is that contributions from the community are make or break when it comes to advancing the project beyond just keeping the lights on.
If you, the reader, use Flux and are quite comfortable with Julia development, this is a call to action to help improve the ecosystem. Reach out to us on Slack, Zulip or Discourse, in public or privately, if the idea sounds interesting or if you have ideas of your own. Also, did you know we have a bi-weekly meeting on Fridays? Check it out on the community calendar. Until then, hopefully this update was helpful.