TuringLang/docs

Manifest.toml files are outdated

Closed this issue · 11 comments

The Manifest.toml files on the master branch still pin Turing 0.32.0, which leads to errors such as this UndefVarError: https://turinglang.org/docs/tutorials/docs-12-using-turing-guide/#maximum-likelihood-and-maximum-a-posterior-estimates

[[deps.Turing]]
deps = ["ADTypes", "AbstractMCMC", "Accessors", "AdvancedHMC", "AdvancedMH", "AdvancedPS", "AdvancedVI", "BangBang", "Bijectors", "DataStructures", "Distributions", "DistributionsAD", "DocStringExtensions", "DynamicPPL", "EllipticalSliceSampling", "ForwardDiff", "Libtask", "LinearAlgebra", "LogDensityProblems", "LogDensityProblemsAD", "MCMCChains", "NamedArrays", "Printf", "Random", "Reexport", "Requires", "SciMLBase", "SpecialFunctions", "Statistics", "StatsAPI", "StatsBase", "StatsFuns"]
git-tree-sha1 = "cfb3b446a5e52e1da4cc71b77a9350c309c581f0"
uuid = "fce5fe82-541a-59a6-adf8-730c64b5f9a0"
version = "0.32.0"

I notice that there used to be a weekly GitHub Action to update the manifests, which was removed as part of #441. So a quick fix would be to re-enable that workflow and run it.
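For reference, the core of such a weekly job is essentially a Pkg.update() over every notebook environment. A minimal sketch, assuming the per-notebook environments live in subdirectories of tutorials/ (the exact layout is an assumption on my part):

using Pkg

# Refresh the Manifest.toml of every per-notebook environment.
# Assumes each environment is a subdirectory of tutorials/ with a Project.toml.
for dir in filter(isdir, readdir("tutorials"; join=true))
    isfile(joinpath(dir, "Project.toml")) || continue
    Pkg.activate(dir)
    Pkg.update()  # resolve to the latest compatible versions and rewrite Manifest.toml
end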

However, I think a more robust solution for the long term would be to:

  1. Use a single environment for building all the docs, rather than one environment per notebook. I don't think there is any use case for having separate environments, and even if there were, I don't think it would be the right thing to do: it would be very confusing to the user if the code on separate pages could not be run in the same environment.

  2. Pin the version of Turing with a [compat] block in the Project.toml file (on both the master and backport branches)

  3. In the main documentation workflow (using on: push), add a check to make sure that the version of Turing specified in Project.toml matches the version specified in _quarto.yml (see the sketch below)

  4. Gitignore the manifest, only regenerating it when building the docs. IMO it makes more sense to let Julia resolve to the latest compatible versions (e.g. in order to use the latest bugfix version of Turing; and after all, a user reading the docs and running the code is, in all likelihood, going to fetch the latest versions anyway instead of looking for the manifest in this repo). Also, this spares us this issue of having to update the manifest when a new version of Turing is released.

Overall, this proposal means that when a new version of Turing is released, only three files (_quarto.yml, Project.toml, and Manifest.toml) need to be changed; and GitHub will tell you right away if you forget to change one of them. (Of course, the list of old versions will also need to be updated, but that is a separate matter, unrelated to this issue.)
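To make (3) concrete, here is a rough sketch of such a check. The compat entry from (2) can be added with e.g. Pkg.compat("Turing", "0.33"), and the regex below assumes that _quarto.yml records the version as version: "x.y.z", which may need adjusting to the actual layout:

# Sketch of a CI check: fail if the Turing version pinned in Project.toml's
# [compat] section does not match the one declared in _quarto.yml.
using TOML

project = TOML.parsefile("Project.toml")
compat = get(get(project, "compat", Dict()), "Turing", nothing)
compat === nothing && error("No [compat] entry for Turing in Project.toml")

quarto = read("_quarto.yml", String)
m = match(r"version:\s*\"?v?([0-9.]+)\"?", quarto)  # assumed _quarto.yml format
m === nothing && error("No Turing version found in _quarto.yml")

compat == m.captures[1] ||
    error("Project.toml pins Turing $compat but _quarto.yml says $(m.captures[1])")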

I've done (1), (2), and (4) in #497, happy to work on (3) and include it in that PR if this makes sense.

I've done (1), (2), and (4) in #497, happy to work on (3) and include it in that PR if this makes sense.

Thanks, @penelopeysm, for the nice improvements. Yes, please include (3) in this PR if you like. Alternatively, we are happy to address (3) in a separate PR.

As an improvement, it might make sense to implement a workflow in the Turing.jl repo that updates this repo's Project.toml and _quarto.yml through a PR whenever a new version of Turing is released.
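For example, the release workflow could check out this repo and run a small script along these lines before opening the PR (the _quarto.yml pattern and the way the new version is passed in are assumptions):

# Sketch: bump the Turing version in Project.toml's [compat] and in _quarto.yml.
# Note that TOML.print drops comments and may reorder entries, so a targeted
# text replacement may be preferable for Project.toml as well.
using TOML

new_version = ARGS[1]  # e.g. "0.34.0", passed in by the workflow

project = TOML.parsefile("Project.toml")
project["compat"]["Turing"] = new_version
open(io -> TOML.print(io, project), "Project.toml", "w")

quarto = read("_quarto.yml", String)
# Assumes the version appears in _quarto.yml as `version: "x.y.z"`.
write("_quarto.yml", replace(quarto, r"version:\s*\"[0-9.]+\"" => "version: \"$new_version\""))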

I'm not sure about (1). The problem with a single environment is that it contains many dependencies, and in particular pulls in the SciML stack, even though most examples need only a few dependencies. This bloated environment leads to unnecessarily long precompilation, and it is easy to run into version conflicts between dependencies, which hold dependencies back at unnecessarily old versions or make them completely incompatible. So the reason for the individual environments is basically the same as for the general recommendation in Julia to use project-specific environments instead of the global environment.

@devmotion Sorry for the slow reply! I've been travelling most of last week.

This bloated environment leads to unnecessarily long precompilation

That's what I had feared, but building the entire set of docs is actually faster with a single env. I couldn't tell you why, but here it is 1 hr 44 min with a single env and the last build on master with separate envs was 2 hr 8 min.

In the case where someone is developing locally and editing only one notebook, it's true that precompilation will take longer, but I think that time is negligible compared to the time required to run the examples on most pages of the docs. Precompilation for the single large env takes 10 secs on my M1 MacBook Pro.

it is easy to run into version conflicts between dependencies, which hold dependencies back at unnecessarily old versions or make them completely incompatible

Browsing the logs, I don't think this is a major problem yet. I do recognise that it's a drawback!

The flip side is that having a single env ensures that there is consistency between all parts of the docs. If each page has its own env, then the result of running the same function could potentially be different on different pages. I don't know if this is a problem in practice either; I'd suspect not, but in principle I don't like the possibility of that.

the general recommendation in Julia to use project-specific environments instead of the global environment

I agree, but I think it comes down to whether we treat each individual notebook as a project, or the entire documentation as a project. I'd vote for the latter, but am open to being convinced otherwise.

In any case, I'll work on adding a check to ensure consistency between _quarto.yml and Project.toml (or the Project.tomls, if we stick to multiple envs).

That's what I had feared, but building the entire set of docs is actually faster with a single env. I couldn't tell you why, but here it is 1 hr 44 min with a single env and the last build on master with separate envs was 2 hr 8 min.

I assume this could be caused by different environments using different versions of the same package (which increases the number of packages that have to be precompiled). On CI, I think you would generally want to cache packages, artifacts, the precompilation cache, etc. (using e.g. https://github.com/julia-actions/cache); then installation and precompilation should be much faster and would only be triggered for new package versions.

I also lean slightly towards using a global environment for all docs. This forces us to think about which deps should be used and also maintains consistency of the packages used across all doc pages.

cc @mhauru and @willtebbutt for more comments.

I don't have a strong opinion on this because I've not done it a lot. I'm also concerned about having a single massive environment, for the same reason as @devmotion, but I'm not opposed to giving it a go and seeing how it pans out (I can definitely see the advantages).

No strong preference from me either, I can see the benefits of both.

I don't know how Quarto works, but might there be a possibility to use a stacked environment? We would have one global environment that defines the Turing.jl version and maybe some other core packages, and then local environments that include e.g. tools needed only on a particular page. This might get overly fancy and complicated, depending on how easy it would be to get Quarto to do this.

might there be a possibility to use a stacked environment

One can specify environment variables and command-line arguments for the Julia process that the native Julia engine of Quarto uses for execution, so stacked environments are possible. The fundamental limitation, though, is that there's no support for stacked Manifest files in Julia/Pkg (which would seem a bit weird anyway), so one would have to drop the Manifest.toml files completely and work only with Project.toml files (probably with compat entries). Removing the Manifest files and working with Project.toml files also has some advantages, so it's not a problem per se. One would have to think about how users can reproduce the code examples, though, since providing two Project.toml files doesn't seem very user-friendly and the Manifest file simplifies reproduction.
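For illustration, a stacked setup could look roughly like this (the paths are hypothetical); equivalently, the Julia process launched by Quarto could be given JULIA_LOAD_PATH="@:/path/to/shared:@stdlib":

# Sketch of a stacked environment: the per-page environment is primary, and a
# shared docs environment (pinning Turing and other core packages) sits lower
# in the load-path stack. Paths are hypothetical.
using Pkg
Pkg.activate("tutorials/some-page")          # per-page Project.toml
push!(LOAD_PATH, joinpath(pwd(), "shared"))  # shared Project.toml becomes loadable too
# `using Turing` now resolves via the stack, but Pkg operations (and any
# Manifest) only apply to the primary, per-page environment.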

Okay, the PR #497 now uses a single big env. It's just a PR, of course, so things can be changed if we decide against it; but right now I sense that the balance is very slightly in favour of single-env over multi-env.

I suggested gitignoring the Manifest.toml above, but I'm having second thoughts about that right now. Opinions on that bit are welcome! If we keep the Manifest.toml file to ease code reproduction, I would just have to add another check in the GHA to make sure that the version of Turing is also up to date with the latest GitHub release (see the sketch after the lists below).

Generally, the pluses of keeping the Manifest:

  • easier to reproduce code examples
  • avoids rerunning all the code examples on GitHub runners when unimportant dependency bumps happen

Downsides:

  • have to make sure it's kept in sync with other files, otherwise there is the potential for it to be out of date (as this issue shows)

Equal:

  • have to regenerate docs when a new patch version of Turing is released. If the manifest is there, it has to be updated and a new commit pushed. If it's not there, the workflow still has to be manually triggered.
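That extra check could be a small script along these lines, assuming we keep the Manifest (the GitHub releases API endpoint is the standard one for TuringLang/Turing.jl):

# Sketch: compare the Turing version pinned in Manifest.toml against the
# latest release tag of Turing.jl on GitHub.
using Downloads, TOML

manifest = TOML.parsefile("Manifest.toml")
pinned = manifest["deps"]["Turing"][1]["version"]

io = Downloads.download(
    "https://api.github.com/repos/TuringLang/Turing.jl/releases/latest", IOBuffer())
m = match(r"\"tag_name\":\s*\"v([0-9.]+)\"", String(take!(io)))
m === nothing && error("Could not parse the latest release tag")

pinned == m.captures[1] ||
    error("Manifest.toml pins Turing $pinned but the latest release is v$(m.captures[1])")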

I suggested gitignoring the Manifest.toml above, but I'm having second thoughts about that right now. Opinions on that bit are welcome!

Let's keep the Manifest.toml file for better reproducibility. It is also relatively easy to manage now that we work with one single environment.