apache/airflow-site

Move out older versions of docs to a new store

Closed this issue · 17 comments

We recently ran out of space for the documentation build, and there is a thread about it here: https://lists.apache.org/thread/gh17psdn39s1o05lwxfqvvn5htjtqs05

Find out a way to move older versions of the documentation to a new place and redirect links to that store.

Previous context:
When we ran out of space last time, as an immediate measure, we removed the old versions from the docs-archive folder with PR #740, but that was later reverted in #742, as we were able to reclaim some disk space for the time being with the help of PR #741.

We're exploring an approach to move older versions of the docs to a newly created https://github.com/apache/airflow-site-archive repository (thanks to @potiuk for creating this repository) and we will keep those in the github-pages branch for the docs to be published. We need to figure out how to handle the redirects back and forth with respect to choosing the version from the drop-down.

When I run ./site.sh build-site after following the steps outlined in README.md and installing Hugo with brew install hugo, I get the error below and am currently stuck resolving it:

WARNING in asset size limit: The following asset(s) exceed the recommended size limit (244 KiB).
This can impact web performance.
Assets:
  chunk-4.3d5f5.js (1.53 MiB)
$ cross-env HUGO_ENV=production hugo -d ../dist -s site -v
Start building sites …
hugo v0.111.3+extended darwin/arm64 BuildDate=unknown
INFO 2023/04/26 19:02:52 syncing static files to /
ERROR 2023/04/26 19:02:54 render of "page" failed: "/Users/pankajkoti/airflow-site/landing-pages/site/layouts/_default/baseof.html:23:7": execute of template failed: template: _default/search.html:23:7: executing "_default/search.html" at <partial "head.html" .>: error calling partial: execute of template failed: html/template:partials/head.html:15:17: no such template "_internal/google_news.html"
ERROR 2023/04/26 19:02:54 render of "page" failed: "/Users/pankajkoti/airflow-site/landing-pages/site/layouts/blog/baseof.html:23:7": execute of template failed: template: blog/single.html:23:7: executing "blog/single.html" at <partial "head.html" .>: error calling partial: execute of template failed: html/template:partials/head.html:15:17: no such template "_internal/google_news.html"
ERROR 2023/04/26 19:02:54 render of "page" failed: "/Users/pankajkoti/airflow-site/landing-pages/site/layouts/blog/baseof.html:23:7": execute of template failed: template: blog/single.html:23:7: executing "blog/single.html" at <partial "head.html" .>: error calling partial: execute of template failed: html/template:partials/head.html:15:17: no such template "_internal/google_news.html"
ERROR 2023/04/26 19:02:54 render of "page" failed: "/Users/pankajkoti/airflow-site/landing-pages/site/layouts/blog/baseof.html:23:7": execute of template failed: template: blog/single.html:23:7: executing "blog/single.html" at <partial "head.html" .>: error calling partial: execute of template failed: html/template:partials/head.html:15:17: no such template "_internal/google_news.html"
Error: Error building site: failed to render pages: render of "page" failed: "/Users/pankajkoti/airflow-site/landing-pages/site/layouts/blog/baseof.html:23:7": execute of template failed: template: blog/single.html:23:7: executing "blog/single.html" at <partial "head.html" .>: error calling partial: execute of template failed: html/template:partials/head.html:15:17: no such template "_internal/google_news.html"
Total in 2060 ms
error Command failed with exit code 255.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
error Command failed with exit code 255.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

I think this is caused by a bad Hugo version. Look at our CI for the version it uses. I think I had a very similar issue when I tried to build the docs on a Mac, and I could not solve it at the time. One way of solving it was to use a Debian-based Docker container to build the docs. You could potentially use the Breeze image for it (for example, if you check out the sites in the "files" folder), but that might suffer from another problem: the slow filesystem mounted into Docker on Mac.

Going Linux/Debian first (maybe using a remote build machine for it) is likely the fastest way to solve the problem.

There is also an idea to modernise our build pipeline for the sites, which could solve the problem.

Yes, it looks like the template was removed: https://discourse.gohugo.io/t/page-render-error/43594

Our CI seems to use Hugo version 0.91.2 as per https://github.com/apache/airflow-site/actions/runs/4253003040/jobs/7397288933, so I am installing that version now with
go install -tags extended github.com/gohugoio/hugo@v0.91.2 and will see how it goes from there.
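A quick way to catch this kind of mismatch before building is to compare the locally installed Hugo version with the one CI uses. The following is only a sketch: the function names and the REQUIRED_HUGO_VERSION variable are hypothetical, and the pinned value 0.91.2 is taken from the CI run mentioned above, so it needs updating if CI changes.

```shell
# Version used by CI according to the linked run (assumption: still current).
REQUIRED_HUGO_VERSION="0.91.2"

# Extract "X.Y.Z" from a `hugo version` line such as
# "hugo v0.111.3+extended darwin/arm64 BuildDate=unknown".
parse_hugo_version() {
  printf '%s\n' "$1" | sed -E 's/.* v([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# Return non-zero (with a message on stderr) if the installed Hugo differs.
check_hugo_version() {
  installed_hugo="$(parse_hugo_version "$(hugo version)")"
  if [ "$installed_hugo" != "$REQUIRED_HUGO_VERSION" ]; then
    echo "Hugo $installed_hugo found, expected $REQUIRED_HUGO_VERSION" >&2
    return 1
  fi
}
```

Running check_hugo_version && ./site.sh build-site would then refuse to build with the Homebrew-installed 0.111.3 shown in the log above.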

That helped, @potiuk, thank you! The site was built and I can load the index page; however, the docs are not loading. Do I need to do something additional while building?
Screenshot 2023-04-26 at 7 37 56 PM

Screenshot 2023-04-26 at 7 38 06 PM

cc: @jedcunningham @phanikumv

I do not know the system that well. I think that, at least in some cases, building the docs history has been skipped to reduce the size of the generated images. Look at the CI steps; I think there was even a comment about it.

I got a chance today to read more about our setup in the repo and studied the CI build.yml and the site.sh script.

Understanding so far

The build jobs create a dist folder when we run the ./site.sh build-site command. The dist folder is roughly 10.3GB on the current main branch when I build it locally. The docs folder inside dist accounts for almost all of that and itself reads ~10.3GB at the moment, so all other directories occupy minimal space relative to the docs folder.

du on root folder
Screenshot 2023-04-27 at 7 44 06 PM

du on the dist folder
Screenshot 2023-04-27 at 7 44 30 PM

du on the dist/docs folder
Screenshot 2023-04-27 at 8 17 37 PM
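For anyone who wants to reproduce these numbers without the screenshots, something along these lines should work. It is a sketch only: largest_subdirs is a hypothetical helper name, and sizes are printed in KiB (du -sk) so that a plain numeric sort works on both Linux and macOS.

```shell
# Print the size in KiB of each immediate subdirectory of $1,
# largest first, limited to the top $2 entries (default 10).
largest_subdirs() {
  du -sk "$1"/*/ | sort -rn | head -n "${2:-10}"
}
```

For example, largest_subdirs dist 5 shows whether docs dominates the build output, and largest_subdirs dist/docs 20 lists the biggest packages inside it.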

GitHub runners guarantee at least 14GB of disk space for runs (see actions/runner-images#2840 (comment)).

What I understood is that when we create a PR, a line in our CI build removes the docs folder before proceeding to the next steps, and as a result the build job for PRs would hardly ever fail.
But when we merge the PR to main, this huge docs folder is not removed, and when we tried to deploy the website in this step:

- name: 🚀 Deploy website on asf-site branch

the "Deploy website on asf-site branch" GitHub Actions job failed with an out-of-disk-space error while copying the dist folder to the gh-pages branch of our repository using the wrapper action apache/airflow-JamesIves-github-pages-deploy-action (the base action is https://github.com/JamesIves/github-pages-deploy-action).

I believe our website is deployed from the gh-pages branch and all the content that is available in it gets published as per my chat with ChatGPT :)

Solution Proposal (Theory)

We can replicate the setup including the CI and files/folder from this repo into https://github.com/apache/airflow-site-archive with the following tweaks.

  1. Split the files in our docs-archive folder (which becomes the docs folder during the build and occupies this huge ~10.3GB) and copy part of them to the new repo using one of the approaches below:
    a. Keep certain providers in this repo and the remaining providers in the new repo, based on how much space each provider occupies, as can be seen in the screenshot above for the dist/docs directory
    b. Keep all providers in both repos but split them by version, meaning keep the latest versions here and the older versions in the new repo
  2. Have the site build / CI build generate the dist folder with only the docs we plan to keep in each repo.
  3. Set the target repository for the build job in the new repo to point to the gh-pages branch of this repo. From reading the options for the action, I believe we can set repository-name, with the needed token, in the new repo so that it points to this repo.
  4. Ensure that
    CLEAN: true # Automatically remove deleted files from the deploy branch
    is set to false in this repo, as otherwise the docs that are in the new repo but not in this repo will be cleaned out when CI runs in this repo on merge to main. Alternatively, set clean-exclude (again, based on the options available in the GitHub action) in this repo's CI build so that files belonging to the new repo are not cleaned.

With the above steps, I believe we will be able to have all the docs in this same repo's gh-pages branch and we would not need to worry about additional changes in the JS/CSS files of the repo, handling redirects, etc.
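To get a feel for option 1b, the older versions of a package could be listed before anything is moved. This is a rough sketch under explicit assumptions: it assumes a docs-archive/<package>/<version>/ layout with version directories that start with a digit (entries such as "stable" are skipped), the helper name list_old_versions is hypothetical, and sort -V (version sort) must be available in the local sort implementation.

```shell
# List the version directories of one package that would move to the
# archive repo, keeping the newest $2 versions in place.
#   $1: package directory (assumed layout: <package>/<version>/)
#   $2: number of newest versions to keep
list_old_versions() {
  ls -1 "$1" | grep -E '^[0-9]' | sort -rV | tail -n +"$(($2 + 1))"
}
```

Calling list_old_versions docs-archive/apache-airflow 5 would then print every version except the five newest, which is a dry run that moves nothing.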

Next steps

The above proposal is all theory based on my understanding so far, and I would like to hear opinions on it. In particular, I would like to hear whether someone already knows if this approach makes sense and is feasible, or whether there are issues or blockers.

I would really appreciate your time in reading this comment, and any pointers on who else we could reach out to for feedback or additional expert advice.

@potiuk @jedcunningham @phanikumv @mik-laj

Since build-site is copying stuff from docs-archive into dist, and we upload dist, can we not just delete docs-archive once we are done building? That'll give us 10GB of extra room without having any negatives or extra complexity?

ashb commented

Did we talk about keeping all the old versions somewhere else than the main branch (separated detached branches in this repo?)

> Since build-site is copying stuff from docs-archive into dist, and we upload dist, can we not just delete docs-archive once we are done building? That'll give us 10GB of extra room without having any negatives or extra complexity?

Yes, @jedcunningham. I tried your suggestion and have created a draft PR.

I pushed 2 commits to display the output of the disk-free CLI command (df -h) after each significant step in our CI job: bf25b69, which only adds df -h to the existing setup, and 6bc8ab7, which shows the df -h output after removing the docs-archive directory.
CI job with first commit - https://github.com/apache/airflow-site/actions/runs/4830163412/jobs/8606043658?pr=777
CI job with second commit - https://github.com/apache/airflow-site/actions/runs/4830346989/jobs/8606461864?pr=777

I did a search in the repo and found no other reference to docs-archive after the site is built, so I believe we're safe to remove it in the CI job once the site is built. Placing this removal step before the Deploy website on asf-site branch step should ensure (🤞🏽, thanks to the 10GB+ reclaimed) that the CI build does not fail for lack of disk space, as it did in that step last time.
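In shell terms, the added CI step amounts to roughly the following. This is a sketch rather than the exact content of the PR: reclaim_space is a hypothetical helper name, and the real workflow may simply inline these commands.

```shell
# Remove a directory that is no longer needed and print free disk
# space before and after, so the gain is visible in the CI log.
reclaim_space() {
  echo "Before removing $1:"
  df -h .
  rm -rf -- "$1"
  echo "After removing $1:"
  df -h .
}
```

It would run right after ./site.sh build-site, as reclaim_space docs-archive. The search for leftover references mentioned above can be repeated with grep -rn "docs-archive" --exclude-dir=docs-archive --exclude-dir=.git . from the repository root.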

I am attaching the outputs (both PDF and PNG; you will need to zoom in unfortunately since I did a full-page capture) of the CI job with the 2 commits mentioned above.

PDF output:
first commit: before_pdf.pdf
second commit: after_pdf.pdf

PNG output:
first commit:
Before
second commit:
after

If this solution is accepted, I guess this is a quick win for us for now :)

> Did we talk about keeping all the old versions somewhere else than the main branch (separated detached branches in this repo?)

yes @ashb, thank you for your comment. Jed had suggested this idea too and then we decided to try the other repository approach first. I am sorry I don't remember the reasoning but maybe @jedcunningham can tell more about his thought process.

Yeah, that's more or less what I was thinking, but I hadn't quite connected the dots. The good news is I don't think we need to worry about any extra complexity now 🍺.

Yeah. I like it too, though it would be great to modernize things a bit as well :) - but I agree if we can NOT involve another repo/redirection we seem to be good for now.

What we can do though: we could potentially rewrite the history for the whole airflow-site and keep maybe the last few commits? And we could do it periodically. That would leave us even more space, I think, and all the operations on the repo would be much faster (for now, just getting my liquidprompt to show the version takes visible time).
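One simple variant of this idea collapses a branch into a single fresh commit by recreating it as an orphan branch. The sketch below makes several assumptions: squash_branch_history and the temporary branch name squashed-tmp are hypothetical, it keeps one commit rather than "the last few", and it only rewrites the local branch. Publishing the result would still need a force push and agreement from everyone with a clone, since it is destructive.

```shell
# Replace the history of branch $1 with a single commit that contains
# the branch's current tree. $2 is an optional commit message.
squash_branch_history() {
  # New parentless branch; index and working tree are taken from $1.
  git checkout --orphan squashed-tmp "$1"
  # The index already holds the full tree, so this creates the root commit.
  git commit -q -m "${2:-Squash history}"
  # Rename the temporary branch over the old one (forced, local only).
  git branch -M squashed-tmp "$1"
}
```

After squash_branch_history main, git rev-list --count HEAD reports a single commit while the working tree is unchanged.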

I'd be 100% on board if squashing it all helps. It's become really painful!

Okay, thanks a lot for the suggestions, feedback and go-ahead! I will create a first PR with this approach of removing the docs-archive directory, and we can iterate again later. I will check and try creating another PR later for squashing the commits (keeping only the latest commits), as suggested by Jarek.

PR #777 was merged, which reclaims an additional 10GB+ of space for the failing CI build job step, so we do not need to move the docs out to the new repo in the near future. Closing this issue for now.

@potiuk Can we keep the new repo or do we need to archive airflow-site-archive?