FAIRmat-NFDI/nexus_definitions

Repository extremely large

rettigl opened this issue · 22 comments

This repository has grown extremely large over time (>1 GB to download when cloning).
The cause appears to be old docs artefacts: many copies of docs/mpes-refactor/.doctrees/environment.pickle, each larger than 50 MB, tons of copies of pdf/NeXusManual.pdf, and docs/mpes-liquid/_downloads/0d9b3db52a075e9d9b6a1a0457a842ba/nxdl_vocabulary.json. Why are they part of the repository?

Check with:

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
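To see which directories carry the bulk of the data, the per-blob listing can be summed per top-level directory. The following is only a minimal sketch: the function name `size_by_top_level_dir` is made up for illustration, and it assumes the input lines come straight from the `git cat-file --batch-check` format string used above (type, hash, size, path), before the `sed`/`cut` steps.

```python
from collections import defaultdict


def size_by_top_level_dir(batch_check_lines):
    """Sum blob sizes in bytes per top-level path component.

    Each line is expected to look like
    "<objecttype> <objectname> <objectsize> <path>".
    """
    totals = defaultdict(int)
    for line in batch_check_lines:
        parts = line.rstrip("\n").split(" ", 3)
        # Skip commits, trees and anything without a path.
        if len(parts) < 4 or parts[0] != "blob":
            continue
        size, path = int(parts[2]), parts[3]
        top_level = path.split("/", 1)[0]
        totals[top_level] += size
    # Largest directories first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Note that `%(objectsize)` reports the uncompressed object size, so the totals are an upper bound on what the packed repository actually stores.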

We (@mkuehbach and @sanbrock) inspected the situation and raise the following discussion points:

  • Clean up old branches. This should be done, but it is not the main reason why the repo is so large.
  • The CI action should be modified so that documentation is built and cleaned up properly for a given branch.
  • Suggestion: split fairmat-docs off from the nexus_definitions repo into its own repo, e.g. "nexus_documentation".
  • Add a CI action that pushes content from nexus_definitions to that nexus_documentation repo, so that GitHub Pages serves it properly.
  • Modify the CI pipeline to remove unnecessary Sphinx cache files, i.e. the ".doctrees" and "_sources" directories.
  • Initial tests on a mirror clone of the nexus_definitions repo, looping with the git-filter-repo tool over the branches that contain unnecessary Sphinx content, reduced the size from 1.25 GB to somewhere in the range of 150-200 MB.
  • Suggestion: remove the PDF file and modify the makefile accordingly, so that it runs make local (i.e. make prepare, make test, make html) instead of make complete.
  • Suggestion: eventually also remove the yaml files.
  • The question remains when to do the maintenance and when to force-push back.
  • @domna: shallow clone in the cleanup routine
  • @domna @sanbrock: external repo
  • @rettigl: we have a forked repo; we should check with NIAC on the status so as not to pollute their branches or ours.

One idea for having "non-history" branches:

I'm not sure if orphan is what we need. The github pages branch is actually an orphan branch, because it does not contain the history of the original repo. What we want is a branch which does not keep any history, just the latest state. Basically, we want a basic file system without any history 😅

Since this is an orphan branch, we might also just be able to delete this branch from the repo entirely and get rid of the history. That way we wouldn't even need to rewrite the git history.

Maybe we should just switch to https://about.readthedocs.com/?ref=readthedocs.org ? But afaik they don't have branch based deployment (but they version based on git tags).

I am wondering if we could just delete most of the content that Sphinx generates before pushing to the fairmat-docs branch. For building the website, we basically only need the html files and the static assets, right? So e.g. .doctrees (which actually contains the largest files) could just be removed. This doesn't get rid of the contents already in the git history, but going forward at least we don't add as much data with each commit.
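A cleanup step of this kind could be sketched as follows. This is a minimal sketch under assumptions, not the actual CI step: the function name `strip_sphinx_cache` is hypothetical, and it only removes directories named `.doctrees` unless other names are passed in.

```python
import shutil
from pathlib import Path


def strip_sphinx_cache(html_root, names=(".doctrees",)):
    """Delete Sphinx cache directories below html_root.

    Returns the removed directories as paths relative to html_root.
    """
    root = Path(html_root)
    # Collect all matches first so nothing is deleted while walking the tree.
    matches = sorted(p for p in root.rglob("*") if p.is_dir() and p.name in names)
    removed = []
    for path in matches:
        # A nested match may already be gone because its parent was removed.
        if path.exists():
            shutil.rmtree(path)
            removed.append(path.relative_to(root).as_posix())
    return removed
```

It would be called on the built html folder right before the deploy step, e.g. `strip_sphinx_cache("build/manual/build/html")`.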

Yes, this we should definitely do anyway. I actually thought this is what I did by just copying the built html folder (folder: build/manual/build/html). But it seems to also contain the build artifacts. I think the action has some cleanup option we can just use.

Another option I just came across: https://github.com/peaceiris/actions-gh-pages has a force-orphan option. When we deploy, we could first delete the docs/old-branch folder for all the branches that are not active anymore (using git rm -rf docs/old-branch), add the new docs for the branch that we are working on, and then deploy to the fairmat-docs branch with the force-orphan option. Not sure if that would work though.

This force-orphan is exactly what we want. Good catch! I think we don't even need to track the folders, because we also have a CI job which deletes the old folders when the branch is deleted (but currently they stay in the git history, of course). I think if we replace the current CI with this action and activate force-orphan, we should be good for the future. Then we just need to solve how to remove the old branch with its entire history.
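Finding the docs folders that belong to branches which no longer exist could be sketched like this. It is only an illustration: the function name `stale_doc_dirs` is hypothetical, and it assumes that every folder directly under docs/ is named exactly like its branch (so branch names containing a slash are not handled).

```python
from pathlib import Path


def stale_doc_dirs(docs_root, active_branches):
    """Return the names of docs folders that match no active branch."""
    active = set(active_branches)
    return sorted(
        entry.name
        for entry in Path(docs_root).iterdir()
        if entry.is_dir() and entry.name not in active
    )
```

The returned names are the candidates to delete before deploying; the list of active branches would have to come from the repository itself.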

Since this is an orphan branch, we might also just be able to delete this branch from the repo entirely and get rid of the history. That way we wouldn't even need to rewrite the git history.

If all the large files really are only in one branch, wouldn't it be sufficient to reset this branch to its first commit, add the latest version, and force-push to delete all the commits on this branch?

Yes, this is kind of my idea: we can just remove this branch and remove the large history with it. Git, however, sometimes still keeps the commits under certain conditions (there are some rollback options which keep them). But I think there is definitely a solution to this.

#268 helped to bring the repository size down to below 200 MB. We can now think about whether there is anything else we can remove to make it even smaller.

@lukaspie and @domna, great work, as it also implicitly addresses which of these many documentation builds we need to have on display. Two points:
1.) Apart from the ".doctrees" directory, the "_sources" directory can also be deleted. It is also part of the Sphinx build cache, mainly the *.rst.txt documents from which the html is generated.
2.) There are still a couple of legacy PDF documents that we could remove from our fork; instead we can point people to the original NIAC repo, which they are a part of.

I inspected the situation with the PDF files. It turns out these are remnants of intermediate work that we inherited from the original NIAC repo. There was a period in its history when the documentation was released into the same repo, including PDFs: sometimes not only NeXusManual.pdf (for which we only test that it compiles, without storing it) but also the so-called ImpatientGuide.
Indeed, an inspection of the commits related to *.pdf blobs shows that all of these blobs are referenced in commits between 2011 and 2022. https://github.com/nexusformat/definitions/tree/2dbe08fe is one such example, where pdf/ sits right in the top-level directory. This is why we are still carrying approx. another 100 MB of unnecessary copies with us.

@sanbrock @lukaspie we should propose this to NIAC and then remove these files. I am almost sure this payload stems 100% from NIAC times and is worth erasing from the fairmat branch. git-filter-repo, applied in a sandbox and instructed to remove all PDF blobs, reduces the repo to 50 MB. Given that there are also still some old publications in there, I think it is worth taking this final step so that definitions is finally blazing fast and clean for everybody.

I think a good idea would be to fork this repo and run the clean-up (with git-filter-repo) in the fork and then we can see what is the difference between this repo and the fork. Can you try that @mkuehbach? If it looks fine, we can then just do this for our own repo and afterwards suggest to NIAC that they do this themselves on their repo as well.

You mean that you would like the additional check and perspective of the GUI, via a regular PR from that cleaned fork on my own GitHub account towards our nexus_definitions (which is my fork's upstream repo), right?

Yeah, this may work. I haven't used git-filter-repo much, so I am not sure if an actual PR would work.

So I followed this suggestion:
https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository#purging-a-file-from-your-repositorys-history
for www.github.com/mkuehbach/nxdefs-cleaned

Specifically:
1.) Cloned the repo.
2.) Made a backup of .git/config.
3.) Rewrote the history using the following command:

git-filter-repo --invert-paths --path "pdf/" --path "legacy_docs/" --path "2010-05-10-workshop/" --path "workshop/" --path "misc/" --path "manual_archive/" --path "impatient/" --path "_static/" --path "_images/" --path "_sphinx/" --path "_sources/" --path "_downloads/"

4.) Ran git count-objects -vH: the size went down to 53.24 MB.
5.) Restored .git/config from my backup. As expected, filter-repo had removed the remote, which is a safety measure to avoid people accidentally pushing back.
6.) git push origin --force --all

7.) git push origin --force --tags

  • 3c155ed2...104a4742 NXaperture-1.0 -> NXaperture-1.0 (forced update)
  • 599ae549...aa891add NXarchive-1.0b -> NXarchive-1.0b (forced update)
  • 622d4862...5577d1f8 NXarpes-1.0 -> NXarpes-1.0 (forced update)
  • 46f94d00...d501e80b NXattenuator-1.0 -> NXattenuator-1.0 (forced update)
  • 65d51b9e...a01e6b3a NXbeam-1.0 -> NXbeam-1.0 (forced update)
  • 3e95cd14...0a08bd0a NXbeam_stop-1.0 -> NXbeam_stop-1.0 (forced update)
  • a363afc9...0a7aa3c4 NXbending_magnet-1.0 -> NXbending_magnet-1.0 (forced update)
  • ce9810d7...bb9a9bbc NXcanSAS-1.0 -> NXcanSAS-1.0 (forced update)
  • c475ba38...fb3b5ea8 NXcanSAS-1.1 -> NXcanSAS-1.1 (forced update)
  • 478cff4a...149f065e NXcapillary-1.0 -> NXcapillary-1.0 (forced update)
  • 2a8d13f2...7b7b8678 NXcharacterization-1.0 -> NXcharacterization-1.0 (forced update)
  • 0ab60633...2cdaadfe NXcite-1.0 -> NXcite-1.0 (forced update)
  • aa4731a9...9b1c4f61 NXcollection-1.0 -> NXcollection-1.0 (forced update)
  • 6ab8feb0...f9f04de7 NXcollimator-1.0 -> NXcollimator-1.0 (forced update)
  • 7044c565...97ce4498 NXcontainer-1.0 -> NXcontainer-1.0 (forced update)
  • f5e33457...34a7b6fc NXcrystal-1.0 -> NXcrystal-1.0 (forced update)
  • d7569931...3b3f6246 NXdata-1.0 -> NXdata-1.0 (forced update)
  • 83800c18...ee42276e NXdetector-1.1 -> NXdetector-1.1 (forced update)
  • 1fdc32da...e64f6f7b NXdetector_group-1.0 -> NXdetector_group-1.0 (forced update)
  • 7d147663...e1231002 NXdetector_module-1.0 -> NXdetector_module-1.0 (forced update)
  • 51578ae9...ac475c45 NXdirecttof-1.0b -> NXdirecttof-1.0b (forced update)
  • a518676b...528e28c0 NXdisk_chopper-1.0 -> NXdisk_chopper-1.0 (forced update)
  • 12286771...c7369619 NXelectrostatic_kicker-1.0 -> NXelectrostatic_kicker-1.0 (forced update)
  • e7d81c36...54b68f78 NXentry-1.0 -> NXentry-1.0 (forced update)
  • 86f19fcf...14ba784d NXenvironment-1.0 -> NXenvironment-1.0 (forced update)
  • f6494019...8e7ddc0e NXevent_data-1.0 -> NXevent_data-1.0 (forced update)
  • f6c01190...4aee2d72 NXfermi_chopper-1.0 -> NXfermi_chopper-1.0 (forced update)
  • a603156b...d956ee9f NXfilter-1.0 -> NXfilter-1.0 (forced update)
  • c476f7e7...939598e0 NXflipper-1.0 -> NXflipper-1.0 (forced update)
  • ca18b76c...1df67c4a NXfluo-1.0 -> NXfluo-1.0 (forced update)
  • 9be978b8...58533a09 NXfresnel_zone_plate-1.0 -> NXfresnel_zone_plate-1.0 (forced update)
  • 1810c198...f1359042 NXgeometry-1.0 -> NXgeometry-1.0 (forced update)
  • e9d42ac8...7aa8e91f NXgrating-1.0 -> NXgrating-1.0 (forced update)
  • 088ca75d...7a802c90 NXguide-1.0 -> NXguide-1.0 (forced update)
  • fd8896cb...f9376f57 NXindirecttof-1.0b -> NXindirecttof-1.0b (forced update)
  • 9754c468...0b2b3f54 NXinsertion_device-1.0 -> NXinsertion_device-1.0 (forced update)
  • 56bb9fc7...3cfb4592 NXinstrument-1.0 -> NXinstrument-1.0 (forced update)
  • 2cae3069...2e35abf4 NXiqproc-1.0b -> NXiqproc-1.0b (forced update)
  • b7decf19...3b636f95 NXlauetof-1.0b -> NXlauetof-1.0b (forced update)
  • 014b8cac...83d56d31 NXlog-1.0 -> NXlog-1.0 (forced update)
  • 5ed7436b...8f075611 NXmagnetic_kicker-1.0 -> NXmagnetic_kicker-1.0 (forced update)
  • 2f56d8d8...74d3b9d6 NXmirror-1.0 -> NXmirror-1.0 (forced update)
  • b9240a6f...cebb4081 NXmoderator-1.0 -> NXmoderator-1.0 (forced update)
  • 1bab3c84...76d7a5d5 NXmonitor-1.0 -> NXmonitor-1.0 (forced update)
  • 917ba9c2...9d23b0ef NXmonochromator-1.0 -> NXmonochromator-1.0 (forced update)
  • 4e725a76...baf9d8eb NXmonopd-1.0b -> NXmonopd-1.0b (forced update)
  • acc6db8a...2ad0f27a NXmx-1.4 -> NXmx-1.4 (forced update)
  • a4e4bf12...f2571777 NXnote-1.0 -> NXnote-1.0 (forced update)
  • b8b0823e...38939d7b NXobject-1.0 -> NXobject-1.0 (forced update)
  • de2f33b0...b4d22f94 NXorientation-1.0 -> NXorientation-1.0 (forced update)
  • 4c12d321...76f2f46a NXparameters-1.0 -> NXparameters-1.0 (forced update)
  • 4e4b0fcb...b5367412 NXpinhole-1.0 -> NXpinhole-1.0 (forced update)
  • 5245e669...c4a0205a NXpolarizer-1.0 -> NXpolarizer-1.0 (forced update)
  • 96101ebf...38bb5e87 NXpositioner-1.0 -> NXpositioner-1.0 (forced update)
  • fcd8b706...d9fc3190 NXprocess-1.0 -> NXprocess-1.0 (forced update)
  • ccaac29c...1701ad4b NXquadrupole_magnet-1.0 -> NXquadrupole_magnet-1.0 (forced update)
  • 9b2bd4e0...499ae2e6 NXreflections-1.0 -> NXreflections-1.0 (forced update)
  • ee6af2d5...379102bb NXreflections-1.1 -> NXreflections-1.1 (forced update)
  • 1d73e493...076c333f NXrefscan-1.0b -> NXrefscan-1.0b (forced update)
  • 273b92b7...4ea59047 NXreftof-1.0b -> NXreftof-1.0b (forced update)
  • 1885825a...c008a15b NXroot-1.0 -> NXroot-1.0 (forced update)
  • 0dd5ab69...246ac4fa NXsample-1.0 -> NXsample-1.0 (forced update)
  • 47833fac...6716ed20 NXsample_component-1.0 -> NXsample_component-1.0 (forced update)
  • c8501b5f...2851dd69 NXsas-1.0b -> NXsas-1.0b (forced update)
  • 23c917dc...dc368996 NXsastof-1.0b -> NXsastof-1.0b (forced update)
  • fe158542...fdb93365 NXscan-1.0b -> NXscan-1.0b (forced update)
  • 07160eb3...1ec3e46c NXsensor-1.0 -> NXsensor-1.0 (forced update)
  • 77c2a0ad...2c082f29 NXseparator-1.0 -> NXseparator-1.0 (forced update)
  • e5257583...8f6ccdea NXshape-1.0 -> NXshape-1.0 (forced update)
  • c4fb7dd3...f5b040ad NXslit-1.0 -> NXslit-1.0 (forced update)
  • 95a2c05b...727b026e NXsnsevent-1.0 -> NXsnsevent-1.0 (forced update)
  • e46978b3...1809a203 NXsnshisto-1.0 -> NXsnshisto-1.0 (forced update)
  • b8ec89ff...63c41576 NXsolenoid_magnet-1.0 -> NXsolenoid_magnet-1.0 (forced update)
  • 4f691a50...4225d9ea NXsource-1.0 -> NXsource-1.0 (forced update)
  • 44225e03...463ea8db NXspe-1.0 -> NXspe-1.0 (forced update)
  • 5a7dfc14...da704024 NXspecdata-1.0 -> NXspecdata-1.0 (forced update)
  • 140f8b59...8ff02ba8 NXspin_rotator-1.0 -> NXspin_rotator-1.0 (forced update)
  • 371be185...99908f59 NXsqom-1.0b -> NXsqom-1.0b (forced update)
  • c29d1ab4...e80e5815 NXstxm-1.1 -> NXstxm-1.1 (forced update)
  • 132916e2...081fad65 NXsubentry-1.0 -> NXsubentry-1.0 (forced update)
  • f3efbcc6...82624135 NXtas-1.0b -> NXtas-1.0b (forced update)
  • 9cb713b7...9461b3a4 NXtofnpd-1.0b -> NXtofnpd-1.0b (forced update)
  • e9f7bd72...6b805949 NXtofraw-1.0b -> NXtofraw-1.0b (forced update)
  • 0165b275...ad4b8908 NXtofsingle-1.0b -> NXtofsingle-1.0b (forced update)
  • 669badc1...69e8de54 NXtomo-2.0 -> NXtomo-2.0 (forced update)
  • 2a5e38bd...57fc6bba NXtomophase-1.0b -> NXtomophase-1.0b (forced update)
  • 39e95170...97a30155 NXtomoproc-1.0b -> NXtomoproc-1.0b (forced update)
  • 88e82eeb...a7366be9 NXtransformations-1.0 -> NXtransformations-1.0 (forced update)
  • dba3fcfb...2998054c NXtranslation-1.0 -> NXtranslation-1.0 (forced update)
  • bee9ccc4...a19aabca NXuser-1.0 -> NXuser-1.0 (forced update)
  • 093df0c0...96ca4c42 NXvelocity_selector-1.0 -> NXvelocity_selector-1.0 (forced update)
  • 6441ff85...4f44a19e NXxas-1.0 -> NXxas-1.0 (forced update)
  • 29876a82...33fb2671 NXxasproc-1.0 -> NXxasproc-1.0 (forced update)
  • ae582897...ee48d738 NXxbase-1.0b -> NXxbase-1.0b (forced update)
  • 0ab8f2fa...558bba9a NXxeuler-1.0b -> NXxeuler-1.0b (forced update)
  • 07600ad0...4067a357 NXxkappa-1.0b -> NXxkappa-1.0b (forced update)
  • 1eb538a9...1dca7eb3 NXxlaue-1.0b -> NXxlaue-1.0b (forced update)
  • 4541ed5a...c9eacb69 NXxlaueplate-1.0b -> NXxlaueplate-1.0b (forced update)
  • 71c00abb...d0517a91 NXxnb-1.0b -> NXxnb-1.0b (forced update)
  • a0fbf33a...3308f3a1 NXxraylens-1.0 -> NXxraylens-1.0 (forced update)
  • f0eb57de...c4b1df0e NXxrot-1.0b -> NXxrot-1.0b (forced update)
  • 4aa4215...6c37cd5 Schema-3.3 -> Schema-3.3 (forced update)
  • aa1ccd1...9057303 Schema-3.4 -> Schema-3.4 (forced update)
  • 09fd569...b8d07ce create_zenodo_doi -> create_zenodo_doi (forced update)
  • a82f846...6c1dba5 list -> list (forced update)
  • a3045fd...c35bf0f v2018.5 -> v2018.5 (forced update)
  • 5c4cfec...3719a11 v2020.1 -> v2020.1 (forced update)
  • 0649b8a...91f4196 v2020.10 -> v2020.10 (forced update)
  • 4f6ef8d...7d807be v2020.1rc1 -> v2020.1rc1 (forced update)
  • ddd9514...daf01db v2020.1rc2 -> v2020.1rc2 (forced update)
  • 5274c979...3a44429c v2022.06 -> v2022.06 (forced update)
  • 67c35c4...6ec6c44 v2022.06rc0 -> v2022.06rc0 (forced update)
  • 5d1e35f...09d50b6 v2022.06rc1 -> v2022.06rc1 (forced update)
  • ca8a3d5...385bad9 v2022.06rc2 -> v2022.06rc2 (forced update)
  • 7a506f0...fa8e077 v2022.06rc3 -> v2022.06rc3 (forced update)
  • e859836...2a9f158 v2022.06rc4 -> v2022.06rc4 (forced update)
  • 026f9e5f...bdca8f39 v2022.07 -> v2022.07 (forced update)
  • 54df1c1...7b9b201 v2024.02 -> v2024.02 (forced update)
  • 9d4e753...f07fb18 v3.1.0 -> v3.1.0 (forced update)
  • e888dac...a764c98 v3.2 -> v3.2 (forced update)
  • 14aecd1f...8a02d6f1 v3.3 -> v3.3 (forced update)
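The long command in step 3 is easier to review when the dropped paths are kept as a list. The following is a minimal sketch of assembling the same invocation; the helper name `filter_repo_args` is hypothetical, and only the `--invert-paths` and `--path` options from the command above are used.

```python
def filter_repo_args(paths_to_drop):
    """Build the git-filter-repo argument list that drops the given paths."""
    args = ["git-filter-repo", "--invert-paths"]
    for path in paths_to_drop:
        args += ["--path", path]
    return args


# Running it rewrites history, so only do this in a throwaway clone, e.g.:
#   import subprocess
#   subprocess.run(filter_repo_args(["pdf/", "_sources/"]), check=True)
```

Passing the result to `subprocess.run` as a list avoids any shell quoting issues with the path names.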

But now one has to contact GitHub to request them to remove dangling references ...

So in principle one could do that HOWEVER:
Changed commit SHAs may affect open pull requests in your repository.
We recommend merging or closing all open pull requests before removing files from your repository.

So I did the experiment, and we now have some understanding of it, but there are open PRs from us and others.
To prevent possible harm, I will therefore delete my fork now, as it can easily be recreated once
we have all PRs merged and closed, if we then want to pursue this force push.

@lukaspie @sanbrock above is my report about this exercise

Thanks for checking. To me, it doesn't really seem worth the effort, given the risk of having to recover things that could break. The repo is now relatively small and, thanks to the new deploy workflow, it will not grow much bigger in the future. Therefore, I vote we skip this git-filter-repo step.

I'll close the issue now, we can make a new one if we ever want to consider this again.

I think merging such a fork/PR will not remove any data, but rather add another 4k commits. This is certainly not what we want, so really only the force-pushing of the cleaned repo would do. But as Lukas commented, I would also not vote for such a breaking change, as you have nicely solved the main issue without this.

Thank you. It is good that we all understand that this PR was never meant to be actually merged, but to serve as an exercise for a person with an owner-equivalent role and rights who is interested in cleaning the repository.