pangeo-data/pangeo-stacks

clean out package dir to reduce image size?


Our docker images are storing about 2.7 GB worth of conda packages in /srv/conda/pkgs:

$ du -h -d1 /srv/conda
4.0K    /srv/conda/envs
2.7G    /srv/conda/pkgs
4.0K    /srv/conda/compiler_compat
31M     /srv/conda/bin
128K    /srv/conda/etc
4.0K    /srv/conda/conda-bld
25M     /srv/conda/conda-meta
539M    /srv/conda/lib
12K     /srv/conda/x86_64-conda_cos6-linux-gnu
7.6M    /srv/conda/include
8.0K    /srv/conda/ssl
8.0K    /srv/conda/man
314M    /srv/conda/share
12K     /srv/conda/shell
92K     /srv/conda/libexec
412K    /srv/conda/sbin
640K    /srv/conda/mkspecs
8.0K    /srv/conda/condabin
20K     /srv/conda/docs
12K     /srv/conda/translations
36K     /srv/conda/doc
332K    /srv/conda/plugins
4.0K    /srv/conda/phrasebooks
156K    /srv/conda/qml
12K     /srv/conda/var
3.6G    /srv/conda

Do we actually need this? Can we clean it out and drastically reduce the size of the images?

Any thoughts on this, anyone?
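For reference, a minimal sketch of what such a cleanup could look like (assuming the /srv/conda prefix shown above; conda clean is the standard tool for this):

# Remove cached tarballs and the extracted packages in /srv/conda/pkgs
# (safe once the env is built; conda re-downloads anything it needs later).
conda clean --all --yes

# Check how much space was reclaimed:
du -h -d1 /srv/conda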

jupyterhub/repo2docker#638 implements this in repo2docker.

I'm going to re-open this with the goal of reducing our image sizes. The base image is still at 950 MB compressed and 2.7 GB pulled:

pangeo/base-notebook 2019.06.24 e51d49f3c1ed 7 hours ago 2.7GB
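To see which layers account for that size, docker history gives a per-layer breakdown (a quick sketch using the tag above):

# Per-layer sizes, newest layer first:
docker history pangeo/base-notebook:2019.06.24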

Some related resources and discussion:
https://jcrist.github.io/conda-docker-tips.html
pangeo-data/pangeo-cloud-federation#305
jupyterhub/repo2docker#714

@scottyhq are you planning to work on this? I can put some time into it if you want (ironically, while waiting for my new images to be pulled 😄)

The issue tracking repo2docker speed-ups, smaller images, etc. is jupyterhub/repo2docker#707.

I'm not going to be working on this myself in the near future, so any contributions are welcome ;) I just wanted to connect some dots.

Looked into this a bit last night. In terms of what pangeo-stacks can fix itself, #116 is the biggest offender I think.

I'm going a bit further up the stack now. There's a lot in the base image from repo2docker that we probably don't need.

  1. The R2D base image uses buildpack-deps:bionic (397 MB). This is probably larger than we need, since it includes things like GCC.
  2. We have two installs of nodejs / npm: one from apt-get in /usr/local and one in the conda notebook env.
  3. Conda env: there are a few libraries that may not be appropriate for a base image if we're going for minimal size:
  • nbconvert (120 MB). From the base r2d notebook env. Brings in pandoc & pandoc-citeproc.
  • nteract_on_jupyter (142 MB). From the base r2d env. Not sure why it's so large yet.

I'll investigate a bit more before reporting those upstream.
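For anyone who wants to repeat the digging, this is roughly how to locate the heavy pieces (a sketch; paths assume the /srv/conda prefix from the first comment):

# Largest extracted packages in the cache:
du -sh /srv/conda/pkgs/*/ 2>/dev/null | sort -rh | head -n 20

# Largest subtrees of the install itself:
du -h -d 2 /srv/conda | sort -rh | head -n 20

# Confirm the duplicate nodejs/npm installs from point 2:
type -a node npm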

> In terms of what pangeo-stacks can fix itself, #116 is the biggest offender I think.

As for what is not directly related to pangeo-stacks, one thing we (I?) could work on is resuming the static library split on conda-forge.

These are some numbers I presented at the Seattle pangeo meeting for a basic geospatial env with conda-forge:

1.8G	GEO  # old
1.7G	CONDA_GEO  # current (only a few splits, stopped at libnetcdf)
452M	no-static_GEO  # removing all `.a` files from the env.

Note that one can already remove all the .a files from the envs with something like:

find /opt/conda/ -follow -type f -name '*.a' -delete

If you are not doing that already, you can definitely try it. The splitting in conda-forge will still give you the choice to install those files when you need them (which is rare, but not never).
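One Docker-specific caveat worth spelling out: deleting files in a later layer does not shrink the image, because the files still exist in the earlier layer. So the install and the cleanup have to happen in the same RUN instruction. A hedged Dockerfile sketch (the package list is a placeholder):

# Install, then clean, in a single layer so the deleted files are
# never baked into an intermediate layer:
RUN conda install --yes <packages> \
 && conda clean --all --yes \
 && find /srv/conda -follow -type f -name '*.a' -delete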

> 2. We have two installs of nodejs / npm: one from apt-get in /usr/local and one in the conda notebook env.

Thanks for reviving this discussion, @TomAugspurger. For some perspective on nodejs coming from two places, see this discussion on repo2docker from a while back: jupyterhub/repo2docker#728. It might now be possible to get it just from conda-forge if you want to re-raise the issue there.

Another way to reduce size is to remove everything related to Qt. Cloud Jupyter deployments rarely need it.

To avoid that, one must replace jupyter with jupyter_core in the first layer of this stack (repo2docker?). Here are some numbers:

271M	JUPYTER_AND_JUPYTERLAB
192M	JUPYTER-CORE_AND_JUPYTERLAB

~80 MB difference.
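If anyone wants to reproduce that comparison, it is roughly (a sketch; the env paths are arbitrary):

conda create --yes -p /tmp/with-jupyter -c conda-forge jupyter jupyterlab
conda create --yes -p /tmp/with-jupyter-core -c conda-forge jupyter_core jupyterlab
du -sh /tmp/with-jupyter /tmp/with-jupyter-core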

Also, pangeo-notebook is pulling the mkl package (mkl-2019.5 | 205.2 MB)! See the logs or pull the 'latest' image:
https://github.com/pangeo-data/pangeo-stacks/runs/428527057?check_suite_focus=true

Is there an easy way to see what package pulls in mkl as a dependency? It's not listed explicitly in our environment.yml: https://github.com/pangeo-data/pangeo-stacks/blob/master/pangeo-notebook/binder/environment.yml
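One low-tech way to answer that: every installed package records its dependencies in a JSON file under conda-meta, so grepping those files shows which packages declare mkl in their depends list (a sketch, assuming the /srv/conda prefix used in these images):

# Matches the dependents of mkl, plus mkl's own metadata file:
grep -l '"mkl' /srv/conda/conda-meta/*.json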

You can typically include the nomkl package, which should prevent it from being pulled.

> You can typically include the nomkl package, which should prevent it from being pulled.

I was investigating this, but it looks like both workarounds, nomkl and installing blas=*=openblas, no longer work. Not sure what is happening, but we added mkl to conda-forge recently and that may be the culprit.

Update on the mkl package problem here. Adding nomkl won't work because conda-forge's blas implementation has drifted a little from defaults (the whole discussion is in our gitter channel if someone is interested).

We will probably solve that with conda-forge/staged-recipes#10922. Note that the conda-forge nomkl will not remove mkl or prevent it from getting into the env here! It will only cause a conflict with the package that is pulling in mkl. We can then debug why that package is doing it (e.g., do we have an openblas version of it? Is the mkl variant getting precedence over the openblas variant due to an error? etc.).
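For anyone reading this later: under the newer conda-forge blas packaging, the implementation is selected through the libblas build string rather than through nomkl, along these lines (hedged; the exact pin depends on the channel state at install time):

# Force the openblas variant of the blas metapackages:
conda install -c conda-forge "libblas=*=*openblas"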