rasbt/watermark

Wrong package versions within virtualenv

mrbell opened this issue · 17 comments

I'm not sure how widespread this issue is, or if it's something particular to my setup, but when I use watermark to report package information (using -p) within a jupyter notebook that is running in a virtualenv, version information about system level packages are reported rather than packages installed within my environment. If a package is installed in my virtualenv and not installed at the system level, then I end up getting DistributionNotFound exceptions when I try to report the version number using watermark.

The problem stems from the call to pkg_resources.get_distribution. If I use get_distribution directly from within my notebook to report the version of e.g. numpy it shows my system level numpy info (v 1.11.0) rather than info about numpy installed within my virtualenv (v 1.11.1). For example:

pkg_resources.get_distribution('numpy')
numpy 1.11.0 (/usr/local/lib/python2.7/site-packages)

Similarly

%watermark -p numpy
numpy 1.11.0

But when I check the version of what is imported:

import numpy as np
np.__version__
'1.11.1'

If I use pkg_resources from a python console or script running within a virtualenv, it reports the proper information about packages. But if I use it from an ipython console or jupyter notebook it reports system level information.

If this is a widespread issue, should watermark find another more robust solution to getting package information other than using pkg_resources?

rasbt commented

Thanks for the info; it's definitely worth looking into it!

It works just fine for me via conda, so maybe just a few questions before we dig deeper into this issue ... ;)

  • Have you installed the latest watermark (1.3.1), the one that is installed as a normal package now instead of using %install_ext? Have you had the latter installed previously, and are the maybe some leftovers from this installation in you ipython extension directory?
  • have you checked that you are running the virtual env's jupyter notebook, e.g., by a which jupyter on the command line and import sys; sys.argv inside the notebook?

If this is a widespread issue, should watermark find another more robust solution to getting package information other than using pkg_resources?

I am definitely open to improvements if this is indeed a "bug" in this case. Do you have any recommendations? Ideally something that doesn't require external dependencies?

Thanks for the quick response!

  • I have installed the latest watermark and have not had previous versions installed.
  • I do not have jupyter installed in my environment. which jupyter gives /usr/local/bin/jupyter

I ran some tests in fresh environments both with and with jupyter installed in the environment. The test fails without jupyter installed (as reported) but passes when jupyter is installed.

So installing jupyter within the environment fixes the issue. Maybe that's the right thing to do anyway? I haven't ever done it before, though. It's probably worth noting this as a requirement in the watermark documentation. Or maybe adding jupyter as a requirement that is installed along side watermark?

An aside on the ipython console:

I don't know if people use watermark from the ipython console, but if so, simply installing ipython in the virtualenv may not work if the system ipython has been run previously within the virtualenv. Here is an example:

  1. Create and activate a fresh virtualenv, do not install ipython in the virtualenv
  2. Launch the ipython console
    • the system ipython will be called (but like jupyter it recognizes that you're in a virtualenv so typical imports of virtualenv installed packages are fine)
  3. Exit ipython
  4. Now pip install watermark, which installs ipython
  5. which ipython will report the virtualenv ipython executable. So we're good, right?
  6. Launch an ipython console
  7. import sys; sys.argv shows that the system ipython console is still being used
  8. watermark/pkg_resources refer to system level packages, not those in the virtualenv
rasbt commented

So installing jupyter within the environment fixes the issue. Maybe that's the right thing to do anyway?

Hm, I usually install ipython/jupyter notebook as well in virtual environments, but maybe that's not a "normal" thing to do ;).

Just to summarize your points to make sure that I understand it correctly:

  • if you run the system jupyter notebook in a virtual environment, watermarks gets version numbers from system python packages
  • if you run the virtualenv's jupyter notebook in a virtual environment, it watermarks gets version numbers from virtualenv python packages

But you are saying that if you run the system's jupyter notebook in a virtual environment, it will import the virtual environment's packages, not the system python packages, right?

So, I guess what we want watermark to report the package version of the package that the jupyter notebook actually imports, no matter whether it's a system or virtual environment notebook. How to do this elegantly I don't know, yet. Hopefully, there's a way to achieve this via pkg_resources.get_distribution, I will look a bit around in hope to find a good solution. And please let me know if you have any solution or ideas! :).

But you are saying that if you run the system's jupyter notebook in a virtual environment, it will import the virtual environment's packages, not the system python packages, right?

Right.

we want watermark to report the package version of the package that the jupyter notebook actually imports, no matter whether it's a system or virtual environment notebook

This is the behavior I would expect.

I will give it some thought and let you know if I have any bright ideas.

rasbt commented

Unfortunately, I don't have a solution at hand (yet), but I just checked in via conda (running the non-virtual environment jupyter notebook in an active virtual conda environment, and it imports the "correct" package, that is, the package from the active virtual environment. So my guess is it's somehow an issue that is related to virtualenv specfically, maybe how they handle paths. I also found a related issue on SO (but without solution unfortunately).

http://stackoverflow.com/questions/37779083/pkg-resource-detect-the-wrong-data-files-path-in-virtualenv

Reporting the same package info that is available via pip freeze should work, right? The installed distributions can be retrieved in code via the pip package (see e.g. this SO answer)

So that would add pip as a dependency (which seems reasonable), but then _get_packages could look something like:

if self.out:
    self.out += '\n'
packages = pkgs.split(',')
installed_dists = {i.key: i.version for i in pip.get_installed_distributions()}
for p in packages:
    self.out += '\n%s %s' % (p, installed_dists[p])
rasbt commented

Yeah that could work. Plus I think that an additional flag --list_all_pkg or so could be added to watermark (would not recommend it as a default though since it may be a long list of packages not relevant to the project in certain cases).

Have you checked if this grabs the correct numpy version in your system jupyter nb, i.e., the virtualenv one?

Let me know if you'd like to submit it as a pull request, otherwise, I am happy to give it at try on the weekend.

I think it would also be worthwhile posting this issue on the virtualenv issue tracker -- I think this is likely not desired behavior (e.g., since pkg_resources.get_distribution works fine in conda's virtual environment).

Yes, this grabs the correct version for me when I use the system jupyter notebook.

I'll submit a PR later this evening.

rasbt commented

Alternatively, we could use something like

import sys 

module_names = ['numpy']
for m in module_names:
    if hasattr(sys.modules[m], '__version__'):
        print(m, sys.modules[m].__version__)

I just see that pip is also using pkg_resources

def get_installed_distributions(local_only=True,
                                skip=stdlib_pkgs,
                                include_editables=True,
                                editables_only=False,
                                user_only=False):
    """
    Return a list of installed Distribution objects.

    If ``local_only`` is True (default), only return installations
    local to the current virtualenv, if in a virtualenv.

    ``skip`` argument is an iterable of lower-case project names to
    ignore; defaults to stdlib_pkgs

    If ``editables`` is False, don't report editables.

    If ``editables_only`` is True , only report editables.

    If ``user_only`` is True , only report installations in the user
    site directory.

    """
    if local_only:
        local_test = dist_is_local
    else:
        def local_test(d):
            return True

    if include_editables:
        def editable_test(d):
            return True
    else:
        def editable_test(d):
            return not dist_is_editable(d)

    if editables_only:
        def editables_only_test(d):
            return dist_is_editable(d)
    else:
        def editables_only_test(d):
            return True

    if user_only:
        user_test = dist_in_usersite
    else:
        def user_test(d):
            return True

    return [d for d in pkg_resources.working_set
            if local_test(d) and
            d.key not in skip and
            editable_test(d) and
            editables_only_test(d) and
            user_test(d)
            ]

I would really prefer a solution without adding additional dependencies here if possible

The sys.modules solution works for me. I'll submit a new PR dropping the pip requirement.

I also have a separate issue (must be 2.7 specific?) using the -w flag. __version__ is called without being referenced to list the watermark version. The fix that I included in PR #16 doesn't work in python 3 apparently. How about adding a _version.py file with the version string in it and that can be referenced from both __init__.py and watermark.py?

rasbt commented

Thanks for the PR!

The sys.modules solution works for me. I'll submit a new PR dropping the pip requirement.

Yeah, I think that would be a leaner solution!

I would extend it a bit to cover some edge cases that don't have a __version__ attribute. I think I've seen some older ones with .version

import sys 

module_names = ['numpy']

for m in module_names:
    s += '%s ' % m
    if hasattr(sys.modules[m], '__version__'):
        s += sys.modules[m].__version__
    elif hasattr(sys.modules[m], 'version'):
        s += sys.modules[m].version
    else:
        s += 'n/a'
s = s.rstrip()

I was just suggesting something like that regarding the version :-)

How do you feel about throwing a custom exception if m is not in sys.modules? As it stands a KeyError would be thrown and that may not too helpful for many people.

rasbt commented

How do you feel about throwing a custom exception if m is not in sys.modules? As it stands a KeyError would be thrown and that may not too helpful for many people.

Yeah, great idea!

btw. don't feel rushed with implementing everything, I am probably not able to take a close look at the PR before the weekend!

rasbt commented

Oooops, my sys.modules proposal was a bit naive as I just figured out after the pull request. It worked fine first, but then it didn't. So, this method currently only works if the module has been imported _before_ calling watermark. E.g., if watermark is the first cell in a notebook, it won't find the packages. I guess we need to fix that again ... :(

Let's not rush it though and think about some better options (and if nothing else works, just using pip instead)

PS: Reminds me that I want to improve the CI at some point #14

Sorry, didn't mean to rush. I just had time last night and thought I'd take the opportunity to try to take care of it.

Here is an admittedly hacky solution but it works:

for p in packages:
    version_str = subprocess.check_output(
        '''python -c "import pkg_resources as pr; print(pr.get_distribution('{:}'))"'''.format(p),
        shell=True
    ).rstrip().split()[1]

    self.out += '\n%s %s' % (p, version_str)

This calls the virtualenv python, so it returns the correct package info.

I tried combining the for loop into a one-liner call to subprocess but for whatever reason couldn't get it to work, e.g.

subprocess.check_output(
    '''python -c "import pkg_resources as pr; print([pr.get_distribution(m) for m in [{:}]])"'''.format(
        "'{:}'".format("','".join(packages))
    ), 
    shell=True
)
rasbt commented

No worries! Instead of using subprocess, wouldn't a simpler __import__('numpy') work?

__import__('numpy')
m, sys.modules['numpy'].__version__
...

EDIT:

Alternatively,

>>> import importlib
>>> importlib.import_module("numpy").__version__
'1.11.1'

would work as well.But I am not sure if there's an advantage vs __import__

>>> __import__('numpy').__version__
'1.11.1'

What do you think about something like:

packages = ['numpy']

for p in packages:
    imported = __import__(p)
    # will raise an ImportError: No module named p
    # so we could remove the PackageNotFoundError(Exception)
    # I guess

    try:
        ver = imported.__version__
    except AttributeError:
        try: 
            ver = imported.version
        except AttributeError:
            try: 
                ver = imported.version_info
            except AttributeError:
                ver = 'n\a'


    print(ver)

I think this could work if I am not overlooking something :)

rasbt commented

Just had some extra time and it is (hopefully) fixed now (#19)! Thanks for the discussion and help, without your help, I probably wouldn't have discovered this tricky version bug!