bazelbuild/rules_python

pip_import builds source packages on workspace load

siddharthab opened this issue · 12 comments

pip_import currently installs all the wheels in a single repository, the same repository as the requirements.bzl file. That means loading the requirements.bzl file in WORKSPACE also builds source packages for which wheels are not available (e.g. pysam). This makes the initial load slow and requires all users to have the right environment set up to build source packages into wheels, even if they do no Python work at all.

It would be nice to have a "no-deps" mode where the requirements.txt file is expected to be the transitive closure (e.g. from pip freeze), and each package gets its own repository (relying on the --no-deps argument to pip wheel), so that packages are downloaded and, if necessary, built on demand rather than on first load.

This would also allow users to skip a dependency from pip altogether and provide their own substitute by appending to the requirement dict.

Something like the repository rule below, which takes a single package name and version, downloads the wheel, and calls whltool.par on the whl file. Users can provide their own requirements.bzl file with the requirement dict; a usage sketch follows the rule.

def _pip_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    exec_result = rctx.execute(["pip", "wheel",
                                "%s==%s" % (rctx.attr.package, rctx.attr.version),
                                "--no-deps",
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)
    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)
    whl_file = exec_result.stdout.rstrip()

    # Unpack the wheel and generate a BUILD file via whltool.par.
    args = [
        "python",
        rctx.path(rctx.attr._script),
        "--whl", whl_file,
        "--requirements", rctx.attr.requirements,
    ]

    if rctx.attr.extras:
        args += [
            "--extras=%s" % extra
            for extra in rctx.attr.extras
        ]

    result = rctx.execute(args)
    if result.return_code:
        fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

    rctx.execute(["rm", whl_file])

    return

pip_repository = repository_rule(
    attrs = {
        "package": attr.string(mandatory = True),
        "version": attr.string(mandatory = True),
        "requirements": attr.string(mandatory=True),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_repository_impl,
)
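Roughly how this gets wired up (the file paths, repository names, and version pins below are only illustrative): one pip_repository per pinned package in WORKSPACE, plus a hand-maintained requirements.bzl with the requirement dict, so a non-pip substitute can be dropped in without touching pip at all.

# WORKSPACE: one external repository per pinned package, fetched on demand.
load("//third_party:pip_repository.bzl", "pip_repository")

pip_repository(
    name = "pypi__six_1_11_0",
    package = "six",
    version = "1.11.0",
    # Label of the hand-maintained requirements.bzl, passed through to whltool.par.
    requirements = "//third_party:requirements.bzl",
)

# third_party/requirements.bzl: the hand-maintained requirement dict.
_requirements = {
    "six": "@pypi__six_1_11_0//:pkg",
    # A locally built substitute can simply be appended here instead of a pip package.
    "pysam": "//third_party/pysam",
}

def requirement(name):
    return _requirements[name.replace("-", "_").lower()]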

@siddharthab That is something we are quite interested in as well. Are you using that right now? If so, do you have an example of how it is being used?

Just looking further into it, imo loading all requirements is not so much the issue (though I agree it would be nice to have a separate repository for each requirement); the main problem seems to be the need for pip_install, which installs all requirements on load rather than on a Python build. If there were no pip_install function and deps were declared via @pip_import_repository_name//:pip_library, e.g. @py_deps//:freezegun or @py_deps//:thumbor, rather than through the current requirement function, that would already improve on the current situation.
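For illustration, with that approach a BUILD file would depend on the per-requirement targets directly (the repository and target names here are just examples):

py_library(
    name = "app",
    srcs = ["app.py"],
    deps = [
        "@py_deps//:freezegun",
        "@py_deps//:thumbor",
    ],
)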

But maybe a half-way solution does not make too much sense and one might as well go all the way. Or I wonder if there is a way to generate multiple external repositories from one repository_rule, similar to how it is done currently but without actually downloading/building the deps; then one would depend on the external repository directly, such as @pypi__argparse_1_4_0//:pkg.
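One option along those lines could be a plain macro that instantiates the per-package repository rule from the first comment once per pinned requirement, so nothing is fetched or built until a target actually depends on it. A rough, untested sketch (the load path and the pins are made up):

# pip_packages.bzl (sketch; pip_repository is the rule proposed above)
load("//third_party:pip_repository.bzl", "pip_repository")

_PACKAGES = {
    "argparse": "1.4.0",
    "six": "1.11.0",
}

def pip_packages(requirements_bzl):
    """Declares one external repository per pinned package."""
    for pkg, version in _PACKAGES.items():
        pip_repository(
            name = "pypi__%s_%s" % (pkg, version.replace(".", "_")),
            package = pkg,
            version = version,
            requirements = requirements_bzl,
        )

Targets would then depend on e.g. @pypi__argparse_1_4_0//:pkg directly.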

Yes. We have been using the scheme I mentioned above, and it has been working fine for us.

@siddharthab But then you have to specify pip dependencies as well as their sub-dependencies by hand, right?

I just made a quick-and-dirty hack based on the existing rules_python that takes a requirements.txt file and puts all those dependencies into the same external repository. So whenever one builds anything Python-related, all dependencies in this requirements.txt are fetched, but at least that only happens when something that needs them is being built.

Here is that code:

def _pip_requirements_repository_impl(rctx):
    if not rctx.which("python"):
        fail("python not found")
    if not rctx.which("pip"):
        fail("pip not found")

    rctx.file("BUILD", "")    

    # Generate a requirements.bzl whose requirement() maps a package name to
    # the corresponding sub-package of this repository.
    req = '''
def requirement(name):
    name_key = name.replace("-", "_").lower()
    return "@%s//" + name_key + ":pkg"
''' % (rctx.attr.name)
    rctx.file("requirements.bzl", req)

    exec_result = rctx.execute(["pip", "wheel",
                                "-r%s" % rctx.path(rctx.attr.requirements),
                                ] + rctx.attr.wheel_args,
                               environment = rctx.attr.wheel_env)

    if exec_result.return_code:
        fail("Failed to obtain package wheel: \n%s\n%s" % (exec_result.stdout, exec_result.stderr))
    if exec_result.stderr:
        print(exec_result.stderr)

    exec_result = rctx.execute(["find", ".", "-type", "f", "-name", "*.whl"])
    if exec_result.return_code:
        fail("Could not find package wheel: \n%s" % exec_result.stderr)

    whl_files = exec_result.stdout.rstrip().split("\n")

    # Create one sub-directory (and thus one Bazel package) per wheel, named
    # after the distribution part of the wheel filename.
    for whl_file in whl_files:
        whl_dir = str(whl_file).lower().split("-")[0]
        result = rctx.execute(["mkdir", whl_dir])

        if result.return_code:
            fail("mkdir failed: %s (%s)" % (result.stdout, result.stderr))

        args = [
            "python",
            rctx.path(rctx.attr._script),
            "--whl", whl_file,
            "--requirements", "@%s//:requirements.bzl" % rctx.attr.name,
            "--directory", whl_dir
        ]

        if rctx.attr.extras:
            args += [
                "--extras=%s" % extra
                for extra in rctx.attr.extras
            ]
        result = rctx.execute(args)
        if result.return_code:
            fail("whl_library failed: %s (%s)" % (result.stdout, result.stderr))

        rctx.execute(["rm", whl_file])

    return

pip_requirements_repository = repository_rule(
    attrs = {
        "requirements": attr.label(
            allow_files = True,
            mandatory = True,
            single_file = True,
        ),
        "extras": attr.string_list(),
        "wheel_args": attr.string_list(),
        "wheel_env": attr.string_dict(),
        "_script": attr.label(
            executable = True,
            default = Label("@io_bazel_rules_python//tools:whltool.par"),
            cfg = "host",
        ),
    },
    local = False,
    implementation = _pip_requirements_repository_impl,
)

Usage:

# WORKSPACE file
# (load pip_requirements_repository from wherever the rule above is defined)
pip_requirements_repository(
    name = "py_deps",
    requirements = "//:requirements.txt",
)
# BUILD file
load("@py_deps//:requirements.bzl", "requirement")
py_library(
    name = "lib",
    srcs = ["lib.py"],
    deps = [
        requirement("mock"),
    ],
)

So the usage stays exactly the same. It has not been extensively tested, though.

Yes, I think specifying every dependency manually is the bazel way. Without it, the versions of your dependencies are not fixed for every user, and someone running the repository rule at a later date may get different versions for some of the unstated dependencies.

To list all the dependencies, I created a small script for my users that installs the package in a virtualenv and gets a requirements.txt file out of it. It's not pretty, but it's the best I could come up with.

You are right, that is also what I see as the bazel way, but interestingly rules_nodejs has gone a different route by supporting package.json / lock files rather than requiring each dependency to be specified individually.

Either way, I am interested in the pip_repository approach you created. Can you by any chance share the script you put together to get the requirements.txt out of the virtualenv? It would be very useful.

#!/bin/bash
# Utility script to generate a requirements file for a transitive closure of
# dependencies of a package.
# The arguments to this script are arguments to pipenv install.

set -euo pipefail

readonly tmp_dir="$(mktemp -d)"

function cleanup {
  rm -rf "${tmp_dir}"
}
trap cleanup INT HUP QUIT TERM EXIT

cd "${tmp_dir}"
if ! command -v pipenv > /dev/null; then
  echo "pipenv not found; you can install it by running:"
  echo "  pip install --user pipenv"
  exit 1
fi
pipenv install "$@" > /dev/null

echo ""
echo "Your package requirements are:"
pipenv run pip freeze
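
Pass the same arguments you would give pipenv install (e.g. a package name with a version pin), then copy the printed pip freeze output into your requirements.txt.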

Awesome, thanks for sharing @siddharthab

rules_python 0.1.0 has been released, which upstreams the rules_python_external repo. Please switch from pip_import to pip_install, which doesn't have this issue.
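For anyone migrating, the pip_install entry point looks roughly like this; please double-check the load path and attribute names against the docs for the rules_python release you are on:

# WORKSPACE (sketch; verify against the rules_python docs for your version)
load("@rules_python//python:pip.bzl", "pip_install")

pip_install(
    name = "py_deps",
    requirements = "//:requirements.txt",
)

# BUILD files then use the generated requirement() helper as before:
# load("@py_deps//:requirements.bzl", "requirement")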

@alexeagle could you please clarify what was fixed in version 0.1.0? We're using this version and it still downloads and installs every dependency from the source requirements.txt on the initial run, even though the bazel target we run only has a few dependencies out of the whole list.

Sorry, you're totally right. I misunderstood the implementation when I wrote that and forgot to update here.

The correct resolution is
#432