piskvorky/smart_open

Don't package all cloud dependencies at once

Tyrannas opened this issue · 31 comments

Problem description

smart_open is a really useful library, but I find it a bit annoying to have all the cloud dependencies packaged with it. Most of the time one doesn't need to manipulate both S3 and GCS (and maybe Azure storage too, if that were to be integrated into smart-open).

What I wish I could do:

pip install smart-open would keep the same behaviour as now
pip install smart-open[s3] would only install the boto3 dependencies
pip install smart-open[gcs] same for GCS
...
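
For illustration, roughly what I have in mind on the packaging side (just a sketch; the extra names and dependency lists here are only examples):

from setuptools import setup

setup(
    name='smart_open',
    extras_require={
        's3': ['boto3'],
        'gcs': ['google-cloud-storage'],
    },
    # core install_requires stays minimal
)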

Note:

If you find it interesting I can assign this to myself and work on the project

BTW: I think it's the same situation for gensim; it packages boto3, which is not needed for common NLP tasks.

I was just about to open this exact same issue, what are the odds!

Thanks for the feedback (and offer to work on it!). We'll discuss with @mpenkov .

Related to #440.

@Tyrannas @mwozniczak Out of interest, what is the problem with smart_open installing boto3, gcs, etc. by default? The installation is automatic, so you don't really need to do anything.

(It's a genuine question. It's obvious that this is an issue for some people, and I'd like to understand why.)

@mpenkov speaking from my experience, there are two main factors: one is that there's a fairly hard limit on AWS Lambda package size, and when you've already got some heavy libs like numpy+pandas, it starts to add up.

The second one is that, for some reason, I needed to add an explicit reference to google-api-core in my requirements, otherwise it did not work (even though all dependencies were installed). And next thing you know, you have to justify yourself on every relevant merge request review as to "why the hell do we need Google APIs if we're doing AWS Lambdas exclusively".

@mpenkov Same remarks for us: we generally try to keep the number of requirements as low as possible. On bigger projects it reduces the likelihood of version conflicts.

Other concerns are build time + Docker image size. With fewer requirements we get smaller Docker images and shorter build times; it's a more efficient process.

OK, those reasons make sense to me.

@Tyrannas I'm glad you're interested in implementing this. I'm in the middle of a refactoring right now (#408) - once that's merged, please make a PR.

@Tyrannas The PR has been merged, please go ahead.

Alright, I'll fork and create a WIP PR

Update on this: @mwozniczak @mpenkov
I said that "pip install smart_open" would still be the same, well it seems that one cannot package 'default' extras: pypa/setuptools#1139

This means that if I modify the setup to allow pip install smart_open[boto3] or pip install smart_open[gcs], then pip install smart_open with no extras would only install requests.

Which means that users who use S3 or GCS without pinning a version in their dependencies would then have to either change their requirement to smart_open[boto3] or pin an older version.

Are you ok with that?

@piskvorky WDYT? Seems like it will inconvenience more people than it would help.

@Tyrannas Is there a way to explicitly disable certain extras? That way, people who need a lean install can request one.

A hacky way would be to use an environment variable to specify extras explicitly. When empty, assume the user wants all the extras. So, if you just want S3, you'd do:

SMART_OPEN_EXTRAS=s3 pip install smart_open

and then handle that in setup.py as something like:

import os  # os.environ.get avoids a KeyError when the variable is unset

if os.environ.get('SMART_OPEN_EXTRAS') == 's3':
    install_requires = [..., 'boto3']

This is un-Pythonic, but satisfies the use case without breaking default behavior.
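
A slightly fuller sketch of that idea (the dependency lists here are just placeholders), where an unset or empty variable keeps the current install-everything behaviour:

import os
from setuptools import setup

extras = {
    's3': ['boto3'],
    'gcs': ['google-cloud-storage'],
}

requested = os.environ.get('SMART_OPEN_EXTRAS', '')
if requested:
    # e.g. SMART_OPEN_EXTRAS=s3,gcs installs only those dependency groups
    names = [name.strip() for name in requested.split(',')]
else:
    # unset or empty: keep the current behaviour and install everything
    names = list(extras)

setup(
    name='smart_open',
    install_requires=[dep for name in names for dep in extras.get(name, [])],
    # ...
)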

Hmm, I'm not a big fan of the env variable option, but I guess it could work.
Another, more Pythonic workaround would be (according to the pip documentation) to pass parameters to setup.py and modify install_requires depending on those parameters.
This option seems better since I think it can be handled in a requirements.txt file.

so:
pip install smart_open --install-option="--module='boto3'"
would seem to work (haven't tried it yet, though); a rough sketch of the setup.py side is below. @gfournier any opinion on this?
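
Something like this (untested; the --module flag is hypothetical), with the custom flag pulled out of sys.argv before setup() runs so that setuptools doesn't reject it as an unknown option:

import sys

# hacky: strip our hypothetical --module=... flag out of sys.argv before
# calling setup(), because setuptools itself doesn't know about it
modules = None
for arg in list(sys.argv):
    if arg.startswith('--module='):
        modules = arg.split('=', 1)[1].strip("'\"").split(',')
        sys.argv.remove(arg)

# install_requires would then be built from `modules`, much like in the
# environment-variable sketch above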

Yes, I'd prefer the default to remain "use everything" = the current behaviour = the most common use-case (most people don't care).

But I like the idea of a lean install too, it's a valid use-case. Not sure what's the best way to achieve that.

How about pip install smart_open --install-option="--install_only=s3,gcs,ftp"? Or maybe --global-option, I don't really understand the difference.

Yeah, --install-option is probably better than env vars.

pip install smart_open --install-option="--module='boto3'"

I would avoid letting the user specify which specific packages to install. That's really a task for pip, not for us, because it could involve dependency management problems. We could do:

pip install smart_open --install-option="--no-extra"

and then let people pip install the packages they need themselves.

@piskvorky's suggestion also looks good:

pip install smart_open --install-option="--install_only=s3,gcs,ftp"

but it is also open to dependency problems (e.g. conflicts between the packages required by S3 and GCS).

Well, I'm afraid those solutions would not be user-friendly at all, and they are not easily packageable (which is the real problem here). What I suggest is to hope for an update on the setuptools issue before changing smart_open's packaging, if that suits everyone.

@Tyrannas not sure about that. If the default stays "install everything", then I think I personally would be okay with @mpenkov's idea, as it's more of a power-user feature anyway.

side note, are the extras-related values magic in some way in how setuptools works with them? I mean, a lot of the setup() function's parameters can be generated on the fly, as they're pretty much just python code (hell, you could even use a dict and go with a setup(**my_dict) syntax if need be, had to go that route myself at one point). is this different?

@mwozniczak yeah, I know, but what I was saying is that it's not an easily packageable solution. You can indeed write --install-option in a requirements.txt, but it seems that it can lead to some side effects: pypa/pip#5795 (comment)

OK, so it sounds like there are no perfect solutions here.

  1. Wait for pypa/setuptools to fix this (don't know when or if this will ever happen)
  2. Environment variables (ugly, fragile)
  3. --install-option (has side effects)

Out of the above, I think (3) is the way to go, because:

  1. It solves the problem for people who want a lean install
  2. It doesn't create any problems for existing users (fully backward compatible)
  3. The side effects don't really matter to us (smart_open does not use wheels)

Thoughts?

I'm not a big fan of --install-option because it's rarely used when installing packages; people usually don't know about it. The common way is pip install smart_open[flavor].

There is some more discussion about this here: https://stackoverflow.com/questions/8874638/how-to-maintain-pip-install-options-in-requirements-file-made-by-pip-freeze. We need to test whether the pip freeze command prints out the --install-option.

Another thing to test: --install-option in a conda YAML environment file.
From: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#using-pip-in-an-environment

name: stats2
channels:
  - javascript
dependencies:
  - python=3.6   # or 2.7
  - bokeh=0.9.2
  - numpy=1.9.*
  - nodejs=0.10.*
  - flask
  - pip:
    - Flask-Testing

We should test whether adding --install-option to a pip line causes a failure or not.

My last thought: if (when?) pypa/setuptools fixes the issue, will we break compatibility by removing --install-option and replacing it with smart_open[flavor]? If we agree on "yes", we can go for (3); otherwise I would wait for a fix (or at least a response from the pypa/setuptools maintainers).

My last thought: if (when?) pypa/setuptools fixes the issue, will we break compatibility by removing --install-option and replacing it with smart_open[flavor]? If we agree on "yes", we can go for (3); otherwise I would wait for a fix (or at least a response from the pypa/setuptools maintainers).

Is it really necessary to answer that question now? Personally, I'd deal with it when and if that happens.

@mpenkov if my opinion carries any weight, option 3 would be perfectly fine for me :)

I would suggest using extras_require in the setup() call. This is how we do it for our internal packages. We add some code to make it clear how to fix the issue. Airflow has a ton of extras_require entries. You can see that here.

import sys

try:
    import boto3
except ImportError:
    sys.stderr.write("Do pip install smart_open[aws] to use this module\n")
    sys.exit(1)

It would obviously be a breaking change, but to my knowledge extras_require is the standard convention for solving this problem. We use pip-tools to build our application requirements.txt files, and to my knowledge any approach other than extras_require would not work with those tools.

The envar solution is something I have yet to see in the wild, probably for good reason, and would be a non-starter, at least IMO.

  1. If you force it to be set, any downstream packages/apps that may or may not use this package will also have to set it.
  2. You now have to validate whatever that envar is set to. (ugly)

If you provide a bad value for an extra, pip will complain. Try pip install boto3[bad].

This would seem to go well w/ #428 and be like a 2.0 release or however you're versioning.

@mpenkov I'm OK with pip install smart_open[s3]/pip install smart_open[all] stuff

Just as an FYI, this is the requirements.txt generated from a requirements.in file with just this package as a dependency.

#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile requirements.in
#
boto3==1.12.33            # via smart-open
botocore==1.15.33         # via boto3, s3transfer
cachetools==4.0.0         # via google-auth
certifi==2019.11.28       # via requests
chardet==3.0.4            # via requests
docutils==0.15.2          # via botocore
google-api-core==1.16.0   # via google-cloud-core
google-auth==1.12.0       # via google-api-core, google-cloud-storage
google-cloud-core==1.3.0  # via google-cloud-storage
google-cloud-storage==1.26.0  # via smart-open
google-resumable-media==0.5.0  # via google-cloud-storage
googleapis-common-protos==1.51.0  # via google-api-core
idna==2.9                 # via requests
jmespath==0.9.5           # via boto3, botocore
protobuf==3.11.3          # via google-api-core, googleapis-common-protos
pyasn1-modules==0.2.8     # via google-auth
pyasn1==0.4.8             # via pyasn1-modules, rsa
python-dateutil==2.8.1    # via botocore
pytz==2019.3              # via google-api-core
requests==2.23.0          # via google-api-core, smart-open
rsa==4.0                  # via google-auth
s3transfer==0.3.3         # via boto3
six==1.14.0               # via google-api-core, google-auth, google-resumable-media, protobuf, python-dateutil
smart-open==1.10.0        # via -r requirements.in
urllib3==1.25.8           # via botocore, requests

# The following packages are considered to be unsafe in a requirements file:
# setuptools

It would only get worse with Azure support.

Just a suggestion, but might it make some sense to extract the core functionality and name it something like smart-open-core (probably not exactly that, I know the words "open core" are a bit of a sore subject in the open-source community ;))?

that repo would have the flavors solely as extras, as was suggested, and this current project would just have smart-open-core[all] as a dependency, for backwards compatibility?

Well @mpenkov, your call. I'm fine with these last two solutions!

It seems like pulling them out into a "core" package would cause developer toil. Every addition/deletion of functionality would require two releases just for the sake of backwards compatibility. Given this doesn't really have to happen now, you could provide the extras_require keys in advance so people can prep their code bases for the drop, e.g.:

from setuptools import setup

# placeholder lists: fill these with the actual dependencies for each backend
core_deps = []
aws_deps = []
gcp_deps = []
azure_deps = []
all_deps = core_deps + aws_deps + gcp_deps + azure_deps

setup(
    # keep aws/gcp in install_requires for now, for backwards compatibility
    install_requires=core_deps + gcp_deps + aws_deps,
    extras_require={
        "aws": aws_deps,
        "azure": azure_deps,
        "gcp": gcp_deps,
        "all": all_deps,
    },
)

The first PR to introduce this could be the #228 PR as it will introduce Azure dependencies. If you look at install_requires you'll see I left those out. They would only be available if you do pip install smart_open[azure|all]. Introducing it with the Azure dependencies will avoid those being added. Just give a window of time for people to make the updates.

So with this solution I could change my dependency today to smart_open[aws], knowing it will still install everything (except Azure), and then whenever you actually make the breaking change, the dependencies I didn't ask for simply stop being installed.

The fix for the backwards compatibility issue seems trivial compared to having to manage two PyPI packages. Every time you added a new feature you'd have to mimic its import path in the wrapping package. Same with deletions. Then you've got to version and deploy it. Whereas fixing the backwards compatibility issue with dependencies is just a matter of changing your install to smart_open[all]. I could understand backwards compatibility being a bigger issue if this were a functional/architectural change, but it's not.

As far as my last suggestion goes, that would be something I'd release sooner rather than later. You wouldn't even have to include the Azure packages. Just make it a solo release and clearly explain the upcoming breaking change in the changelog.

@Amertz08 Well, I'm no packaging expert, so I'll just have to defer to other people's expertise on the matter.