Utils for streaming large files (S3, HDFS, gzip, bz2...)

Primary language: Python. License: MIT.

smart_open -- utils for streaming large files

What?

smart_open is a Python 2 and Python 3 library for efficiently streaming very large files from/to S3, HDFS, WebHDFS, or local (compressed) files. It is well tested (using moto), well documented, and sports a simple, Pythonic API.
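
As a rough illustration of what streaming a local compressed file means, the sketch below uses only the Python 3 standard library. The helper name iter_lines and its extension-based dispatch are assumptions made for this example; this is not smart_open's implementation, and the exact entry point of the library should be taken from its API docs.

```python
import bz2
import gzip
import os
import tempfile


def iter_lines(path):
    """Yield lines from a plain, .gz or .bz2 file without loading it whole.

    Hypothetical helper for illustration only; not part of smart_open.
    """
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    # Binary mode keeps the behaviour identical across the three openers.
    with opener(path, "rb") as fin:
        for line in fin:  # the file object decompresses lazily while iterating
            yield line


# Usage: write a small gzip file, then stream it back line by line.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt.gz")
with gzip.open(demo_path, "wb") as fout:
    fout.write(b"first line\nsecond line\n")

lines = list(iter_lines(demo_path))
print(lines)
```

The point of the pattern is that the caller writes one plain loop over lines, while the choice of opener (and, in smart_open's case, the storage backend) stays hidden behind a single call.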

Because iterating over all (or selected) keys in an S3 bucket is a very common operation, there is also a dedicated function, smart_open.s3_iter_bucket(), that does this efficiently by processing the bucket keys in parallel (using multiprocessing).
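
The idea of fetching many keys concurrently can be sketched without any S3 access. In the example below, a plain dict stands in for the bucket, and a thread pool from multiprocessing.pool is swapped in for worker processes so that the sketch stays self-contained. The function iter_bucket and its prefix, accept_key and workers parameters are assumptions for illustration, not the signature of smart_open.s3_iter_bucket().

```python
from multiprocessing.pool import ThreadPool

# Stand-in for an S3 bucket: key name -> content (an assumption for this sketch).
FAKE_BUCKET = {
    "foo/a.json": b'{"a": 1}',
    "foo/b.json": b'{"b": 22}',
    "bar/c.txt": b"skip me",
}


def iter_bucket(bucket, prefix="", accept_key=lambda key: True, workers=4):
    """Yield (key, content) pairs for matching keys, fetched concurrently.

    Hypothetical helper; a real version would download each key from S3.
    """
    keys = [k for k in sorted(bucket) if k.startswith(prefix) and accept_key(k)]
    with ThreadPool(processes=workers) as pool:
        # imap_unordered yields results as they finish, not in key order.
        for key, content in pool.imap_unordered(lambda k: (k, bucket[k]), keys):
            yield key, content


# Usage: collect all JSON "files" under the foo/ prefix.
results = sorted(
    iter_bucket(FAKE_BUCKET, prefix="foo/",
                accept_key=lambda key: key.endswith(".json"))
)
for key, content in results:
    print(key, len(content))
```

Results are sorted here only to make the output deterministic; with unordered parallel fetching, callers should not rely on the order in which keys arrive.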

For more information (S3 credentials in the URI, minimum S3 part size, and so on) and full method signatures, check out the API docs.

Why?

Working with large S3 files using Amazon's default Python library, boto, is a pain. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files, because the whole content is loaded into RAM with no streaming. boto's multipart upload functionality has nasty hidden gotchas and requires a lot of boilerplate.

smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make.
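
To make the contrast with whole-file-in-RAM methods concrete, here is a minimal sketch of the buffering that a streamed multipart upload involves: incoming chunks are accumulated and emitted as a part once a minimum size is reached. The function name iter_parts and the tiny min_part_size value are assumptions for illustration; this is neither smart_open's nor boto's code, and no network calls are made.

```python
import io


def iter_parts(chunks, min_part_size):
    """Group an iterable of byte chunks into parts of at least min_part_size.

    Only the final part may be smaller. Hypothetical helper, not a boto API.
    """
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
        if buf.tell() >= min_part_size:
            yield buf.getvalue()
            buf = io.BytesIO()  # start a fresh buffer for the next part
    if buf.tell():
        yield buf.getvalue()  # flush the remainder as the last part


# Usage: four small chunks grouped into parts of at least 5 bytes.
chunks = [b"aaaa", b"bb", b"cccccc", b"d"]
parts = list(iter_parts(chunks, min_part_size=5))
print(parts)
```

Memory use is bounded by roughly one part plus one incoming chunk, regardless of the total size of the stream; a real upload would use a much larger part size and send each part as it is produced.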

Installation

The module has no dependencies beyond Python >= 2.6 (or Python >= 3.3) and boto:

pip install smart_open

Or, if you prefer to install from the source tar.gz:

python setup.py test  # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install mock, moto and responses (https://github.com/getsentry/responses): pip install mock moto responses. The tests are also run automatically with Travis CI on every commit push and pull request.

Todo

smart_open is an ongoing effort. Suggestions, pull requests and improvements are welcome!

On the roadmap:

  • better documentation for the default file:// scheme

Comments, bug reports

smart_open lives on GitHub. You can file issues or pull requests there.


smart_open is open source software released under the MIT license. Copyright (c) 2015-now Radim Řehůřek.