fatiando/pooch

Add support for downloading from Google Cloud Storage

remrama opened this issue · 5 comments

Add a GCSDownloader that can fetch the data from Google Cloud Storage. It should support an authentication token, ideally with the option to read it from an environment variable.

See the matching feature requests for other cloud storage services: Amazon's AWS (#363) and Microsoft's Azure (#382).

This would require:

  • A new downloader (GCSDownloader) in pooch/downloaders.py (see https://www.fatiando.org/pooch/latest/downloaders.html and the existing downloaders). Make sure to add it to the choose_downloader function so that Pooch can automatically find it based on the prefix (gs).
  • The test data in our data folder uploaded to the storage so we can test that it works.
  • Tests in pooch/tests/test_downloaders.py that check if the download works and that any errors that should be raised are actually raised.
  • Example documentation, probably in https://www.fatiando.org/pooch/latest/protocols.html
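To make the first bullet concrete, here is a rough sketch of what such a downloader could look like. It is only an illustration, not a tested implementation: the class name `GCSDownloaderSketch` and the helper `split_gs_url` are made up for this example, and it assumes the google-cloud-storage package is installed and that `output_file` is a path rather than an open file object. The `__call__(url, output_file, pooch)` signature follows the convention used by the existing Pooch downloaders.

```python
def split_gs_url(url):
    """Split 'gs://bucket/path/to/blob' into (bucket_name, blob_name).

    Hypothetical helper, written only for this sketch.
    """
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError(f"Expected a URL starting with 'gs://', got '{url}'")
    bucket_name, _, blob_name = url[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"URL must include a bucket and a blob name: '{url}'")
    return bucket_name, blob_name


class GCSDownloaderSketch:
    """Hypothetical downloader for gs:// URLs (illustration only)."""

    def __init__(self, credentials=None):
        # Path to a service-account JSON file. If None, the client relies on
        # the GOOGLE_APPLICATION_CREDENTIALS environment variable.
        self.credentials = credentials

    def __call__(self, url, output_file, pooch, check_only=False):
        # Imported lazily so google-cloud-storage stays an optional dependency.
        from google.cloud import storage

        bucket_name, blob_name = split_gs_url(url)
        if self.credentials is None:
            client = storage.Client()
        else:
            client = storage.Client.from_service_account_json(self.credentials)
        blob = client.bucket(bucket_name).blob(blob_name)
        if check_only:
            # Only report whether the blob exists, without downloading it.
            return blob.exists()
        # Assumes output_file is a path, not an open file object.
        blob.download_to_filename(str(output_file))
        return None
```

An instance could then be passed as `downloader=` to `pooch.retrieve`. Hooking it into `choose_downloader` would additionally need the `gs` prefix mapped to the class, which is not shown here.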

I've got a fully functional GCSDownloader class in a fork, but without tests. It uses the google-cloud-storage package for authentication and downloading. The credentials can be passed to the downloader directly or read from an environment variable. It also supports the tqdm progress bar option.

# Authorize by setting an environment variable
import os
import pooch
credentials = "google_app_credentials.json"
url = "gs://bucket_name/blob_name.txt"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials
filename = pooch.retrieve(url, known_hash=None)

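# Authorize by passing credentials to the custom downloader
# (GCSDownloader is only available in the fork, not in released Pooch)
import pooch
from pooch import GCSDownloader
credentials = "google_app_credentials.json"
url = "gs://bucket_name/blob_name.txt"
downloader = GCSDownloader(credentials=credentials)
filename = pooch.retrieve(url, known_hash=None, downloader=downloader)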

I can't speak to long-term maintenance, but I would be interested in adding tests and submitting a PR within the next month.

Thanks @remrama! We'd be happy to have this.

Hi @remrama, we've been thinking quite a lot about whether we should add this to pooch itself or whether it would be better as a separate project that provides only the downloaders.

I think the main issue is testing all of this. With Zenodo and figshare we can be pretty certain that the test data will stay there in the long term. But with all of these cloud providers, can we trust that our test data will be there? Can we update it without the original uploader? Is it free?

I don't use these cloud storage services, so I don't know the answers to these.

@leouieda we are on the exact same page. My hold-up on this was all about trying to come up with the best way to run the tests. I don't think there's a way to properly run them without using a private Google account (including fees for the calls, even if small).

I was sitting on it, thinking a solution might pop up, and in the meantime I've been playing around with the Zenodo downloader. I've become very appreciative of this feature in pooch. I find the Zenodo (and figshare) downloaders to be incredibly convenient. And for people who are trying to download datasets that they can't make public, Zenodo even offers private repos. While this won't solve everyone's needs, I think the current DOI downloaders are sufficient for the practical minimalism of pooch.

I vote to exclude this feature. I think I'll just keep my GCSDownloader in a public fork or even just a Gist file and pull it down whenever I need it. At most, maybe you'd want to add an example in the docs showing this approach, but even that I'm not so sure about.

@remrama good to know! I also use the DOI downloaders quite a lot myself.

I was speaking with @santisoler about possibly creating some form of plugin system for Pooch downloaders. The idea would be that other packages can implement custom downloaders associated with different protocols and Pooch could find them and hook them up to the machinery that matches protocols in URLs to downloader classes. But this is a bit beyond what I have time for lately.
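As a very rough sketch of that idea, a protocol registry could look something like the code below. Nothing here exists in Pooch: `register_downloader`, `choose_registered_downloader`, and `EchoDownloader` are hypothetical names invented for this example, and a real plugin system would likely discover downloaders automatically (for instance through package entry points) instead of requiring an explicit registration call.

```python
# Hypothetical registry mapping a URL protocol (e.g. "gs") to a downloader class.
_DOWNLOADERS = {}


def register_downloader(protocol, factory):
    """Associate a protocol with a callable that builds a downloader."""
    _DOWNLOADERS[protocol.lower()] = factory


def choose_registered_downloader(url):
    """Return a downloader instance for the protocol found in the URL."""
    protocol, separator, _ = url.partition("://")
    if not separator:
        raise ValueError(f"No protocol found in URL '{url}'")
    try:
        factory = _DOWNLOADERS[protocol.lower()]
    except KeyError:
        raise ValueError(f"No downloader registered for '{protocol}'") from None
    return factory()


class EchoDownloader:
    """Stand-in downloader that only writes the URL to the output file."""

    def __call__(self, url, output_file, pooch):
        with open(output_file, "w") as output:
            output.write(url)


# A third-party package would register its downloader for its own protocol.
register_downloader("gs", EchoDownloader)
downloader = choose_registered_downloader("gs://bucket_name/blob_name.txt")
```

The lookup mirrors what `choose_downloader` does for the built-in protocols, except that the table of known protocols can be extended from outside Pooch.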

In the meantime, if you want help distributing your GCSDownloader class as a standalone package, we can help with that.

Sounds good, thanks. I'm not so familiar with the plug-in system, but it sounds like a good idea for this feature. As for my current plans for implementing the GCSDownloader, I don't really have any right now. The existing DOIDownloader has been satisfying all my needs. If I need a more accessible GCSDownloader again, I'll probably look back into these more convenient packaging options and revive this idea. It's not so far-fetched. I imagine I will need it at some point, I'm just not sure how soon.