Add support for downloading from Azure cloud storage
FlorisCalkoen opened this issue · 2 comments
Edit by @leouieda on 2024-02-19
Add an `AzureDownloader` that can fetch the data from Azure cloud storage. It should support an authentication token, ideally with the option to read it from an environment variable.
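Usage could look something like this (the `AzureDownloader` name and its parameters are hypothetical at this point, not an existing API):

```python
import pooch

# Hypothetical API: pass the credentials directly, or have the downloader
# read the SAS token from an environment variable.
downloader = pooch.AzureDownloader(
    account_name="storage_account_name",
    sas_token="...",  # or e.g. env_var="AZURE_STORAGE_SAS_TOKEN"
)
fname = pooch.retrieve(
    url="az://some/private/container/file.parquet",
    known_hash=None,
    downloader=downloader,
)
```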
Original issue 👇🏾
Description of the desired feature:
Would it be possible to add support for fetching data from private cloud containers?
```python
import os
import dotenv
import pooch
import pandas as pd

dotenv.load_dotenv(override=True)
sas_token = os.getenv("AZURE_STORAGE_SAS_TOKEN")

# Desired usage would be something like this, with the credentials passed
# through to the downloader:
storage_options = {"account_name": "storage_account_name", "account_key": sas_token}
href = "az://some/private/container/file.parquet"
fp = pooch.retrieve(href, known_hash=None, storage_options=storage_options)
pd.read_parquet(fp)

# This currently works for Azure (appending the SAS token to the URL), but
# I'm not sure if it's the best approach:
href = "az://some/private/container/file.parquet" + sas_token
fp = pooch.retrieve(href, known_hash=None)
pd.read_parquet(fp)
```
Are you willing to help implement and maintain this feature?
Maybe, yes!
@FlorisCalkoen I just made a custom Downloader like this, but for Google Cloud Storage. If it's useful to you, I linked it in a comment on a similar Issue thread (#363).
Hi @remrama @FlorisCalkoen @WesleyTheGeolien, I have 0 experience with cloud containers, but since multiple people have requested this we can look into it.
As @remrama said, this would be best implemented as a downloader. It could take the token as input, but could also take the name of an environment variable and do the reading for you.
From what I gather, each cloud has its own API for fetching the data, so they'd need separate implementations. Since Pooch is supposed to be a very lightweight dependency for other projects, any downloader that requires a new dependency would have to make that dependency optional. We already do this for SFTP, for example.
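To make that concrete, here is a minimal sketch of what such a downloader could look like, following the `(url, output_file, pooch)` call signature of Pooch's existing downloaders. The class name, its parameters, and the choice of `fsspec`/`adlfs` as the optional dependency are assumptions for illustration (they match the `az://` URLs and `storage_options` in the example above), not a settled design:

```python
import os


class AzureDownloader:  # hypothetical name from this issue
    """
    Sketch of a downloader for az:// URLs on Azure Blob Storage.

    Credentials can be given directly or read from an environment variable,
    and the cloud dependency is only imported when the downloader is called
    (the same pattern Pooch uses for the optional SFTP dependency).
    """

    def __init__(self, account_name, sas_token=None, env_var="AZURE_STORAGE_SAS_TOKEN"):
        self.account_name = account_name
        # Fall back to an environment variable if no token was given directly.
        self.sas_token = sas_token if sas_token is not None else os.environ.get(env_var)

    def __call__(self, url, output_file, pooch):
        # Import inside the call so fsspec/adlfs stay optional dependencies.
        try:
            import fsspec  # the "az" protocol also needs the 'adlfs' package
        except ImportError as error:
            raise ValueError(
                "Downloading from Azure requires the 'fsspec' and 'adlfs' "
                "packages to be installed."
            ) from error
        fs = fsspec.filesystem(
            "az", account_name=self.account_name, sas_token=self.sas_token
        )
        # Pooch's downloaders receive the local destination; assuming a file
        # path here (it can also be an open file object in some code paths).
        fs.get(url, output_file)
```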
I'll edit this issue and #363 to make them explicitly about AWS and Azure. @remrama, would you mind opening a new one for Google Cloud Storage and including the link to your code?
If either of you would like to implement this, that would be great! We'd need:
- A new downloader (`GCSDownloader`, `AWSDownloader`, `AzureDownloader`) in `pooch/downloaders.py` (see https://www.fatiando.org/pooch/latest/downloaders.html and the existing downloaders). Make sure to add it to the `choose_downloader` function so that Pooch can automatically find it based on the prefix (`az://` etc.).
- The test data in our `data` folder uploaded to the storage so we can test that it works.
- Tests in `pooch/tests/test_downloaders.py` that check if the download works and that any errors that should be raised are actually raised (see the rough test sketch after this list).
- Example documentation, probably in https://www.fatiando.org/pooch/latest/protocols.html
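A rough sketch of such a test, assuming the hypothetical `AzureDownloader` above, a placeholder test container holding the repository's `tiny-data.txt` file, and markers/helpers along the lines of those used by the existing tests in that module:

```python
import pytest

import pooch
from pooch.downloaders import AzureDownloader  # hypothetical, from this issue

from .utils import check_tiny_data  # assumed helper, as in existing downloader tests


@pytest.mark.network
def test_azure_downloader(tmp_path):
    "Fetch a file from the private Azure test container and check its contents."
    downloader = AzureDownloader(
        account_name="...",  # placeholder test account
        env_var="AZURE_STORAGE_SAS_TOKEN",
    )
    fname = pooch.retrieve(
        url="az://pooch-test-data/tiny-data.txt",  # placeholder container/path
        known_hash=None,
        downloader=downloader,
        path=tmp_path,
    )
    check_tiny_data(fname)
```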
Not sure what the pricing model is for these providers (which is why I never bothered with them), but if it's not possible to have our test data on them so that we can verify the functionality, then I think it's best to leave the downloader outside of Pooch itself.