duckdb/dbt-duckdb

Support aws webidentity token credentials when interacting with s3 on duckdb

nclaeys opened this issue · 8 comments

When looking through the code, I noticed that DuckDB itself does not support the default AWS credentials provider chain, probably because they do not want to take on external dependencies.

I am using dbt-duckdb, and for the Glue integration you correctly use the credential provider chain if credentials are not passed explicitly. I was wondering if you would also want to provide this for the S3 integration?

In addition to providing static credentials through the settings properties (as exists at the moment), you could use boto3 to fetch credentials and refresh them when they expire. These credentials can then be used in the DuckDBConnectionWrapper so that they are always valid and any authentication mechanism can be used.

Example use case for us:
we use AWS IRSA for our pods on Kubernetes, which gives us a role ARN and a web identity token file that gets rotated every hour. Before I can run dbt commands, I currently need to convert these AWS credentials into the format accepted by DuckDB, as follows:

import boto3

# Resolve credentials through the default provider chain (web identity in our case)
credentials = boto3.Session().get_credentials()

AWS_ACCESS_KEY_ID = credentials.access_key
AWS_SECRET_ACCESS_KEY = credentials.secret_key
AWS_SESSION_TOKEN = credentials.token

Only then can I execute dbt compile ...
Apart from the annoying entrypoint, this has the disadvantage that if my job takes longer than an hour, the credentials expire...
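
To make the use case concrete, the wrapper entrypoint currently looks roughly like this (just a sketch; it assumes the dbt-duckdb profile maps the standard AWS_* environment variables onto the existing static s3_* settings via env_var()):

import os
import subprocess

import boto3

# Resolve credentials through the default provider chain (IRSA web identity in our case)
creds = boto3.Session().get_credentials()

# Re-export them as the static values the profile currently expects
env = dict(
    os.environ,
    AWS_ACCESS_KEY_ID=creds.access_key,
    AWS_SECRET_ACCESS_KEY=creds.secret_key,
    AWS_SESSION_TOKEN=creds.token,
)

# Only now can dbt run; if it outlives the token, the credentials go stale mid-run
subprocess.run(["dbt", "compile"], env=env, check=True)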

I think it would be great if the dbt-duckdb project did this for me. Since you already depend on boto3, the impact is minimal and in line with what you already implemented for the Glue integration. What do you think?

@nclaeys ah, interesting! Can you talk me through how this would work? I'm imagining there would need to be a separate set of credential settings provided in the DuckDBCredentials object to signal that we need to use the AWS webidentity credentials, and when we create a connection, we would need to run some code to generate the credentials that DuckDB expects to use to pass to S3 (similar to what you're doing above?)

I think the best approach is not to depend too much on the type of credentials users pass; maybe use a property like use_credential_provider_chain=true or something. In that case, just use boto3 to create a session: boto3.Session(). Since that uses the credentials provider chain, you immediately support all ways of authenticating and getting a session. From that session, you can then use code similar to what I wrote above to get the ACCESS_KEY_ID, SECRET_ACCESS_KEY and SESSION_TOKEN and add them to the cursor before executing a query.
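
As a sketch, the connection-time hook could look something like this (use_credential_provider_chain is the hypothetical new profile flag; the SET statements are DuckDB's existing S3 settings):

import boto3

def apply_s3_credentials(cursor):
    # Resolve whatever the environment provides: static keys, web identity, instance profile, ...
    creds = boto3.Session().get_credentials().get_frozen_credentials()
    cursor.execute(f"SET s3_access_key_id='{creds.access_key}'")
    cursor.execute(f"SET s3_secret_access_key='{creds.secret_key}'")
    if creds.token:
        cursor.execute(f"SET s3_session_token='{creds.token}'")

# inside the cursor/connection wrapper, roughly:
#   if credentials.use_credential_provider_chain:
#       apply_s3_credentials(cursor)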

I should investigate whether the boto3 session handles refreshing the credentials when they expire, or whether that is up to us to handle ourselves.
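
One quick way to check, assuming the web identity provider hands back botocore's refreshable credential type (which re-resolves itself when its properties are read after expiry):

import boto3
from botocore.credentials import RefreshableCredentials

creds = boto3.Session().get_credentials()
# True for assume-role / web identity credentials, False for plain static keys
print(isinstance(creds, RefreshableCredentials))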

Does that make sense?

Roughly, it makes sense? This isn't a toolchain I've used before so I'm not especially familiar with how it's wired together.

I'm assuming there are equivalent concepts to this for GCP and Azure? I don't want to play favorites on cloud providers and I'd like to make such a thing as generic as possible.

Actually, @hamilton was running into a similar issue earlier with GCP creds (he mentioned it in the data tools Discord we hang out in). Is there a way I could make the experience of using your GCP service credentials with DuckDB more pleasant?

I fully understand that. The environment variables you set for using S3 with DuckDB are AWS-specific, and boto3 is specific to AWS.
I do not know GCP, but our product is deployed on Azure and AWS. Credentials in both clouds work differently and are difficult to unify.
In Azure you currently have two ways of working with Blob Storage:

  • a static key/secret
  • workload identity, which is a similar concept to IRSA on AWS.

I think that for this library it is best/easiest to abstract away from the cloud provider as much as possible. Most providers have a similar approach for getting an authenticated session, which means you do not have to worry too much about the different mechanisms they use underneath (a container on Kubernetes, a VM, running locally through the CLI, ...).

DuckDB's naming convention is clearly AWS-centric, but since everyone implements S3 as an API, you can change the s3_endpoint setting to make it work with GCP etc. https://duckdb.org/docs/guides/import/s3_import.html
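
For example, something like this should work against GCS (untested sketch; the HMAC key pair comes from the bucket's interoperability settings):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Point DuckDB's S3 client at the S3-compatible GCS endpoint
con.execute("SET s3_endpoint='storage.googleapis.com'")
con.execute("SET s3_access_key_id='<gcs-hmac-access-key>'")
con.execute("SET s3_secret_access_key='<gcs-hmac-secret>'")

print(con.execute("SELECT count(*) FROM read_parquet('s3://my-bucket/some-file.parquet')").fetchall())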

Ah, nice that it also works on GCP. I fear it does not on Azure though; I will check this later.

If both GCP and AWS work like this, you could create an interface for the credentials, just exposing access_key, secret_key and token. Then use something similar to the Glue extension to detect the correct credentials depending on the cloud.
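
As a sketch of what that interface could look like (names are made up; only the AWS variant is filled in):

from abc import ABC, abstractmethod
from typing import Optional

import boto3

class ObjectStoreCredentials(ABC):
    """Cloud-agnostic view of the three values DuckDB's S3 settings need."""

    @property
    @abstractmethod
    def access_key(self) -> str: ...

    @property
    @abstractmethod
    def secret_key(self) -> str: ...

    @property
    @abstractmethod
    def token(self) -> Optional[str]: ...

class AwsChainCredentials(ObjectStoreCredentials):
    """Resolved through boto3's default provider chain (static keys, IRSA, instance profile, ...)."""

    def __init__(self):
        self._creds = boto3.Session().get_credentials()

    @property
    def access_key(self) -> str:
        return self._creds.access_key

    @property
    def secret_key(self) -> str:
        return self._creds.secret_key

    @property
    def token(self) -> Optional[str]:
        return self._creds.token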

You could offer extras like dbt-duckdb[aws] or dbt-duckdb[gcloud], which load boto3 or the gcloud equivalent to get the necessary credentials...
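
The selection could then key off whichever optional dependency is installed, again just a sketch (the extras do not exist yet):

def detect_object_store_credentials():
    # dbt-duckdb[aws] would pull in boto3, dbt-duckdb[gcloud] the GCP equivalent
    try:
        import boto3  # noqa: F401
        return AwsChainCredentials()
    except ImportError:
        pass
    try:
        import google.auth  # noqa: F401
        raise NotImplementedError("GCP variant of ObjectStoreCredentials would go here")
    except ImportError:
        pass
    return None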

Still very rough, and I need to test whether it actually does what I want, but this commit contains the idea I had:
nclaeys@32d6ddc