e2fyi/kubeflow-aws

Deployment of ml-pipeline not working with IAM roles (kube2iam)

prcastro opened this issue

I used the IAM overlay to configure my Kubeflow deployment to use my S3 bucket. After configuring just the bucket, prefix, region, and role, I get this from the ml-pipeline deployment:

$ kubectl logs -n kubeflow ml-pipeline-7cd7f6678d-hm89c -f
I0115 14:54:17.589476       8 client_manager.go:127] Initializing client manager
[mysql] 2020/01/15 14:54:18 packets.go:427: busy buffer
[mysql] 2020/01/15 14:54:18 packets.go:408: busy buffer
E0115 14:54:18.517982       8 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
[mysql] 2020/01/15 14:54:18 packets.go:427: busy buffer
[mysql] 2020/01/15 14:54:18 packets.go:408: busy buffer
E0115 14:54:18.519865       8 db_status_store.go:71] Failed to commit transaction to initialize database status table
[mysql] 2020/01/15 14:54:18 packets.go:427: busy buffer
[mysql] 2020/01/15 14:54:18 packets.go:408: busy buffer
E0115 14:54:18.521396       8 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
F0115 14:55:02.668289       8 client_manager.go:311] Failed to create Minio bucket. Error: Get http://s3.amazonaws.com:443/my-bucket/?location=: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x00\x00\x00\x02\x01\x00"

I also tried changing the port to 80, but the following error appeared:

I0115 16:14:22.433511       8 client_manager.go:127] Initializing client manager
[mysql] 2020/01/15 16:14:22 packets.go:427: busy buffer
[mysql] 2020/01/15 16:14:22 packets.go:408: busy buffer
E0115 16:14:22.672748       8 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
[mysql] 2020/01/15 16:14:22 packets.go:427: busy buffer
[mysql] 2020/01/15 16:14:22 packets.go:408: busy buffer
E0115 16:14:22.677105       8 db_status_store.go:71] Failed to commit transaction to initialize database status table
[mysql] 2020/01/15 16:14:22 packets.go:427: busy buffer
[mysql] 2020/01/15 16:14:22 packets.go:408: busy buffer
E0115 16:14:22.680739       8 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
F0115 16:14:22.699166       8 client_manager.go:311] Failed to create Minio bucket. Error: The AWS Access Key Id you provided does not exist in our records.

The default configuration seems to work without problems. My guess is that ml-pipeline doesn't support IAM auth. I was led to believe this because it instantiates the MinIO client by explicitly passing the accessKey and the secretKey:

https://github.com/kubeflow/pipelines/blob/dc34a3568d79dd96c908703869596dcf6514bf52/backend/src/apiserver/client/minio.go#L29-L30

Do you know if ml-pipeline supports IAM? If so, how did you achieve IAM authentication in your cluster?

Hi.
You are absolutely right. I missed this.

I haven't tried out this set of manifests, because the actual manifests I use for our clusters are inherited from the main kubeflow manifests with significant cluster-specific modifications.

I haven't switched ml-pipeline to IAM yet because of our strict bucket policy (the pipeline folder is hardcoded, and it doesn't meet our policies). I am still running the minio service to store the pipeline templates until my PR to update ml-pipeline is approved and merged. But IAM should work for the rest.

Note that there is a bug in ml-pipeline-ui where the IAM session token is not refreshed; I just fixed it in v0.1.40.

But otherwise it should work, unless I missed something.

This is a good catch. I will update my PR to use the credential provider chain instead.

I can probably provide a forked image with the change if you need it. I think this will take some time, as they seem to be quite busy reviewing my PR.

You can track my PR: kubeflow/pipelines#2080

Can this bug in ml-pipeline-ui prevent metrics from appearing in the interface?

@prcastro

ml-pipeline-ui

It should work the first time, but will fail after the session expires. This applies to any S3 artifacts retrieved through the UI (i.e. Argo artifacts and pod logs).

However, if you mean metrics from the metadata server, those are not affected, as they are stored in a database, not in S3. Only artifacts stored in S3 (i.e. minio) use my modified client.

But this should be fixed in v0.1.40. We have added unit tests and refactored the code a bit to be cleaner.

ml-pipeline

You can check out https://hub.docker.com/repository/docker/e2forks/ml-pipeline, which is my forked build of ml-pipeline with the new fix.

I have updated the PR to use a chained credentials provider. It tries, in order: the API key in config.json -> minio env vars -> aws env vars -> IAM.

I have set up an automated build for this forked branch (for the PR). You should see the build soon.

If you try it, please tell me whether it fixes the issue. I also added flags for region and secure to the minio client.

@prcastro

e2forks/ml-pipeline:iam should work now. However, because config.json has priority over env variables, you need to change the manifest slightly: create a ConfigMap to overwrite the default config.json so that the access key will not be used (allowing it to fall back to IAM).

I will make a commit later to update the manifest.

I made some changes to the kustomize setup (changed the MINIO_SERVICE_SERVICE_PORT env var, the image, and the config.json file), and the resulting ml-pipeline Deployment was this:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  annotations:
    iam.amazonaws.com/role: my-role
  labels:
    app: ml-pipeline
  name: ml-pipeline
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: ml-pipeline
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: my-role
      labels:
        app: ml-pipeline
    spec:
      containers:
      - env:
        - name: OBJECTSTORECONFIG_BUCKETNAME
          value: my-bucket
        - name: MINIO_SERVICE_SERVICE_HOST
          value: s3.amazonaws.com
        - name: MINIO_SERVICE_SERVICE_PORT
          value: "80"
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: e2forks/ml-pipeline:iam
        imagePullPolicy: IfNotPresent
        name: ml-pipeline-api-server
        ports:
        - containerPort: 8888
        - containerPort: 8887
        volumeMounts:
        - mountPath: /config/config.json
          name: config-volume
          subPath: config.json
      serviceAccountName: ml-pipeline
      volumes:
      - configMap:
          name: ml-pipeline-config
        name: config-volume

The ml-pipeline-config ConfigMap is defined as:

apiVersion: v1
data:
  config.json: |
    {
        "DBConfig": {
            "DriverName": "mysql",
            "DataSourceName": "",
            "DBName": "mlpipeline",
            "GroupConcatMaxLen": "4194304"
        },
        "ObjectStoreConfig": {
            "AccessKey": "minio",
            "SecretAccessKey": "minio123",
            "BucketName": "mlpipeline",
            "PipelineFolder": "pipelines"
        },
        "InitConnectionTimeout": "6m",
        "DefaultPipelineRunnerServiceAccount": "pipeline-runner"
    }
kind: ConfigMap
metadata:
  name: ml-pipeline-config
  namespace: kubeflow

The result was basically the same problem:

I0116 22:03:31.753685       7 client_manager.go:136] Initializing client manager
I0116 22:03:31.753815       7 config.go:45] Config DBConfig.ExtraParams not specified, skipping
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.888743       7 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.891798       7 db_status_store.go:71] Failed to commit transaction to initialize database status table
[mysql] 2020/01/16 22:03:31 packets.go:427: busy buffer
[mysql] 2020/01/16 22:03:31 packets.go:408: busy buffer
E0116 22:03:31.894787       7 default_experiment_store.go:73] Failed to commit transaction to initialize default experiment table
F0116 22:03:31.911792       7 client_manager.go:342] Failed to create Minio bucket. Error: The AWS Access Key Id you provided does not exist in our records.

Am I missing something?

The config.json has precedence over environment variables.

You need to set the access key to an empty string before you can use IAM, because if the accessKey is set, it will be used instead.

IAM is the last fallback.

You can now also set the port to an empty string, and set the protocol via the MINIO_SERVICE_SECURE flag. See the sketch below.
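For reference, a minimal sketch of the ConfigMap override, reusing the config.json schema from the manifest above (my-bucket is a placeholder, and "true" for MINIO_SERVICE_SECURE is an assumed value for HTTPS):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-pipeline-config
  namespace: kubeflow
data:
  config.json: |
    {
        "DBConfig": {
            "DriverName": "mysql",
            "DataSourceName": "",
            "DBName": "mlpipeline",
            "GroupConcatMaxLen": "4194304"
        },
        "ObjectStoreConfig": {
            "AccessKey": "",
            "SecretAccessKey": "",
            "BucketName": "my-bucket",
            "PipelineFolder": "pipelines"
        },
        "InitConnectionTimeout": "6m",
        "DefaultPipelineRunnerServiceAccount": "pipeline-runner"
    }

And the corresponding env vars on the ml-pipeline container:

        - name: MINIO_SERVICE_SERVICE_PORT
          value: ""      # empty string, per the comment above
        - name: MINIO_SERVICE_SECURE
          value: "true"  # assumed value for HTTPS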

Sorry about that! I fixed those issues and now it is working. The only problem I'm getting now: when I open a successfully executed operation in a pipeline and click an S3 link in the Inputs/Outputs tab, I get the following error:

Failed to get object in bucket my-bucket at path runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-pipeline/my-op-metrics.tgz: S3Error: Access Denied

Who makes this request? The ml-pipeline-ui? I found the following code:

https://github.com/kubeflow/pipelines/blob/master/frontend/server/handlers/artifacts.ts#L133

But it seems that the UI is already using IAM roles to authenticate.

Checking the ml-pipeline-ui logs, it is indeed receiving a request:

...
GET /pipeline/artifacts/get?source=s3&bucket=my-bucket&key=runs%2F743f72cc-c331-4548-81b8-3fcd612c552a%2Fmy-pipeline%2Fmy-op-metrics.tgz
Getting storage artifact at: s3: my-bucket/runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-pipeline/my-op-metrics.tgz

Did you set the access key for the UI to be empty? It follows the same behavior: if a minio access key is provided, it will be used.

    MINIO_ACCESS_KEY = ''
    MINIO_SECRET_KEY = ''

Because by default, it is provided.
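On the ml-pipeline-ui Deployment, that would look something like this (a sketch; only the env section of the container spec is shown):

        env:
        - name: MINIO_ACCESS_KEY
          value: ""  # empty, so static minio credentials are not used
        - name: MINIO_SECRET_KEY
          value: ""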

I ssh'ed into the UI container, and these are the MINIO_* env vars I found:

/server # env | grep MINIO
MINIO_SERVICE_PORT_9000_TCP=...
MINIO_SERVICE_PORT=...
MINIO_SERVICE_SERVICE_PORT=9000
MINIO_NAMESPACE=kubeflow
MINIO_SERVICE_PORT_9000_TCP_ADDR=...
MINIO_SERVICE_PORT_9000_TCP_PORT=...
MINIO_SERVICE_PORT_9000_TCP_PROTO=...
MINIO_SERVICE_SERVICE_HOST=...

I also didn't find any AWS_* env vars there. I checked the file produced by Kustomize, and none of those env vars appear in the UI Deployment. Anyway, I'll try setting them manually and see how it goes.

I tested setting the env vars (both the MinIO ones and the AWS ones) and the problem persists. The IAM role the UI is using is the same one I used to write the artifacts, so it should work.

Sorry, I was confused; the UI handles it differently. There are separate minio and AWS configs. And yes, by default the AWS config is empty, so it will fall back to IAM.

Can you check whether the file is actually saved to the bucket?

And does your IAM role have the getObject permission?

Also, the artifact key looks suspiciously wrong: it usually should have a folder before it, instead of starting with a run ID.

OK, I think I found the bug. It is in ml-pipeline.

I introduced some changes which broke the path for another part of the code, so the paths are resolved wrongly.

Let me see if I can fix it. Meanwhile, can you look in your S3 bucket and try querying the UI with the correct key?

OK, I think I fixed it. Building the image and trying again.

Meanwhile, can you look in your S3 bucket and try querying the UI with the correct key?

The path in the S3 bucket seems to be fine:

$ aws s3 ls s3://my-bucket/runs/743f72cc-c331-4548-81b8-3fcd612c552a/my-container/my-op-metrics.tgz

2020-01-16 20:06:38        160 my-op-metrics.tgz

This is exactly the same path that appears in the UI.

Testing the new image

The same error is happening. I checked whether the kube2iam pod on the same node logged anything when ml-pipeline-ui requested credentials, but that doesn't seem to happen (even for the ml-pipeline requests, which do work).

I also checked, and the role seems to allow getObject on this bucket.

Tomorrow I'll try to debug it further.

@prcastro I just updated the manifest to v0.2.3 of Kubeflow Pipelines.

IAM should work now, as my PR has gone in.

I tried to test, but I'm struggling with #7. I'll wait for a fix and then try again.

Sorry for the trouble. This is what happens when you code 24 hours straight.
Finally fixed everything.

Bugs fixed:

  • ml-pipeline API endpoint is properly set in the UI

  • metadata envoy endpoint is set in the UI

  • Argo is configured in namespace mode instead of cluster mode

  • Docker entry point is fixed

  • MySQL service selector is fixed

The default namespace is now also set to kubeflow.

We have configured this setup using kube2iam for multiple namespaces on our EKS cluster. Let us know if you still face any issues.
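For anyone replicating this: with kube2iam, restricting which roles each namespace may assume is done by annotating the namespace, roughly like the sketch below (the role ARN is a placeholder, and kube2iam must be started with its --namespace-restrictions flag):

apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  annotations:
    # placeholder ARN; only honored when kube2iam runs with --namespace-restrictions
    iam.amazonaws.com/allowed-roles: |
      ["arn:aws:iam::123456789012:role/my-role"]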

@lzuwei why do you use kube2iam on EKS? Doesn't the service account IAM support meet your needs?

Mostly because IAM roles for service accounts on EKS were only introduced a few months ago.

We probably should migrate, but it is not high priority at the moment.

It should be trivial for you to adapt this repo for IAM roles for service accounts. You just need to annotate the ml-pipeline-ui and ml-pipeline service accounts, and update the tensorboard pod definition template with the appropriate service account. See the sketch below.
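For example, a sketch of that annotation (the role ARN is a placeholder; the same annotation goes on the ml-pipeline-ui service account):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-pipeline
  namespace: kubeflow
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-role  # placeholder ARN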

If I can find time, I will add one more overlay for IAM service accounts. You are welcome to make a PR too.