Scripts and Infrastructure as Code (IaC) used for the talk titled "Effective MLOps with Kubeflow on AWS". It sets up all the resources needed by the Kubeflow on AWS distribution.
node >= 16.15.1
python >= 3.8
aws-cdk >= 2.28.1
$ make install
$ cd infrastructure && npm install
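Before going further, you can quickly sanity-check the prerequisite versions (a minimal check, assuming the AWS CDK CLI is installed globally - otherwise run npx cdk --version inside infrastructure/; exact pins live in the .tool-versions file):
$ node --version     # expect >= 16.15.1
$ python3 --version  # expect >= 3.8
$ cdk --version      # expect >= 2.28.1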
This is a one-time command per region (e.g., eu-west-1 or us-east-1):
$ cd infrastructure
$ npm run bootstrap
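Under the hood, npm run bootstrap most likely wraps the standard AWS CDK bootstrap command, so the manual equivalent would be something along these lines (account ID is a placeholder):
$ npx cdk bootstrap aws://<YOUR_ACCOUNT_ID>/eu-west-1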
After that (or if you already have bootstrapped AWS CDK in that account and region), we can deploy shared resources:
$ npm run deploy-shared-regional-infrastructure
$ npm run deploy-shared-global-infrastructure
Now we can push the code to the AWS CodeCommit repository that will be used in the 4th step (it's an optional step, see below):
# Push the code to the repository newly created in AWS CodeCommit by the regional infrastructure stack:
$ git remote add aws "...<PUT HERE AWS CODECOMMIT URL FROM CDK DEPLOY OUTPUT>..."
$ git push aws main
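If you no longer have the cdk deploy output at hand, the clone URL can also be looked up in the stack outputs. A hedged example, assuming the regional stack is the KubeflowOnAWS-Shared-Infrastructure stack referenced later in this guide:
$ aws cloudformation describe-stacks --stack-name KubeflowOnAWS-Shared-Infrastructure --region eu-west-1 --query 'Stacks[0].Outputs' --output table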
Then we can deploy the remaining stacks:
$ npm run deploy
The IDE and working environment stack is used purely for demonstration purposes: it prepares a unified environment for the Kubeflow configuration and deployment. If you have all the prerequisites available locally, feel free to skip this particular stack. See also the .tool-versions file for the exact versions used in this project.
A few things need to be done to use the AWS Cloud9 environment properly:
- Disable Temporary AWS Credentials in the Preferences, inside the AWS Settings tab (you can verify the effective identity with the check shown after this list).
- In the same place, enable Auto-Saving Files on focus change in the Experimental tab.
- Show Hidden Files in the sidebar inside the AWS Cloud9 IDE (click the gear icon on the sidebar).
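To confirm that the temporary credentials are indeed disabled and the environment uses the identity you expect, a quick optional check:
$ aws sts get-caller-identity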
If you are not using AWS Cloud9 from the 4th step, please make sure you've cloned awslabs/kubeflow-manifests locally, as it is required for the next steps (see the sketch below).
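A minimal sketch of getting the manifests locally; the exact branch or release tag depends on the Kubeflow on AWS version you are deploying, and the repository should be placed so that the relative path used later (cd ../../kubeflow-manifests) resolves:
$ git clone https://github.com/awslabs/kubeflow-manifests.git
$ (cd kubeflow-manifests && git checkout <RELEASE_TAG_OR_BRANCH>)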
Now, we can proceed with the deployment steps.
As a first step we have to log in to the cluster and update the kubeconfig (you can find the exact command in the output values of the AWS CloudFormation stack with Amazon EKS):
$ aws eks update-kubeconfig --name shared-eks-cluster --region eu-west-1 --role-arn arn:aws:iam::...
$ kubectl create namespace kubeflow
$ kubectl create namespace amazon-cloudwatch
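It does not hurt to double-check that kubectl points at the freshly configured cluster before applying anything:
$ kubectl config current-context
$ kubectl get nodes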
Then, we need to create IRSA for AWS SSM and AWS Secrets Manager, and then install Kubernetes Secrets Store CSI Driver:
$ eksctl create iamserviceaccount --name kubeflow-secrets-manager-sa --namespace kubeflow --cluster ${CLUSTER_NAME} --attach-policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess --attach-policy-arn arn:aws:iam::aws:policy/SecretsManagerReadWrite --override-existing-serviceaccounts --approve --region ${AWS_REGION}
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretproviderclass.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/csidriver.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store.csi.x-k8s.io_secretproviderclasspodstatuses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/secrets-store-csi-driver.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/secrets-store-csi-driver/v1.0.0/deploy/rbac-secretprovidersyncing.yaml
$ kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/deployment/aws-provider-installer.yaml
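To verify that the driver registered correctly, you can inspect the CSIDriver object and the related pods (an optional sanity check):
$ kubectl get csidriver secrets-store.csi.k8s.io
$ kubectl get pods -n kube-system | grep -i secrets-store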
Next, we will create IRSA for the Application Load Balancer (ALB) controller, but first we need to add a missing tag (kubernetes.io/cluster/<CLUSTER_NAME>) to the cluster subnets:
$ export TAG_VALUE=owned
$ export CLUSTER_SUBNET_IDS=$(aws ec2 describe-subnets --region ${AWS_REGION} --filters Name=tag:aws:cloudformation:stack-name,Values=KubeflowOnAWS-Shared-Infrastructure --output json | jq -r '.Subnets[].SubnetId')
$ for i in ${CLUSTER_SUBNET_IDS}; do aws ec2 create-tags --resources ${i} --tags Key=kubernetes.io/cluster/${CLUSTER_NAME},Value=${TAG_VALUE}; done
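A hedged way to confirm that the tag landed on every subnet (reusing the variables set above):
$ aws ec2 describe-subnets --region ${AWS_REGION} --subnet-ids ${CLUSTER_SUBNET_IDS} --query "Subnets[].Tags[?Key=='kubernetes.io/cluster/${CLUSTER_NAME}']" --output json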
$ cd ../../kubeflow-manifests
$ export LBC_POLICY_NAME="alb_ingress_controller_${AWS_REGION}_${CLUSTER_NAME}"
$ export LBC_POLICY_ARN=$(aws iam create-policy --policy-name ${LBC_POLICY_NAME} --policy-document file://./awsconfigs/infra_configs/iam_alb_ingress_policy.json --output text --query 'Policy.Arn')
$ eksctl create iamserviceaccount --name aws-load-balancer-controller --namespace kube-system --cluster ${CLUSTER_NAME} --region ${AWS_REGION} --attach-policy-arn ${LBC_POLICY_ARN} --override-existing-serviceaccounts --approve
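The created ServiceAccount should now carry the IAM role annotation used by IRSA; this is also the first thing to check when debugging the AccessDenied issue mentioned at the end of this guide:
$ kubectl describe serviceaccount aws-load-balancer-controller -n kube-system | grep role-arn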
We also need IRSA for Amazon CloudWatch and Fluent Bit:
$ eksctl create iamserviceaccount --name cloudwatch-agent --namespace amazon-cloudwatch --cluster ${CLUSTER_NAME} --region ${AWS_REGION} --approve --override-existing-serviceaccounts --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
$ eksctl create iamserviceaccount --name fluent-bit --namespace amazon-cloudwatch --cluster ${CLUSTER_NAME} --region ${AWS_REGION} --approve --override-existing-serviceaccounts --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
Then, we will install Fluent Bit, which is integrated with Amazon CloudWatch Container Insights:
$ export FluentBitHttpPort='2020'
$ export FluentBitReadFromHead='Off'
$ [[ ${FluentBitReadFromHead} = 'On' ]] && export FluentBitReadFromTail='Off' || export FluentBitReadFromTail='On'
$ [[ -z ${FluentBitHttpPort} ]] && export FluentBitHttpServer='Off' || export FluentBitHttpServer='On'
$ curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${CLUSTER_NAME}'/;s/{{region_name}}/'${AWS_REGION}'/;s/{{http_server_toggle}}/"'${FluentBitHttpServer}'"/;s/{{http_server_port}}/"'${FluentBitHttpPort}'"/;s/{{read_from_head}}/"'${FluentBitReadFromHead}'"/;s/{{read_from_tail}}/"'${FluentBitReadFromTail}'"/' | kubectl apply -f -
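Optionally, confirm that the CloudWatch agent and Fluent Bit daemonsets came up in the amazon-cloudwatch namespace:
$ kubectl get daemonsets -n amazon-cloudwatch
$ kubectl get pods -n amazon-cloudwatch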
Now, we have to configure all the settings:
$ export REGIONAL_CERT_ARN="<YOUR_REGIONAL_WILDCARD_CERTIFICATE_ARN>"
$ printf 'clusterName='$CLUSTER_NAME'' > ./awsconfigs/common/aws-alb-ingress-controller/base/params.env
$ printf 'certArn='$REGIONAL_CERT_ARN'' > ./awsconfigs/common/istio-ingress/overlays/https/params.env
$ printf 'loadBalancerScheme=internet-facing' > ./awsconfigs/common/istio-ingress/base/params.env
$ export CognitoUserPoolArn="<YOUR_USER_POOL_ARN>"
$ export CognitoAppClientId="<YOUR_APP_CLIENT_ID>"
$ export CognitoUserPoolDomain="<YOUR_USER_POOL_DOMAIN>"
$ export signOutURL="<YOUR_SIGN_OUT_URL>"
$ export CognitoLogoutURL="https://$CognitoUserPoolDomain/logout?client_id=$CognitoAppClientId&logout_uri=$signOutURL"
$ printf '
CognitoUserPoolArn='$CognitoUserPoolArn'
CognitoAppClientId='$CognitoAppClientId'
CognitoUserPoolDomain='$CognitoUserPoolDomain'
certArn='$REGIONAL_CERT_ARN'
' > ./awsconfigs/common/istio-ingress/overlays/cognito/params.env
$ printf 'LOGOUT_URL='$CognitoLogoutURL'' > ./awsconfigs/common/aws-authservice/base/params.env
$ export RDS_SECRET="<YOUR_RDS_SECRET_ARN>"
$ yq e -i '.spec.parameters.objects |= sub("rds-secret",env(RDS_SECRET))' ./awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml
$ export S3_SECRET="<YOUR_S3_SECRET_ARN>"
$ yq e -i '.spec.parameters.objects |= sub("s3-secret",env(S3_SECRET))' ./awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml
$ export DATABASE_HOST="<YOUR_RDS_HOSTNAME>"
$ printf '
dbHost='${DATABASE_HOST}'
mlmdDb=metadata_db
' > ./awsconfigs/apps/pipeline/rds/params.env
$ export BUCKET_NAME="<YOUR_BUCKET_NAME>"
$ printf '
bucketName='${BUCKET_NAME}'
minioServiceHost=s3.amazonaws.com
minioServiceRegion='${AWS_REGION}'
' > ./awsconfigs/apps/pipeline/s3/params.env
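Before building the manifests, it is worth eyeballing the generated configuration to catch typos early; a quick, optional check of the files produced above:
$ cat ./awsconfigs/common/aws-alb-ingress-controller/base/params.env
$ cat ./awsconfigs/common/istio-ingress/overlays/cognito/params.env
$ cat ./awsconfigs/apps/pipeline/rds/params.env ./awsconfigs/apps/pipeline/s3/params.env
$ grep -n "${RDS_SECRET}" ./awsconfigs/common/aws-secrets-manager/rds/secret-provider.yaml
$ grep -n "${S3_SECRET}" ./awsconfigs/common/aws-secrets-manager/s3/secret-provider.yaml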
And we can finally build and apply all the changes:
$ while ! kustomize build ./docs/deployment/cognito-rds-s3 | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
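The loop retries because some of the resources depend on CRDs that are not registered yet on the first pass. Once it finishes, everything should eventually reach the Running state and the Istio ingress should get an ALB address assigned:
$ kubectl get pods -n kubeflow
$ kubectl get pods -n istio-system
$ kubectl get ingress -n istio-system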
Last, but not least:
- Update the placeholder DNS record in your custom domain with the ALB alias: create a *.platform.${ROOT_DOMAIN} entry pointing at the address returned by kubectl get ingress -n istio-system.
- If you don't see any ALB being created, and the errors in the logs mention AccessDenied, make sure that the ServiceAccount for the ALB controller has the proper annotation pointing to the IAM role created in the IRSA step. Something along those lines:
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::...
- Create a user in the Cognito user pool (a CLI sketch for this is shown after this list).
- Remember the password used when creating the user; it will be needed later when logging in to the Kubeflow UI.
- Create a profile for the user from the user pool.
  apiVersion: kubeflow.org/v1beta1
  kind: Profile
  metadata:
    # Replace with the name of the profile you want; this will be the user's namespace name.
    name: namespace-for-my-user
    namespace: kubeflow
  spec:
    owner:
      kind: User
      # Replace with the email of the user.
      name: your@email.com
- And finally, you can log in to the central dashboard at https://kubeflow.platform.${ROOT_DOMAIN}/.
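If you prefer the CLI over the AWS Console for the Cognito part, a hedged sketch of creating that user could look as follows (user pool ID, e-mail, and password are placeholders):
$ aws cognito-idp admin-create-user --user-pool-id <YOUR_USER_POOL_ID> --username your@email.com
$ aws cognito-idp admin-set-user-password --user-pool-id <YOUR_USER_POOL_ID> --username your@email.com --password '<YOUR_PASSWORD>' --permanent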
I am Wojciech Gawroński (AWS Maniac) - in case of any questions, you can drop me a line over email.