

Firemind MLOps Toolkit Framework

Report an issue · Submit a feature request

Table of Contents
  1. About The Project
  2. Initial Setup
    1. Dependencies
    2. Prerequisites
    3. Deploy IAM Role and OIDC Identity Provider
    4. GitHub Workflows
    5. Infrastructure Setup
    6. Infrastructure Setup (Alternative) - CodePipeline
  3. Sample Workflow Execution Details
  4. TL;DR Quick Setup

TL;DR: This framework allows you to get started with an ML landing zone on AWS using native services and SageMaker models. The example in this framework uses SageMaker's prebuilt scikit-learn and TensorFlow models with example data from AWS.

MLOps Toolkit is part of Firemind's Modern Data Strategy tools


Key AWS Services

  • AWS CodePipeline
  • AWS Step Functions
  • AWS Identity and Access Management (IAM) Roles
  • AWS Key Management Service (AWS KMS)
  • Amazon Simple Storage Service (Amazon S3) Buckets
  • AWS Systems Manager Parameter Store (SSM) Parameters
  • AWS Lambda
  • Amazon SageMaker

Infrastructure Diagram

Architecture

Initial Setup

Dependencies

Please complete these steps before continuing.

  1. Install Python 3.9. This can be done through the installer here (select a stable release of Python 3.9).
  2. Install Node.js with npm (install instructions are in the official repo here).
  3. Install Poetry (instructions here).

Prerequisites

Ensure your CLI has correct credentials to access the AWS account you want this framework deployed to.

To use this framework, create an empty remote repo in your GitHub organisation, clone a copy of this repo, and push it to your remote.
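
A minimal sketch of those commands, using placeholder URLs that you should replace with your own:

# Clone a copy of this toolkit (replace <toolkit-repo-url> with this repository's URL)
git clone <toolkit-repo-url> ml-toolkit
cd ml-toolkit

# Check out the development branch (create it with -b if it does not already exist)
git checkout development

# Point the clone at your new, empty remote and push (replace <your-remote-url>)
git remote set-url origin <your-remote-url>
git push -u origin development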

Navigate to the github-oidc-federation-template-infra.yml file and add a default value for:

  • GitHubOrg: This should be the name of the organisation where your repo exists.
  • FullRepoName: The name of the repo which has a copy of this infrastructure.

Add the following secrets to your remote repository (a sketch using the GitHub CLI follows the list):

  • AWS_REGION: <e.g. eu-west-1>.
  • S3_TERRAFORM_STATE_REGION: <e.g. eu-west-1>.
  • S3_TERRAFORM_STATE_BUCKET: ml-core-<account_id>-state-bucket.
  • ACTION_IAM_ROLE: arn:aws:iam::<account_id>:role/GithubActionsDeployInfra.
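
As an illustration only, the same secrets can be set from the terminal with the GitHub CLI, assuming gh is installed and authenticated (replace the placeholder values):

gh secret set AWS_REGION --repo <org>/<repo> --body "eu-west-1"
gh secret set S3_TERRAFORM_STATE_REGION --repo <org>/<repo> --body "eu-west-1"
gh secret set S3_TERRAFORM_STATE_BUCKET --repo <org>/<repo> --body "ml-core-<account_id>-state-bucket"
gh secret set ACTION_IAM_ROLE --repo <org>/<repo> --body "arn:aws:iam::<account_id>:role/GithubActionsDeployInfra"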

Further details can be found in the sections below.

Deploy IAM Role and OIDC Identity Provider

The first step is to deploy a GitHub Actions Role and GitHub OIDC identity provider in the account that allows you to run GitHub actions for the infrastructure.

Note: This only needs to be run once per AWS account. Details on this can be found here: https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions

  • Important Note: Always check whether an identity provider already exists for your project; this can be found within the AWS IAM console (a CLI check is sketched below).
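
A quick way to check from the terminal (a convenience only; the IAM console works equally well):

# List OIDC identity providers in the account and look for token.actions.githubusercontent.com
aws iam list-open-id-connect-providers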

Run the following command in the terminal (you can change the stack name and region):

aws cloudformation deploy --template-file github-oidc-federation-template-infra.yml --stack-name app-authorisation-infra-github-authentication --region eu-west-1 --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
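
Once the stack has finished deploying, you can optionally confirm its status from the terminal:

aws cloudformation describe-stacks --stack-name app-authorisation-infra-github-authentication --query "Stacks[0].StackStatus"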

GitHub Workflows

GitHub Actions is used to deploy the infrastructure. The config for this can be found in the .github/workflows directory.

We pass through a number of environment variables:

  • BUILD_STAGE - Derived from the branch name (see the sketch after this list).
  • S3_TERRAFORM_STATE_BUCKET - Read from GitHub secrets.
  • S3_TERRAFORM_STATE_REGION - Read from GitHub secrets.
  • AWS_REGION - Read from GitHub secrets.
  • SERVICE - Has a default but can be set by the user in the .github/workflows files.
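
As an illustrative sketch only (the actual mapping lives in the .github/workflows files and may differ), BUILD_STAGE could be derived from the branch name in a workflow step like this:

# GITHUB_REF_NAME is set by GitHub Actions and holds the branch name
case "$GITHUB_REF_NAME" in
  main)        BUILD_STAGE="prod" ;;
  development) BUILD_STAGE="dev" ;;
  *)           BUILD_STAGE="dev" ;;
esac
# Make the value available to later workflow steps
echo "BUILD_STAGE=$BUILD_STAGE" >> "$GITHUB_ENV"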

Infrastructure Setup

For quick setup follow these instructions:

  • Create an empty repo within your GitHub account.
  • Check out this repository's development branch to your local drive and push it to your remote repo.
  • Assuming the GitHub actions have been set up correctly, the deployment will begin.

If you are having any issues please report a bug via the repo.

Infrastructure Setup (Alternative) - CodePipeline

If GitHub Actions is not your preferred CI/CD tool, you can deploy using CodePipeline instead. Note: After deploying the framework you should have two pipelines: the first is the CI/CD pipeline and the second is the framework pipeline.

The Terraform resources for this CI/CD can be found in the cicd folder. The following resources are deployed:

  • AWS CodePipeline
  • AWS CodeBuild
  • AWS Identity Access Management (IAM)
  • AWS Key Management Service (KMS)
  • Amazon Simple Storage Service (S3)

Before deployment you must configure some values to ensure the pipeline is connected correctly to your repository.

  1. Firstly, in deployment-scripts/cicd-env.sh you must change these values to match your desired repo and branch for CI/CD (see the sketch after this list):
    • E.g. TF_VAR_FULL_REPOSITORY_ID="ORG_NAME/REPO_NAME"
    • E.g. TF_VAR_BRANCH="development"
    • Optionally, you can also modify TF_VAR_CICD_SERVICE
  2. The following variables in deployment-scripts/quick-deploy.sh can optionally be modified but will work if left unchanged:
    • S3_TERRAFORM_STATE_REGION
    • TF_VAR_SERVICE
    • TF_VAR_BUILD_STAGE
  3. Finally, ensure you have the correct credentials configured in your terminal for deploying into your AWS account.
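
A sketch of what deployment-scripts/cicd-env.sh might look like after editing; the variable names come from the steps above, while the values and the use of export are placeholders to adapt:

#!/bin/bash
# Repository and branch that the CI/CD pipeline should track (replace with your own)
export TF_VAR_FULL_REPOSITORY_ID="ORG_NAME/REPO_NAME"
export TF_VAR_BRANCH="development"
# Optional: service name used to label the CI/CD resources
export TF_VAR_CICD_SERVICE="ml-toolkit-cicd"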

To deploy the CI/CD CodePipeline, run the following command:

bash deployment-scripts/deploy-codepipeline.sh

After deployment, navigate to CodePipeline in your AWS Account and select the 'Settings' > 'Connections' section in the left navigation bar.

There should be a pending 'github-connection', which you must update and connect to your GitHub repo.

Note: Your GitHub User must have permissions to connect the repo to AWS through CodeStar.
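
If you prefer the terminal, you can find the pending connection with the AWS CLI (the connection itself must still be completed in the console):

aws codestar-connections list-connections --query "Connections[?ConnectionStatus=='PENDING']"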

Once complete, the framework pipeline is ready to use and any changes you make to this repo will be deployed through CI/CD.

To destroy the framework, run the following commands in order:

bash deployment-scripts/quick-destroy.sh
bash deployment-scripts/destroy-codepipeline.sh

Sample Workflow Execution Details

The framework is currently set up to work with the following example runbook provided by AWS. More details can be found here.

  1. Once the infrastructure has been deployed, navigate to CodePipeline and select 'Release change' on the pipeline.
  2. The first stage will read the manifest.json file with all the configuration needed for the pipeline to run end to end.
    • The contents of this file are populated during the next stage, but a reference to all the keys can be found in config/manifest.json.
  3. The CreateSchema Lambda function updates the manifest file with resolved values.
    • Details of these values can be found in code/lambda/create_schema under the update_payload() function.
  4. A manual approval stage will prompt the user to upload the source files to the relevant location in S3.
    • The location in S3 will be {workflow_bucket}{timestamp_of_execution}/source.
    • A sample file has been provided at sample-files/cal_housing.csv, which can be uploaded to the above location (see the upload sketch after this list).
  5. After approval, the Machine Learning Step Function is invoked with the contents of the manifest file. During this phase the following occurs:
    • Preprocessing of the data (workflow/preprocess.py).
    • Model training (workflow/train.py).
    • Model evaluation (workflow/evaluation.py).
    • Model hosting.
  6. The output for all the steps in the Step Function can be found in the workflow bucket.
  7. CodePipeline's next stage will prompt the user to verify the model. This can be done using the output provided by the evaluation step.
  8. After verification, the inference stage begins. The UpdateSchema Lambda function is invoked and modifies the manifest payload.
    • Modifications can be found in code/lambda/update_schema under the update_payload() function.
  9. Another prompt in CodePipeline will ask the user to verify that the inference data has been uploaded.
    • The location in S3 will be {workflow_bucket}{timestamp_of_execution}/inference.
    • A sample file for inference can be found at sample-files/cal_pred.csv, which can be uploaded to the above location.
    • Note: This only needs to be uploaded if the EndpointType is configured as Batch instead of SageMaker Endpoint. This can be configured in the update_schema -> update_payload Lambda function code.
  10. Finally, the Machine Learning Step Function is invoked with the new payload. Depending on the configured EndpointType, either:
    • A Batch Transform job is invoked, or
    • A real-time endpoint is used.
  11. The output of inference can be found at {workflow_bucket}{timestamp_of_execution}/inference/out.
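
A sketch of the upload commands for steps 4 and 9, assuming you substitute your workflow bucket name and the timestamp of the current execution, and that the key layout matches the locations shown above:

# Step 4: upload the training source file
aws s3 cp sample-files/cal_housing.csv s3://{workflow_bucket}/{timestamp_of_execution}/source/

# Step 9: upload the inference file (only needed when EndpointType is Batch)
aws s3 cp sample-files/cal_pred.csv s3://{workflow_bucket}/{timestamp_of_execution}/inference/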

TL;DR Quick Setup

Configure your AWS credentials in the CLI with permissions to deploy to your account.

Deploy

bash deployment-scripts/quick-deploy.sh

Destroy

bash deployment-scripts/quick-destroy.sh