Deploy a multi LLM powered chatbot using AWS CDK on AWS

Table of content

Features
Architecture
Security
Precautions
Service Quotas and Preview Access
Deploy
Clean up
Deploying additional models
Credits
License

Features

Comprehensive, ready to use

This sample provides code ready to use so you can start experimenting with different LLMs and prompts.

For additional features provided in this sample such as Amazon SageMaker Foundation models support you are required to request quota increases and preview access.

Multimodel

You have the flexibility to test multiple LLM models concurrently. This unique feature is enabled by a user-friendly web UI, allowing for a experimentation of different models within your own VPC.

AWS Lambda Response Streaming

The sample takes advantage of the newly released AWS Lambda Response Streaming feature, replicating LLM streaming capabilities even with synchronous requests to SageMaker endpoint by querying for completion for small batches of tokens iteratively.


Small batches of tokens predictions	Standard single blocking request

Full-fledged UI

The repository also includes a full-fledged UI built with React to interact with the deployed LLMs as chatbots. It supports both standard requests and streaming modes for hitting LLM endpoint, allows managing conversation history, switching between deployed models, and even features a dark mode for user preference. Hosted on Amazon S3 and distributed with Amazon CloudFront.

LLM sources

This sample comes with a prupose-built CDK Construct, LargeLanguageModel, which helps abstracting 3 different types of model deployments

SageMaker Foundation Models

The sample allows you to deploy models from Amazon SageMaker Foundation models by specifying the model ARN. This simplifies the deployment process of these powerful AI models on AWS.

new LargeLanguageModel(this, 'FoundationModelId', {
  vpc,
  region: this.region,
  model: {
    kind: ModelKind.Package,
    modelId: 'modelId', // i.e. ai21/j2-grande-instruct-v1 - this is an arbitrary ID  
    instanceType: 'instanceType', // i.e. ml.g5.12xlarge
    packages: (scope) =>
      new cdk.CfnMapping(scope, 'ModelPackageMapping', {
        lazy: true,
        mapping: {
          'region': { arn: 'container-arn' }, // ARN usually found in sample notebook from SageMaker foundation page
        },
      }),
    },
});

Hugging Face LLM Inference Container

The solution provides support for all publicly accessible LLMs supported by HuggingFace LLM Inference container, thereby expanding your model options and letting you leverage a wide variety of pre-trained models available on this platform.

new LargeLanguageModel(this, 'HFModel', {
    vpc,
    region: this.region,
    model: {
      kind: ModelKind.Container,
      modelId: 'modelId', // i.e. tiiuae/falcon-40b-instruct - this must match HuggingFace Model ID
      container: ContainerImages.HF_PYTORCH_LLM_TGI_INFERENCE_LATEST,
      instanceType: 'instanceType', // i.e. ml.g5.24xlarge
      env: {
        ...
      },
    },
  });

Models with custom inference

While the options above are preferred, for broader compatibility, the sample also showcases deployment of all other models from Hugging Face not supported by HuggingFace LLM Infernce container using custom inference code. This process is powered by AWS CodeBuild.

For this kind of deployment you need to choose the right container for your model from this list of AWS Deep Learning Containers. Based on PyTorch/Transformers versions, Python version etc.

new LargeLanguageModel(this, 'ModelId', {
  vpc,
  region: this.region,
  model: {
    kind: ModelKind.CustomScript,
    modelId: 'modelId', // i.e. sentence-transformers/all-MiniLM-L6-v2 - this must match HuggingFace Model ID
    codeFolder: 'localFolder', // see for example ./lib/aurora-semantic-search/embeddings-model
    container: 'container-arn', // One from https://github.com/aws/deep-learning-containers/blob/master/available_images.md
    instanceType: 'instanceType', // i.e. g5.12xlarge
    codeBuildComputeType: codebuild.ComputeType.LARGE, // Size of CodeBuild instance. Must have enough storage to download the whole model repository from HuggingFace
  }
});

Semantic search

The sample provides an optional stack to implement a vector database on Amazon Aurora PostgreSQL with pgvector and embeddings.

Allowing Hybrid Searches performed with a combination of Similiary Search and a Full Text Search, which enable an emerging patterns in LLM applications such as "In-Context Learning" (RAG) with automatic document indexing on Amazon S3 upload.

Architecture

Here's an overview of the sample's architecture

VPC Stack

This stack deploys public, private, and isolated subnets. The public subnet is used for the chatbot backend supporting the user interface, the private subnet is used for SageMaker models, and the isolated subnet is used for the RDS database. Additionally, this stack deploys VPC endpoints for SageMaker endpoints, AWS Secrets Manager, S3, and Amazon DynamoDB, ensuring that traffic stays within the VPC when appropriate.

ChatBot Stack

This stack contains the necessary resources to set up a chatbot system, including:

The ability to deploy one or more large language models through a custom construct, supporting three different techniques:
- Deploying models from SageMaker Foundation models by specifying the specific model ARN.
- Deploying models supported by the HuggingFace TGI container.
- Deploying all other models from Hugging Face with custom inference code.
Backend resources for the user interface, including chat backend actions and a Cognito user pool for authentication.
A DynamoDB-backed system for managing conversation history.

This stack also incorporates "model adapters", enabling the setup of different parameters and functions for specific models without changing the core logic to perform requests and consume responses from SageMaker endpoints for different LLMs.

[Optional] Semantic Search Stack

This stack is disabled by default. To enable it update bin/aws-genai-llm-chatbot.ts

An optional semantic search stack that deploys:

A vector database via a custom construct built on top of Amazon Aurora PostgreSQL with pgvector.
An embeddings model on SageMaker to generate embeddings.
Encoders model on SageMaker used to rank sentences by similarity.
An S3 bucket to store documents that, once uploaded, are automatically split up, converted into embeddings, and stored in the vector database.
A Lambda function showcasing how to run hybrid search with pgvector. This function also serves as the entry point for this stack.

[Optional] User Interface

This stack is enabled by default. To disable it update bin/aws-genai-llm-chatbot.ts

A comprehensive UI built with React that interacts with the deployed LLMs as chatbots, supporting sync requests and streaming modes to hit LLM endpoints, managing conversation history, stopping model generation in streaming mode, and switching between all deployed models for experimentation.

Security

This sample underscores the importance of infrastructure security in deploying LLM applications. Here are the key security measures showcased in this sample:

Deployment in Private and Isolated Subnets

The LLM models and vector databases are deployed in private and isolated subnets, providing an additional layer of protection.

Use of VPC Endpoints

VPC endpoints are used for in-VPC traffic, ensuring that traffic that doesn't need to leave the VPC stays within the VPC.

Amazon Cognito Authentication for User Interface

Leverage Amazon Cognito for user interface authentication, ensuring secure access to the chatbot.

Precautions

Before you begin using the sample, there are certain precautions you must take into account:

Cost Management: Be mindful of the costs associated with AWS resources. While the sample is designed to be cost-effective, leaving resources running for extended periods or deploying numerous LLMs can quickly lead to increased costs.
Licensing obligations: If you choose to use any datasets or models alongside the provided samples, ensure you check LLM code and comply with all licensing obligations attached to them.

Service Quotas and Preview Access

No service quota or preview access is needed to start experimenting with the provided sample. However to leverage specific features and for enchanced speed you are currently required to request quota increase and preview access. Specifically:

Instance type quota increase You might consider requesting an increase in service quota for specific SageMaker instance types such as the ml.g5 instance type. This will give access to latest generation of GPU/Multi-GPU instances types. You can do this from the AWS console.
Foundation Models Preview Access If you are looking to deploy models from SageMaker foundation models, you need to request preview access from the AWS console. Futhermore, make sure which regions are currently supported for SageMaker foundation models.

Deploy

Prerequisites

Request quota increase for ml.g5 instances

Most LLMs require instances with 1 or multiple GPUs. If you are deploying models that can run on CPUs only, you can skip this step.

To achieve this you will need to request specific ml.g5 instance types depending on the model deployed.

For example, with Falcon-40B-instrcut you should request ml.g5.24xlarge instance.

For example look for: ml.g5.24xlarge for endpoint usage in case your model needs a ml.g5.24xlarge.

Quota increase can be requested from the AWS console under self-service Service Quotas.

Environment

Verify that your environment satisfies the following prerequisites:

You have:

An AWS account
AdministratorAccess policy granted to your AWS account (for production, we recommend restricting access as needed)
Both console and programmatic access
AWS CLI installed and configured to use with your AWS account
NodeJS 18+ installed
Typescript 3.8+ installed
AWS CDK CLI installed
Docker installed
Python 3+ installed

Prepare CDK

The solution will be deployed into your AWS account using infrastructure-as-code wih the AWS Cloud Development Kit (CDK).

Clone the repository:

git clone https://github.com/aws-samples/aws-genai-llm-chatbot.git

Navigate to this project on your computer using your terminal:

cd aws-genai-llm-chatbot

Install the project dependencies by running this command:

npm install

(Optional) Bootstrap AWS CDK on the target account and regioon

Note: This is required if you have never used AWS CDK before on this account and region combination. (More information on CDK bootstrapping).

npx cdk bootstrap aws://{targetAccountId}/{targetRegion}

Deploy the solution to your AWS Account

Verify that Docker is running with the following command:

docker version

Note: If you get an error like the one below, then Docker is not running and need to be restarted:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Deploy the sample using the following CDK command:

npx cdk deploy --all

Note: This step duration can vary a lot, depending on the model(s) you want to deploy, for example the very first deployment with Falcon-40B can take about 30 minutes.

You can view the progress of your CDK deployment in the CloudFormation console in the selected region.
Once deployed, take note of the UI URL GenAI-ChatBotUIStack.DomainName value

...
Outputs:
GenAI-ChatBotUIStack.DomainName = dxxxxxxxxxxxxx.cloudfront.net
...

Make sure to add a user to the generated Cognito User Pool from GenAI-ChatBotStack in order to be able to access the webapp.

Clean up

You can remove the stacks and all the associated resources created in your AWS account by running the following command:

npx cdk destroy --all

Deploying additional models

As part of this sample, you can find some additional model by uncommenting code in lib/chatbot-stack.ts

Note: We strongly suggest to deploy one new model at the time, since the SageMaker endpoint creations and rollback time can take time.

For new models make sure to select the right type of kind depending on the source of the LLM see related section above.

Create a model adapter by extending the ModelAdapterBase.

You can find examples of model adapters in lib/chatbot-backend/functions/send-message/adapters

export class NewModelAdapter extends ModelAdapterBase {
  /*
    Set up model-specific LancgChain content handler.
  */
  getContentHandler() {
    return new NewModelContentHandler();
  }

  /*
    Method to give ability to generated model-specific prompts
  */
  async getPrompt(args: GetPromptArgs) {
    /* 
      For each request, you will have access to:

      args.prompt 
      string containg the user request
      
      args.context
      if semantic serach is enabled and hybrid search returns text similar to prompt is available here to use as context in the format of 
      [
        'string1',
        'string2',
      ]

      args.history
      list of previous turns of conversation in the format of 
      [
        { sender: 'user/system', 'content': 'message'},
        ...
      ]
      
       Example code:

      const { prompt, history, context } = args;

      const historyString = history.map((h) => `${h.sender}: ${h.content}`).join('\n');
      const contextString = context.length > 0 ? context.join('\n') : 'No context.';

      let completePrompt = `You are a helpful AI assistant. The following is a conversation between you (the system) and the user.\n${historyString || 'No history.'}\n\n`;
      completePrompt += `This is the context for the current request:\n${contextString}\n`;
      completePrompt += `Write a response that appropriately completes the request based on the context provided and the conversastion history.\nRequest:\n${prompt}\nResponse:\n`;

      return completePrompt;

    */
  }

  /*
    Method to define the model-specific stopWords
    stopWords are used as reference to when stop text generations.
  */
  async getStopWords() {
  /*
    Example:

    return ['<|endoftext|>', 'User:', 'Falcon:', '</s>'];
  */
  }
}

Add your Adapter to the model registry.

...
modelAdapterRegistry.add(/^your-regex/, NewModelAdapter);

RegEx expression will allow you to use the same adapter for a different models matching your regex.

For example /^tiiuae\/falcon/ will match all model IDs starting with tiiuae/falcon

Finally, add your model to lib/chatbot-stack.ts

new LargeLanguageModel(this, 'NewModel', {
    vpc,
    region: this.region,
    model: {
      kind: ModelKind.Container, // or the preferred kind
      modelId: 'modelId', // i.e. tiiuae/falcon-40b-instruct, 
      container: ContainerImages.HF_PYTORCH_LLM_TGI_INFERENCE_LATEST,
      instanceType: 'instanceType', // i.e. ml.g5.24xlarge
      env: {
        ...
      },
    },
  });

Credits

This sample was made possible thanks to the following libraries:

langchain from Harrison Chase
pgvector from Andrew Kane

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Changelog of the project.
License of the project.
Code of Conduct of the project.
CONTRIBUTING for more information.

crajah/aws-genai-llm-chatbot