How to compile a model available on Hugging Face for AWS Inferentia using the Neuron SDK

Introduction

This repository shows how to compile Foundation Models (FMs), such as Meta-Llama-3-8B-Instruct from the Hugging Face model hub, for Neuron cores using Neuron SDK 2.18.1. The compilation process depends on the value of the NEURON_RT_NUM_CORES environment variable.
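
For example, targeting 8 Neuron cores (the num_neuron_cores value used later in this README) would mean setting the variable before the compilation step, as sketched below; if you use the download_compile_deploy.sh script described later, the core count is passed as a script argument instead:

    # assumption: compiling for 8 Neuron cores
    export NEURON_RT_NUM_CORES=8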

Prerequisites

  1. The Neuron SDK requires that you compile the model on an Inferentia instance, so this code needs to be run on an Inf2 EC2 instance. The Meta-Llama-3-8B-Instruct model was compiled on an inf2.24xlarge instance.

  2. Create an Inf2-based EC2 instance.

    1. Use the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) for your instance.
    2. Use inf2.24xlarge or trn1.32xlarge as the instance type.
    3. Have the AmazonSageMakerFullAccess policy attached to the IAM role associated with your EC2 instance, and add the following statement to the role's trust relationship (a CLI sketch for applying it follows this list).
      
      {
          "Effect": "Allow",
          "Principal": {
              "Service": "sagemaker.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
      }
      
  3. You need a valid Hugging Face token to download gated models from the Hugging Face model hub.
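
To apply the trust relationship from step 2 without using the console, one option is the aws iam update-assume-role-policy command. Note that this command replaces the entire trust policy, so the document you pass must also keep the existing ec2.amazonaws.com trust statement that the instance profile relies on; the role name and file name below are placeholders:

    # trust-policy.json must contain the full trust policy: the existing
    # ec2.amazonaws.com statement plus the sagemaker.amazonaws.com statement shown above
    aws iam update-assume-role-policy \
        --role-name <your-role-name> \
        --policy-document file://trust-policy.json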

It is best to use VS Code to connect to your EC2 instance, since the code is run from a bash shell.

High-level steps

  1. Download and install Conda on your EC2 VM.

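    For example, one way to do this is with Miniconda; the commands below are a sketch using the standard Linux x86_64 installer, adjust them for your setup:

    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
    source $HOME/miniconda3/bin/activate
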
  2. Clone this repo on the EC2 VM.

    git clone https://github.com/aarora79/compile-llm-for-aws-silicon.git
    
  3. Create a new conda environment for Python 3.10 and install the packages listed in requirements.txt.

    conda create --name awschips_py310 -y python=3.10 ipykernel
    source activate awschips_py310;
    pip install -r requirements.txt
    
  4. Change directory to the cloned repo directory.

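    Assuming the repo was cloned with its default directory name:

    cd compile-llm-for-aws-silicon
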
  5. Run the download_compile_deploy.sh script using the following command. The script performs the following steps:

    1. Download the model from Hugging Face.
    2. Compile the model for Neuron.
    3. Upload the model files to S3.
    4. Create a settings.properties file that refers to the model in S3 and create a model.tar.gz with the settings.properties.
    5. Deploy the model on a SageMaker endpoint.

    # replace the model id, bucket name and role parameters as appropriate
    hf_token=<your-hf-token>
    model_id=meta-llama/Meta-Llama-3-8B-Instruct
    neuron_version=2.18
    model_store=model_store
    s3_bucket="<your-s3-bucket>"
    s3_prefix=lmi
    region=us-east-1
    batch_size=4
    num_neuron_cores=8
    ml_instance_type=ml.trn1.32xlarge
    role="arn:aws:iam::<your-account-id>:role/<your-role-name>"
    ./scripts/download_compile_deploy.sh $hf_token \
     $model_id \
     $neuron_version \
     $model_store \
     $s3_bucket \
     $s3_prefix \
     $region \
     $role \
     $batch_size \
     $num_neuron_cores \
     $ml_instance_type > script.log 2>&1
    
  6. The model is now deployed. Note the endpoint name from the SageMaker console; you can use it for testing inference via the SageMaker invoke_endpoint call, as shown in infer.py included in this repo, and for benchmarking performance via the Bring Your Own Endpoint option in FMBench.
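
    For a quick smoke test from the shell you can call the endpoint with the AWS CLI (infer.py shows the equivalent boto3 call). The endpoint name below is a placeholder, and the inputs/parameters payload is the schema typically accepted by the LMI serving container; adjust both to match your deployment:

    endpoint_name=<your-endpoint-name>
    aws sagemaker-runtime invoke-endpoint \
      --endpoint-name $endpoint_name \
      --region us-east-1 \
      --content-type application/json \
      --cli-binary-format raw-in-base64-out \
      --body '{"inputs": "What is AWS Inferentia?", "parameters": {"max_new_tokens": 128}}' \
      output.json
    cat output.json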

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.