aws/sagemaker-huggingface-inference-toolkit

InternalServerException at runtime

krokoko opened this issue · 3 comments

Hi all,

I am trying to run https://huggingface.co/nomic-ai/gpt4all-13b-snoozy on a SageMaker endpoint using the HF inference toolkit. I extended the base DLC container to install a newer version of the transformers library (4.28.0). The endpoint deploys successfully; however, at runtime I get the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/nomic-ai__gpt4all-13b-snoozy with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."
}

Here is the code I'm using:

from sagemaker.huggingface import HuggingFaceModel

# role = SageMaker execution role, ecr_image = URI of my custom container (both defined earlier)

# Hub Model configuration. https://huggingface.co/models
hub_snoozy = {
    'HF_MODEL_ID': 'nomic-ai/gpt4all-13b-snoozy',
    'HF_TASK': 'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model_snoozy = HuggingFaceModel(
    image_uri=ecr_image,
    transformers_version='4.28.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub_snoozy,
    role=role,
)

predictor_snoozy = huggingface_model_snoozy.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.48xlarge',  # ec2 instance type
    endpoint_name='gpt4all-13b-snoozy-text2text-generation',
    container_startup_health_check_timeout=600
)

data = {
    "inputs": {
        "question": "What is used for inference?",
        "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

predictor_snoozy.predict(data)

where ecr_image is my custom container based on 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04.

SageMaker SDK version: 2.155.0

I also tested with a different task (text-generation) with the same outcome.

Am I doing something wrong?

Thank you!

@krokoko Llama was added in Transformers 4.28, which is not yet available as a DLC but should ship in the coming weeks. In the meantime you could create a custom inference script and a requirements.txt to add the latest versions. See here for an example: https://www.philschmid.de/custom-inference-huggingface-sagemaker
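For concreteness, the pattern from that post amounts to repackaging the model artifact with a code/ directory; the layout below is only a hedged sketch, and the pinned version is an assumption, not a verified setup for this model:

model.tar.gz
└── code/
    ├── requirements.txt   # e.g. transformers==4.28.1
    └── inference.py       # optional custom model_fn / predict_fn overrides

The Hugging Face inference container installs the packages listed in code/requirements.txt when the endpoint starts, which is how a newer transformers release gets pulled in without rebuilding the image.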

Additionally, I should mention that the default inference container does not do any form of model parallelism, so if the model does not fit on a single GPU you also need a custom script for that. Here is an example of how we did this for flan-ul2: https://www.philschmid.de/deploy-flan-ul2-sagemaker
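Adapted to this model, such a script could look roughly like the sketch below (assumptions: accelerate is added to code/requirements.txt alongside transformers, and the fp16 weights are sharded across the GPUs of the instance):

# code/inference.py -- sketch only, not a verified script for gpt4all-13b-snoozy
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

def model_fn(model_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # device_map="auto" lets accelerate shard the layers across all visible GPUs
    model = LlamaForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["inputs"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}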

@philschmid thanks for the quick reply! Sorry I wasn't clear: I extended the latest version of the DLC container to install Transformers v4.28.1 and v4.28.0, then uploaded the new containers to my private ECR repo. Something like:

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04
RUN pip install --upgrade 'transformers==4.28.1'

That makes things easier since I don't need to create a custom inference script. I use the custom container when instantiating the model (the image_uri parameter):

huggingface_model_snoozy = HuggingFaceModel(
    image_uri=ecr_image,
    transformers_version='4.28.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub_snoozy,
    role=role,
)

Regarding the default inference container, you mean that in that case a custom inference.py script is needed, is that correct?

Most likely, yes. But since you updated the image, there might also be an issue with the model weights. Normally you should be able to load it with the Llama class.
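A quick way to check that outside of SageMaker (a small sketch, assuming a machine with enough memory and transformers >= 4.28 installed):

from transformers import AutoTokenizer, LlamaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/gpt4all-13b-snoozy")
# if this succeeds, the checkpoint itself loads fine with the Llama class
model = LlamaForCausalLM.from_pretrained("nomic-ai/gpt4all-13b-snoozy")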