aws/sagemaker-huggingface-inference-toolkit

InternalServerException while deploying HuggingFace model on SageMaker

KartikKannapur opened this issue · 6 comments

Background

We are working to deploy the AllenAI Cosmo XL model (https://huggingface.co/allenai/cosmo-xl) for inference on SageMaker.

Our Approach

We are using a SageMaker notebook.
Instance: ml.p3.2xlarge
Kernel: conda_pytorch_p39

The code is taken directly from the Hugging Face model page: https://huggingface.co/allenai/cosmo-xl
We selected Deploy -> Amazon SageMaker, chose Task=Conversational and Configuration=AWS, and copied the generated code into our notebook.

Error

The model is created and deployed successfully.
However, when we call predictor.predict for inference, we run into the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/allenai__cosmo-xl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}

Relevant error messages from CloudWatch

2023-03-15T19:53:32,781 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-03-15T19:53:32,783 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2023-03-15T19:53:32,784 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 549, in pipeline
2023-03-15T19:53:32,785 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}."
2023-03-15T19:53:32,785 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Could not load model /.sagemaker/mms/models/allenai__cosmo-xl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>)

Hello @KartikKannapur,

Thank you for opening the issue. The snippet is currently not up to date with respect to the versions. Could you try with transformers version 4.26?

Hey @philschmid

Thank you for looking into this.

I updated the transformers version to 4.26 (the SageMaker SDK version was 2.132.0) and received the following error message:

ValueError: Unsupported huggingface version: 4.26.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer huggingface versions. Supported huggingface version(s): 4.6.1, 4.10.2, 4.11.0, 4.12.3, 4.17.0, 4.6, 4.10, 4.11, 4.12, 4.17.

So I updated the SageMaker SDK to version 2.140.1, but when I ran predictor.predict for inference, I hit the same error as before.
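
For reference, the upgrade was along these lines (a rough sketch of the notebook cell; the exact pins are my paraphrase rather than the literal cell contents):

# Notebook cell (sketch): upgrade the SageMaker SDK, pin transformers, then restart the kernel
%pip install -U "sagemaker>=2.140.1" "transformers==4.26.0"

import sagemaker, transformers
print(sagemaker.__version__, transformers.__version__)  # should report 2.140.1 or later, and 4.26.0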

Any suggestions?

Could you share the code you used to deploy? I will then try to reproduce it.

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
	'HF_MODEL_ID':'allenai/cosmo-xl',
	'HF_TASK':'conversational'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
	transformers_version='4.26.0',
	pytorch_version='1.13.1',
	py_version='py39',
	env=hub,
	role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
	'inputs': {
		"past_user_inputs": ["Which movie is the best ?"],
		"generated_responses": ["It's Die Hard for sure."],
		"text": "Can you explain why ?"
	}
})

Thanks @philschmid
Appreciate your support on this.

Ah, I see. The conversational task is not correct here. Not sure why they added it manually. Can you switch the task to text2text-generation? Then it should work.
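
Roughly like this (a minimal sketch: everything except HF_TASK and the request payload stays as in your snippet, and the example input string is only an illustration loosely following the prompt format described on the model card):

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Same hub config as before, only the task is switched to text2text-generation
hub = {
    'HF_MODEL_ID': 'allenai/cosmo-xl',
    'HF_TASK': 'text2text-generation'
}

huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

# text2text-generation expects a plain string, not the conversational payload
predictor.predict({
    'inputs': "You are Cosmo and you are talking to a friend. <sep> Which movie is the best? <turn> It's Die Hard for sure. <turn> Can you explain why?"
})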

Also, from reading the model card (https://huggingface.co/allenai/cosmo-xl#how-to-use), it might make sense to create a custom inference.py to get the "conversational" flow into the generation method.
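
A rough sketch of what such an inference.py could look like, adapted from the generation code on the model card (model_fn/predict_fn are the toolkit's standard handler overrides; the separator tokens, field names, and generation parameters are assumptions to be double-checked against the card):

# code/inference.py -- custom handler sketch for the Hugging Face inference toolkit
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def model_fn(model_dir):
    # Load tokenizer and model from the directory the toolkit places the model in
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    return model, tokenizer


def _build_prompt(situation, instruction, conversation):
    # Cosmo's input format: situation <sep> instruction <sep> turn1 <turn> turn2 ...
    text = " <turn> ".join(conversation)
    if instruction:
        text = f"{instruction} <sep> {text}"
    if situation:
        text = f"{situation} <sep> {text}"
    return text


def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    prompt = _build_prompt(
        data.get("situation", ""),
        data.get("instruction", ""),
        data.get("conversation", []),
    )
    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            inputs["input_ids"],
            max_new_tokens=128,
            temperature=1.0,
            top_p=0.95,
            do_sample=True,
        )
    return {"generated_text": tokenizer.decode(output[0], skip_special_tokens=True)}

You could bundle this under a code/ directory inside a model.tar.gz, or (I believe) pass entry_point='inference.py' and source_dir to HuggingFaceModel while keeping the HF_MODEL_ID env var.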

Understood. Thanks @philschmid

I'm going to go the custom inference.py route.