SageMaker HuggingFaceModel fails on Phi-3 model deployment
manikawnth opened this issue · 2 comments
manikawnth commented
I'm not able to deploy the Phi-3 model from the Hugging Face model hub to SageMaker.
I tried multiple DLC containers, with and without trust_remote_code: true, but I still can't get it to run.
I receive the following error:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 222, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 420, in get_model
return FlashLlama(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 84, in __init__
model = FlashLlamaForCausalLM(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 368, in __init__
self.model = FlashLlamaModel(prefix, config, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 292, in __init__
[
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 293, in <listcomp>
FlashLlamaLayer(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 232, in __init__
self.self_attn = FlashLlamaAttention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 108, in __init__
self.query_key_value = load_attention(config, prefix, weights)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 43, in load_attention
bias = config.attention_bias
File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 263, in __getattribute__
return super().__getattribute__(key)
AttributeError: 'Phi3Config' object has no attribute 'attention_bias' rank=0
2024-05-21T16:19:40.764815Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-21T16:19:40.764834Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
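For context, the shard crash comes from TGI's Llama code path reading config.attention_bias unconditionally (load_attention in flash_llama_modeling.py in the traceback above), and Phi3Config did not define that attribute at the time. Below is a minimal sketch that reproduces the failing access outside of TGI and shows the kind of defensive default that would avoid it; it is only an illustration, not the actual TGI patch.

from transformers import AutoConfig

# Load the Phi-3 config the same way the container would resolve it.
config = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True
)

# TGI's load_attention does `bias = config.attention_bias`, which raises
# AttributeError when the attribute is absent (it was missing at the time of this issue).
print(hasattr(config, "attention_bias"))

# A defensive default, in the spirit of the upstream fix suggested later in this thread:
bias = getattr(config, "attention_bias", False)
print(bias)

My deployment code follows: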
from sagemaker import get_execution_role, Session
import boto3

sagemaker_session = Session()
region = boto3.Session().region_name

# get execution role
# please use execution role if you are using notebook instance or update the role arn if you are using a different role
execution_role = get_execution_role()

image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.0.3-gpu-py310-cu121-ubuntu22.04-v2.0'

from sagemaker.huggingface import HuggingFaceModel

hub = {
    'HF_TASK': 'text-generation',
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-128k-instruct',
    'TRUST_REMOTE_CODE': 'true',
    'HF_MODEL_TRUST_REMOTE_CODE': 'true'
}

huggingface_model = HuggingFaceModel(
    env=hub,
    image_uri=image_uri,
    role=execution_role,
    sagemaker_session=sagemaker_session
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge"
)
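For completeness, this is how I would invoke the endpoint once it starts; a sketch only, using the standard TGI text-generation payload shape.

# Hypothetical invocation once the endpoint is InService.
response = predictor.predict({
    "inputs": "Explain what Amazon SageMaker is in one sentence.",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7}
})
print(response)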
philschmid commented
We opened a PR to fix this. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/68
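Until that change lands on the main branch, one option is to pin the deployment to the PR revision; a sketch, assuming the TGI DLC forwards HF_MODEL_REVISION to the launcher's --revision flag and that Hub PRs are exposed as refs/pr/<number>.

# Sketch: pin the model revision to the Hub PR linked above.
hub = {
    'HF_TASK': 'text-generation',
    'HF_MODEL_ID': 'microsoft/Phi-3-mini-128k-instruct',
    'HF_MODEL_REVISION': 'refs/pr/68',  # revision ref for the linked discussion/PR
    'HF_MODEL_TRUST_REMOTE_CODE': 'true'
}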
manikawnth commented
@philschmid Thanks for that PR. It works fine when I point the deployment at that revision.
However, shouldn't the issue actually be fixed upstream, by initializing config.attention_bias = False?
OR