aws/sagemaker-huggingface-inference-toolkit

Support passing model_kwargs to pipeline

lukealexmiller opened this issue · 1 comment

I'm trying to deploy BLIP-2 (specifically Salesforce/blip2-opt-2.7b) to a SageMaker (SM) endpoint, but I'm running into some problems.

We can deploy this model by tar'ing the model artifacts as model.tar.gz and hosting it on S3, but creating a ~9 GB tar file is time-consuming and leads to slow deployment feedback loops.

Alternatively, the toolkit has experimental support for downloading models from the 🤗 Hub at startup, which is more time- and space-efficient.
However, this functionality only supports passing HF_TASK and HF_MODEL_ID as env vars. To run inference on this model with the GPUs available on SM (T4/A10), we need to pass additional model_kwargs, e.g.:

from transformers import pipeline
pipe = pipeline(model="Salesforce/blip2-opt-2.7b", model_kwargs={"load_in_8bit": True})
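
For context, this is roughly how the Hub-download path is configured today through the SageMaker Python SDK: only HF_MODEL_ID and HF_TASK are read by the toolkit, and there is no env var that reaches model_kwargs (the role ARN, framework versions, and instance type below are illustrative placeholders):

from sagemaker.huggingface import HuggingFaceModel

# The toolkit only reads HF_MODEL_ID and HF_TASK from the environment;
# there is currently no variable for model_kwargs such as load_in_8bit.
huggingface_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "Salesforce/blip2-opt-2.7b",
        "HF_TASK": "image-to-text",
    },
    role="arn:aws:iam::111122223333:role/sagemaker-execution-role",  # placeholder
    transformers_version="4.26",  # versions are illustrative
    pytorch_version="1.13",
    py_version="py39",
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",  # A10G GPU
)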

A potential solution: on line 104 of handler_service.py, no kwargs are passed through, even though the get_pipeline function it calls already accepts kwargs.
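
A minimal sketch of the requested behavior, assuming a new JSON-encoded env var (the name HF_MODEL_KWARGS is hypothetical and does not exist in the toolkit today) that the handler would decode and forward when it builds the pipeline:

import json
import os

from transformers import pipeline

# Hypothetical env vars the endpoint would be started with.
os.environ.setdefault("HF_MODEL_ID", "Salesforce/blip2-opt-2.7b")
os.environ.setdefault("HF_TASK", "image-to-text")
os.environ.setdefault("HF_MODEL_KWARGS", '{"load_in_8bit": true}')

# Decode the JSON payload and forward it as model_kwargs, the same way
# get_pipeline could forward extra kwargs to transformers.pipeline.
model_kwargs = json.loads(os.environ["HF_MODEL_KWARGS"])
pipe = pipeline(
    task=os.environ["HF_TASK"],
    model=os.environ["HF_MODEL_ID"],
    model_kwargs=model_kwargs,  # e.g. {"load_in_8bit": True}
)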

Hello @lukealexmiller,

Thank you for opening the request. It is a good idea to think about adding "HF_KWARGS" as a parameter.
In the meantime, you can enable this by creating a custom inference.py. See here for an example: https://www.philschmid.de/custom-inference-huggingface-sagemaker
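
For reference, a minimal sketch of that workaround, assuming a GPU instance and that bitsandbytes/accelerate are available in the container (the exact functions to override are described in the linked post):

# inference.py
from transformers import pipeline

def model_fn(model_dir):
    # Load BLIP-2 from the Hub with the model_kwargs the default handler
    # cannot forward; model_dir from the model archive is ignored here.
    return pipeline(
        "image-to-text",
        model="Salesforce/blip2-opt-2.7b",
        device_map="auto",
        model_kwargs={"load_in_8bit": True},
    )

def predict_fn(data, pipe):
    # data is the deserialized request body, e.g. {"inputs": <image URL>}
    return pipe(data["inputs"])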