aws/sagemaker-huggingface-inference-toolkit

Make DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY configurable

Opened this issue · 0 comments

In DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = os.path.join(os.getcwd(), ".sagemaker/mms/models") the export directory is forced to live under the current working directory of the running process. On some SageMaker instances this path sits on a relatively small partition that can't be extended. Allowing this value to be overridden via an environment variable would make it possible to download larger models on a variety of instances (e.g. ml.g5.16xlarge).
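A minimal sketch of the requested change, assuming a hypothetical environment variable name (HF_HUB_MODEL_EXPORT_DIRECTORY is my suggestion, not an existing toolkit setting), keeping the current path as the fallback:

```python
import os

# Hypothetical: let an env var override the hard-coded export directory.
# "HF_HUB_MODEL_EXPORT_DIRECTORY" is an assumed name for illustration;
# the default below is the toolkit's current behavior.
DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = os.environ.get(
    "HF_HUB_MODEL_EXPORT_DIRECTORY",
    os.path.join(os.getcwd(), ".sagemaker/mms/models"),
)
```

With that in place, users could point the export directory at a larger partition (e.g. /tmp or an attached volume) the same way the HF_* cache variables are redirected in the repro below.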

To reproduce the problem you can try this particular model (other large models fail the same way):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# redirect the Hugging Face cache locations to /tmp; the export
# directory itself cannot be redirected, which is the point of this issue
hub = {
    'HF_MODEL_ID': 'Salesforce/instructblip-flan-t5-xxl',
    'HF_TASK': 'image-to-text',
    'SM_NUM_GPUS': '1',
    'HF_HOME': '/tmp/hf_home',
    'HF_ASSETS_CACHE': '/tmp/hf_assets_cache',
    'HF_DATASETS_CACHE': '/tmp/hf_cache',
    'HF_DATASETS_HOME': '/tmp/hf_home',
    'HF_HUB_CACHE': '/tmp/hf_hub_cache'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.37.0',
    pytorch_version='2.1.0',
    py_version='py310',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.16xlarge',  # ec2 instance type
    # volume_size=256
)

The resulting error in the CloudWatch logs is similar to:

OSError: [Errno 28] No space left on device: '/tmp/hf_hub_cache/tmpd1hcphh0' -> '/.sagemaker/mms/models/Salesforce__instructblip-flan-t5-xxl/pytorch_model-00001-of-00005.bin'