Make DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY configurable
pcolazurdo commented
DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY is currently defined as os.path.join(os.getcwd(), ".sagemaker/mms/models"),
which forces the export directory to live under the current working directory of the running process. On some SageMaker instances this is a relatively small partition that can't be extended. Allowing this value to be overridden through an environment variable would make it possible to download larger models on a wider range of instances (e.g. ml.g5.16xlarge).
To reproduce the problem, try this model (other large models fail the same way):
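A minimal sketch of what the override could look like, assuming a hypothetical environment variable named HF_HUB_MODEL_EXPORT_DIRECTORY (this name and the helper resolve_export_directory are illustrations only, not existing toolkit settings). The current cwd-based path stays as the fallback:

```python
import os

# Hypothetical variable name, used here for illustration only.
EXPORT_DIR_ENV_VAR = "HF_HUB_MODEL_EXPORT_DIRECTORY"


def resolve_export_directory(environ=None):
    """Return the export directory, preferring the environment override."""
    env = os.environ if environ is None else environ
    # Current behaviour: a path under the process working directory.
    default = os.path.join(os.getcwd(), ".sagemaker/mms/models")
    # An unset or empty variable falls back to the default.
    return env.get(EXPORT_DIR_ENV_VAR) or default


DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = resolve_export_directory()
```

With something like this in place, the variable could be set alongside the other entries in the env dict passed to the model, pointing at a larger volume such as a directory under /tmp.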
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role for the endpoint (assumes this runs inside a SageMaker notebook)
role = sagemaker.get_execution_role()

# Hub model configuration; all Hugging Face caches are redirected to /tmp
hub = {
    'HF_MODEL_ID': 'Salesforce/instructblip-flan-t5-xxl',
    'HF_TASK': 'image-to-text',
    'SM_NUM_GPUS': '1',
    'HF_HOME': '/tmp/hf_home',
    'HF_ASSETS_CACHE': '/tmp/hf_assets_cache',
    'HF_DATASETS_CACHE': '/tmp/hf_cache',
    'HF_DATASETS_HOME': '/tmp/hf_home',
    'HF_HUB_CACHE': '/tmp/hf_hub_cache',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.37.0',
    pytorch_version='2.1.0',
    py_version='py310',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.16xlarge',  # ec2 instance type
    # volume_size=256
)
The error in CloudWatch is similar to:
OSError: [Errno 28] No space left on device: '/tmp/hf_hub_cache/tmpd1hcphh0' -> '/.sagemaker/mms/models/Salesforce__instructblip-flan-t5-xxl/pytorch_model-00001-of-00005.bin'
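The paths in the error suggest the failure happens when the file is moved from the cache under /tmp into the export directory under the working directory. A quick way to compare free space on the two locations, as a rough diagnostic sketch (the helper free_gib is hypothetical, and which partitions these paths map to depends on the instance):

```python
import os
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, of the filesystem containing path."""
    return shutil.disk_usage(path).free / 1024**3


# Compare the partition holding the working directory with the temp partition.
for label, path in (("working directory", os.getcwd()),
                    ("temp directory", tempfile.gettempdir())):
    print(f"{label} ({path}): {free_gib(path):.1f} GiB free")
```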