aws/sagemaker-huggingface-inference-toolkit

batch transform fails to install

scotteggs opened this issue · 2 comments

I'm using HuggingFaceModel with the SageMaker Hugging Face inference toolkit to create batch transform jobs for large-scale inference data pipelines. I started from the example and walkthrough here: https://huggingface.co/docs/sagemaker/inference

I regularly run into an issue where about half of the submitted jobs fail at around the 35-minute mark.

Looking at the logs, what I see is:

ValueError("failed to install required packages")

This is the same error across all of the failures.

I run a lot of batch transform jobs on production-scale data, and I'm wondering if we're simply being throttled by pip.

Is there a way to use the HuggingFaceModel container creation once, and then reuse the registered model within SageMaker, so that we're not re-creating the container with each new execution?

I have looked around and do not see any examples like this.

This is the general approach we have for creating the batch transform jobs:

from datetime import datetime

from sagemaker.huggingface.model import HuggingFaceModel

# task the inference toolkit should run
hub = {"HF_TASK": "summarization"}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=model_s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    env=hub,
)

batch_job = huggingface_model.transformer(
    instance_count=instance_count,
    instance_type=instance_type,
    strategy="SingleRecord",
    output_path=output_path,
    assemble_with="Line",
)

formatted_time = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

batch_job.transform(
    data=input_path,
    content_type="application/json",
    split_type="Line",
    logs=False,
    wait=False,
    job_name=f"my-batch-job-batch-{batch}-shard-{shard}-{formatted_time}",
    model_client_config={"InvocationsMaxRetries": 3, "InvocationsTimeoutInSeconds": 600},
)
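
On the reuse question, one possible direction is to attach later jobs to the already registered model by name instead of going through HuggingFaceModel again. The following is only a sketch under assumptions: the helper names are made up for illustration, the model name would come from the first job (for example `batch_job.model_name`), and it relies on `sagemaker.transformer.Transformer` accepting an existing model name. Each transform job still starts its own containers, so this by itself may not avoid package installation at startup.

```python
from typing import Any, Dict


def transformer_kwargs(model_name: str, instance_count: int,
                       instance_type: str, output_path: str) -> Dict[str, Any]:
    """Collect the same Transformer settings used in the snippet above.

    Hypothetical helper: it only builds a dict and makes no AWS calls.
    """
    return {
        "model_name": model_name,        # name of a model already registered in SageMaker
        "instance_count": instance_count,
        "instance_type": instance_type,
        "strategy": "SingleRecord",
        "assemble_with": "Line",
        "output_path": output_path,
    }


def transformer_for_existing_model(**kwargs: Any):
    """Build a Transformer that points at an existing SageMaker model."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.transformer import Transformer

    return Transformer(**kwargs)
```

The returned object would then be used like `batch_job` above, calling `.transform(...)` with the same arguments.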

@philschmid thanks for the great examples, I'm wondering if you have any thoughts on this.

Hello @scotteggs,

We have a working example here: https://github.com/huggingface/notebooks/blob/main/sagemaker/12_batch_transform_inference/sagemaker-notebook.ipynb

Regarding the error

ValueError("failed to install required packages")

which packages are you installing?
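
If the packages come from a requirements file bundled with the model, one way to check is to look inside the archive. This is a minimal sketch, assuming the toolkit installs from `code/requirements.txt` inside `model.tar.gz` at container startup and that the archive has been downloaded locally; the function name `read_requirements` is illustrative and not part of any library.

```python
import tarfile
from typing import List, Optional

# Assumed location of the requirements file inside the model archive.
REQUIREMENTS_MEMBER = "code/requirements.txt"


def read_requirements(archive_path: str) -> Optional[List[str]]:
    """Return the requirement lines found in a local model.tar.gz.

    Returns None when the archive has no code/requirements.txt.
    Blank lines and comment lines are skipped.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Archives created with "tar -C dir ." prefix names with "./".
            name = member.name[2:] if member.name.startswith("./") else member.name
            if name == REQUIREMENTS_MEMBER and member.isfile():
                handle = archive.extractfile(member)
                text = handle.read().decode("utf-8")
                return [
                    line.strip()
                    for line in text.splitlines()
                    if line.strip() and not line.strip().startswith("#")
                ]
    return None
```

Calling `read_requirements("model.tar.gz")` on a downloaded copy of the archive lists what pip would be asked to install, which helps tell a broken or unpinned requirement apart from a transient network failure.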