KeyError: 'length' during training when following the MLOps workshop
MrRobotV8 opened this issue · 9 comments
AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python3.8 train.py --epochs 1 --eval_batch_size 64 --fp16 True --learning_rate 3e-5 --model_id distilbert-base-uncased --train_batch_size 32" Traceback (most recent call last): File "train.py", line 46, in train_dataset = load_from_disk(args.training_dir)
Hello @MrRobotV8,
Can you please provide more context regarding your error? That message alone is not enough to reproduce the issue. Have you prepared the dataset correctly and uploaded it to S3?
Hi @philschmid ,
I am following the third workshop (MLOps): I watched the videos and read the blog post on AWS. My notebook runs on an ml.t3.medium instance with the conda_pytorch38 kernel.
As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.
This is the Processing Step, that seems to run properly:
```python
processing_output_destination = f"s3://{bucket}/{s3_prefix}/data"

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    instance_type="ml.c5.2xlarge",
    instance_count=1,
    base_job_name=base_job_prefix + "/preprocessing",
    sagemaker_session=sagemaker_session,
    role=role,
)

step_process = ProcessingStep(
    name="ProcessDataForTraining",
    cache_config=cache_config,
    processor=sklearn_processor,
    job_arguments=[
        "--transformers_version", transformers_version,
        "--pytorch_version", pytorch_version,
        "--model_id", model_id_,
        "--dataset_name", dataset_name_,
    ],
    outputs=[
        ProcessingOutput(
            output_name="train",
            destination=f"{processing_output_destination}/train",
            source="/opt/ml/processing/train",
        ),
        ProcessingOutput(
            output_name="test",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/test",
        ),
        ProcessingOutput(
            output_name="validation",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/validation",
        ),
    ],
    code="./scripts/preprocessing.py",
)
```
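In the step above, the `test` and `validation` outputs point at the same S3 destination. The thread does not say whether that is intended, but it is easy to spot mechanically. A minimal sketch, using plain dictionaries instead of `ProcessingOutput` objects; the `find_shared_destinations` helper and the bucket name are hypothetical:

```python
from collections import defaultdict


def find_shared_destinations(outputs):
    """Return {destination: [output names]} for destinations used more than once."""
    by_destination = defaultdict(list)
    for out in outputs:
        by_destination[out["destination"]].append(out["output_name"])
    return {dest: names for dest, names in by_destination.items() if len(names) > 1}


# Mirrors the three outputs of the processing step; "my-bucket/prefix" is a placeholder.
base = "s3://my-bucket/prefix/data"
outputs = [
    {"output_name": "train", "destination": f"{base}/train"},
    {"output_name": "test", "destination": f"{base}/test"},
    {"output_name": "validation", "destination": f"{base}/test"},
]

shared = find_shared_destinations(outputs)
print(shared)
```

An empty result would mean every output has its own destination.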
Then the training step of my pipeline failed with the following error:
AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "KeyError: 'length'" Command "/opt/conda/bin/python3.8 train.py --epochs 3 --eval_batch_size 64 --fp16 True --learning_rate 3e-05 --model_id distilbert-base-uncased --train_batch_size 32", exit code: 1
```python
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    base_job_name=base_job_prefix + "/training",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version=transformers_version,
    pytorch_version=pytorch_version,
    py_version=py_version,
    hyperparameters={
        'epochs': epochs,
        'eval_batch_size': eval_batch_size,
        'train_batch_size': train_batch_size,
        'learning_rate': learning_rate,
        'model_id': model_id,
        'fp16': fp16
    },
    sagemaker_session=sagemaker_session,
)

step_train = TrainingStep(
    name="TrainHuggingFaceModel",
    estimator=huggingface_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri
        ),
    },
    cache_config=cache_config,
)
```
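The failing command in the error message shows the hyperparameters reaching `train.py` as command-line flags. The sketch below only illustrates that mapping with a hypothetical `to_cli_args` helper; it is not SageMaker's implementation, and the alphabetical ordering is an assumption taken from the order seen in the error message.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...], sorted by key."""
    args = []
    for key in sorted(hyperparameters):
        args += [f"--{key}", str(hyperparameters[key])]
    return args


# Values copied from the command shown in the error message.
hyperparameters = {
    "epochs": 3,
    "eval_batch_size": 64,
    "train_batch_size": 32,
    "learning_rate": "3e-05",
    "model_id": "distilbert-base-uncased",
    "fp16": True,
}

command = " ".join(["train.py"] + to_cli_args(hyperparameters))
print(command)
```

Every value arrives in the script as a string, so `train.py` has to convert types itself when it parses the arguments.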
The train.py file, like all the other files (evaluate.py, deploy_handler.py, etc.), is copied and pasted from the repo.
At the end of the processing step, the data is uploaded to S3 under the correct path. I see three files for train and the same three (with different sizes) for test: dataset_info.json, dataset.arrow and state.json.
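One way to look at those files before training is to download one split and list what each JSON file contains. This is a rough sketch using only the standard library; the file names `dataset_info.json` and `state.json` are assumed from the listing above, and `inspect_saved_dataset` is a hypothetical helper, not part of the `datasets` library.

```python
import json
from pathlib import Path

# JSON metadata files assumed to sit next to the Arrow data.
EXPECTED_JSON_FILES = ("dataset_info.json", "state.json")


def inspect_saved_dataset(path):
    """Report missing metadata files, Arrow files, and top-level JSON keys."""
    root = Path(path)
    report = {"missing": [], "arrow_files": [], "json_keys": {}}
    for name in EXPECTED_JSON_FILES:
        candidate = root / name
        if not candidate.is_file():
            report["missing"].append(name)
            continue
        with candidate.open() as fh:
            report["json_keys"][name] = sorted(json.load(fh).keys())
    report["arrow_files"] = sorted(p.name for p in root.glob("*.arrow"))
    return report


# Example (path is a placeholder for a locally downloaded copy of one split):
# print(inspect_saved_dataset("./data/train"))
```

Comparing the reported keys between a split that loads and one that fails may narrow down where a `KeyError` comes from.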
Is the SKLearn framework version (0.23-1) too outdated?
> As you may have seen in the forum post, I also tried updating the transformers and PyTorch versions and changing the dataset.
To which versions have you updated?
package versions
transformers_version = "4.17.0"
pytorch_version = "1.10.2"
py_version = "py38"
model_id_="distilbert-base-uncased"
dataset_name_="emotion"
Using cached sagemaker-2.119.0-py2.py3-none-any.whl
And datasets? Are you using 1.18.X? That's the latest version installed in the container: https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/buildspec.yml#LL42C42-L42C48
I didn't explicitly define it. The preprocessing file in the repo does:
install("datasets[s3]")
I have now changed it to:
install("datasets[s3]==1.18.4")
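For reference, a pinned install can be sketched as below. The body of `install` is an assumption about what the repo's helper does (a plain `pip install` through the current interpreter), and `pinned_requirement` and `installed_version` are hypothetical helpers added for illustration.

```python
import subprocess
import sys
from importlib import metadata


def pinned_requirement(package, version, extras=()):
    """Build a pip requirement string such as 'datasets[s3]==1.18.4'."""
    extra_part = f"[{','.join(extras)}]" if extras else ""
    return f"{package}{extra_part}=={version}"


def install(requirement):
    """Install one requirement into the current interpreter with pip (assumed behaviour)."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", requirement])


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


requirement = pinned_requirement("datasets", "1.18.4", extras=["s3"])
print(requirement)
# install(requirement)                  # run inside the processing container
# print(installed_version("datasets"))  # confirm which version was picked up
```

Logging the installed version in both the processing and the training job makes a mismatch between the two containers visible in the job logs.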
I should be able to share the output in about 10 minutes.
Training Image is: 763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
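The tag of that image already encodes the PyTorch, transformers and Python versions, so it can be compared against the versions used in preprocessing. A small sketch with a hypothetical `parse_training_image` helper; the regular expression is an assumption based only on the tag shown above and may not cover other image tags.

```python
import re

# Pattern derived from the single tag quoted in this thread.
TAG_PATTERN = re.compile(
    r"^(?P<pytorch>\d+(?:\.\d+)*)"
    r"-transformers(?P<transformers>\d+(?:\.\d+)*)"
    r"-(?P<device>cpu|gpu)"
    r"-(?P<python>py\d+)"
)


def parse_training_image(image_uri):
    """Split an image URI into repository name and the versions found in its tag."""
    repository, _, tag = image_uri.rpartition(":")
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"Unrecognised tag format: {tag!r}")
    return {"repository": repository.rsplit("/", 1)[-1], **match.groupdict()}


image = (
    "763104351884.dkr.ecr.eu-west-1.amazonaws.com/"
    "huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04"
)
info = parse_training_image(image)
print(info)
```

The tag does not include the datasets version; that has to be read from the container build spec linked earlier in the thread.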
I don't have an output yet because the training is still running. I hope it was just the datasets version. I will let you know the result once it has finished.
In the meantime, here is the definition of my pipeline, in case I have missed something.
{'Version': '2020-12-01',
'Metadata': {},
'Parameters': [{'Name': 'ModelId',
'Type': 'String',
'DefaultValue': 'distilbert-base-uncased'},
{'Name': 'DatasetName', 'Type': 'String', 'DefaultValue': 'emotion'},
{'Name': 'ProcessingInstanceType',
'Type': 'String',
'DefaultValue': 'ml.c5.2xlarge'},
{'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
{'Name': 'ProcessingScript',
'Type': 'String',
'DefaultValue': './scripts/preprocessing.py'},
{'Name': 'TrainingEntryPoint', 'Type': 'String', 'DefaultValue': 'train.py'},
{'Name': 'TrainingSourceDir', 'Type': 'String', 'DefaultValue': './scripts'},
{'Name': 'TrainingInstanceType',
'Type': 'String',
'DefaultValue': 'ml.p3.2xlarge'},
{'Name': 'TrainingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
{'Name': 'EvaluationScript',
'Type': 'String',
'DefaultValue': './scripts/evaluate.py'},
{'Name': 'ThresholdAccuracy', 'Type': 'Float', 'DefaultValue': 0.8},
{'Name': 'Epochs', 'Type': 'String', 'DefaultValue': '1'},
{'Name': 'EvalBatchSize', 'Type': 'String', 'DefaultValue': '32'},
{'Name': 'TrainBatchSize', 'Type': 'String', 'DefaultValue': '16'},
{'Name': 'LearningRate', 'Type': 'String', 'DefaultValue': '3e-5'},
{'Name': 'Fp16', 'Type': 'String', 'DefaultValue': 'True'}],
'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
'Steps': [{'Name': 'ProcessDataForTraining',
'Type': 'Processing',
'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.c5.2xlarge',
'InstanceCount': 1,
'VolumeSizeInGB': 30}},
'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
'ContainerArguments': ['--transformers_version',
'4.17.0',
'--pytorch_version',
'1.10.2',
'--model_id',
'distilbert-base-uncased',
'--dataset_name',
'emotion'],
'ContainerEntrypoint': ['python3',
'/opt/ml/processing/input/code/preprocessing.py']},
'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
'ProcessingInputs': [{'InputName': 'code',
'AppManaged': False,
'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/ProcessDataForTraining-86437d43df6eeb597c9c5a3520836925/input/code/preprocessing.py',
'LocalPath': '/opt/ml/processing/input/code',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3DataDistributionType': 'FullyReplicated',
'S3CompressionType': 'None'}}],
'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train',
'AppManaged': False,
'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/train',
'LocalPath': '/opt/ml/processing/train',
'S3UploadMode': 'EndOfJob'}},
{'OutputName': 'test',
'AppManaged': False,
'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
'LocalPath': '/opt/ml/processing/test',
'S3UploadMode': 'EndOfJob'}},
{'OutputName': 'validation',
'AppManaged': False,
'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/data/test',
'LocalPath': '/opt/ml/processing/validation',
'S3UploadMode': 'EndOfJob'}}]}},
'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
{'Name': 'TrainHuggingFaceModel',
'Type': 'Training',
'Arguments': {'AlgorithmSpecification': {'TrainingInputMode': 'File',
'TrainingImage': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
'EnableSageMakerMetricsTimeSeries': True},
'OutputDataConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'},
'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
'ResourceConfig': {'VolumeSizeInGB': 30,
'InstanceCount': 1,
'InstanceType': 'ml.p3.2xlarge'},
'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
'InputDataConfig': [{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri"},
'S3DataDistributionType': 'FullyReplicated'}},
'ChannelName': 'train'},
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
'S3Uri': {'Get': "Steps.ProcessDataForTraining.ProcessingOutputConfig.Outputs['test'].S3Output.S3Uri"},
'S3DataDistributionType': 'FullyReplicated'}},
'ChannelName': 'test'}],
'HyperParameters': {'epochs': {'Get': 'Parameters.Epochs'},
'eval_batch_size': {'Get': 'Parameters.EvalBatchSize'},
'train_batch_size': {'Get': 'Parameters.TrainBatchSize'},
'learning_rate': {'Get': 'Parameters.LearningRate'},
'model_id': {'Get': 'Parameters.ModelId'},
'fp16': {'Get': 'Parameters.Fp16'},
'sagemaker_submit_directory': '"s3://sagemaker-eu-west-1-183512891321/TrainHuggingFaceModel-0a8b7473ba341a719507d482a6891cd9/source/sourcedir.tar.gz"',
'sagemaker_program': '"train.py"',
'sagemaker_container_log_level': '20',
'sagemaker_region': '"eu-west-1"'},
'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/',
'CollectionConfigurations': []},
'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport-1670233247',
'RuleEvaluatorImage': '929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest',
'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}],
'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-eu-west-1-183512891321/'}},
'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'}},
{'Name': 'HuggingfaceEvalLoss',
'Type': 'Processing',
'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
'VolumeSizeInGB': 30}},
'AppSpecification': {'ImageUri': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
'ContainerEntrypoint': ['python3',
'/opt/ml/processing/input/code/evaluate.py']},
'RoleArn': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684',
'ProcessingInputs': [{'InputName': 'input-1',
'AppManaged': False,
'S3Input': {'S3Uri': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'},
'LocalPath': '/opt/ml/processing/model',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3DataDistributionType': 'FullyReplicated',
'S3CompressionType': 'None'}},
{'InputName': 'code',
'AppManaged': False,
'S3Input': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/HuggingfaceEvalLoss-c59f512db1d458dadc4e83437b76244e/input/code/evaluate.py',
'LocalPath': '/opt/ml/processing/input/code',
'S3DataType': 'S3Prefix',
'S3InputMode': 'File',
'S3DataDistributionType': 'FullyReplicated',
'S3CompressionType': 'None'}}],
'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'evaluation',
'AppManaged': False,
'S3Output': {'S3Uri': 's3://sagemaker-eu-west-1-183512891321/hugging-face-pipeline-demo/evaluation_report',
'LocalPath': '/opt/ml/processing/evaluation',
'S3UploadMode': 'EndOfJob'}}]}},
'CacheConfig': {'Enabled': False, 'ExpireAfter': '30d'},
'PropertyFiles': [{'PropertyFileName': 'HuggingFaceEvaluationReport',
'OutputName': 'evaluation',
'FilePath': 'evaluation.json'}]},
{'Name': 'CheckHuggingfaceEvalAccuracy',
'Type': 'Condition',
'Arguments': {'Conditions': [{'Type': 'GreaterThanOrEqualTo',
'LeftValue': {'Std:JsonGet': {'PropertyFile': {'Get': 'Steps.HuggingfaceEvalLoss.PropertyFiles.HuggingFaceEvaluationReport'},
'Path': 'eval_accuracy'}},
'RightValue': {'Get': 'Parameters.ThresholdAccuracy'}}],
'IfSteps': [{'Name': 'HuggingFaceRegisterModel-RegisterModel',
'Type': 'RegisterModel',
'Arguments': {'ModelPackageGroupName': 'HuggingFaceModelPackageGroup',
'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.eu-west-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04',
'Environment': {'SAGEMAKER_PROGRAM': '',
'SAGEMAKER_SUBMIT_DIRECTORY': '',
'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
'SAGEMAKER_REGION': 'eu-west-1'},
'ModelDataUrl': {'Get': 'Steps.TrainHuggingFaceModel.ModelArtifacts.S3ModelArtifacts'}}],
'SupportedContentTypes': ['application/json'],
'SupportedResponseMIMETypes': ['application/json'],
'SupportedRealtimeInferenceInstanceTypes': ['ml.g4dn.xlarge',
'ml.m5.xlarge'],
'SupportedTransformInstanceTypes': ['ml.g4dn.xlarge', 'ml.m5.xlarge']},
'ModelApprovalStatus': 'Approved'}},
{'Name': 'HuggingFaceModelDeployment',
'Type': 'Lambda',
'Arguments': {'model_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
'endpoint_config_name': 'distilbert-base-uncased-emotion12-05-09-39-43',
'endpoint_name': 'distilbert-base-uncased-emotion',
'endpoint_instance_type': 'ml.g4dn.xlarge',
'model_package_arn': {'Get': 'Steps.HuggingFaceRegisterModel-RegisterModel.ModelPackageArn'},
'role': 'arn:aws:iam::183512891321:role/service-role/AmazonSageMaker-ExecutionRole-20221125T161684'},
'FunctionArn': 'arn:aws:lambda:eu-west-1:183512891321:function:sagemaker-pipelines-model-deployment-12-05-09-39-43',
'OutputParameters': [{'OutputName': 'statusCode',
'OutputType': 'String'},
{'OutputName': 'body', 'OutputType': 'String'},
{'OutputName': 'other_key', 'OutputType': 'String'}]}],
'ElseSteps': []}}]}
It works! Thank you @philschmid !
Not sure if this is related to this issue too, but we're getting similar problems on some of our datasets in our SageMaker Pipelines, using various versions of datasets (1.18.4, 2.5.2, 2.7.1, etc.).
The weird thing for us is that it only seems to be happening on some of our HF datasets, but not others. I haven't done a deep dive into the differences in these files yet, but that's my next step. Thought I'd post here in case, though!
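For that kind of comparison, one option is to diff the key structure of the metadata JSON from a dataset that loads against one that fails. The sketch below is only an illustration: `key_paths` and `diff_keys` are hypothetical helpers, and the two sample dictionaries are invented, not taken from any real dataset.

```python
def key_paths(obj, prefix=""):
    """Return the set of dotted key paths found in a nested dict/list structure."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(obj, list):
        for item in obj:
            paths |= key_paths(item, f"{prefix}[]")
    return paths


def diff_keys(working, failing):
    """List key paths that appear in only one of the two structures."""
    a, b = key_paths(working), key_paths(failing)
    return {"only_in_working": sorted(a - b), "only_in_failing": sorted(b - a)}


# Invented sample metadata, for illustration only.
working = {"features": {"text": {"dtype": "string", "id": None}}}
failing = {"features": {"text": {"dtype": "string"}}}

result = diff_keys(working, failing)
print(result)
```

In practice the two inputs would be the parsed contents of the metadata files from a working and a failing dataset.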