Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied
Describe the bug
I am trying out SageMaker Local Mode. One step of the pipeline (inference) fails with Failure message is: PermissionError: [Errno 13] Permission denied: 'filename.json'. The step is expected to create a few JSON files. Note that the directory /opt/ml/processing/output was not created automatically and had to be created manually inside the script. After the step failed I cross-checked: the JSON files had been uploaded to S3. I also confirmed that the JSON files have 777 permissions.
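For context, a minimal sketch of the output handling in inference.py (the filename and the score value are illustrative; only the directory creation and the log lines mirror what is described above):

import json
import logging
import os
import stat

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

OUTPUT_DIR = "/opt/ml/processing/output"

def write_bodyscore(body_score):
    # The output directory is not created automatically in Local Mode,
    # so it has to be created inside the script.
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    logger.info("calculated bodyscore: %s", body_score)

    out_path = os.path.join(OUTPUT_DIR, "filename.json")  # illustrative filename
    with open(out_path, "w") as f:
        json.dump({"bodyscore": body_score}, f)

    # Make the file world-writable and log its mode, matching the
    # "permission for file is: 0o777" line in the logs below.
    os.chmod(out_path, 0o777)
    mode = stat.S_IMODE(os.stat(out_path).st_mode)
    print(f"permission for file is: {oct(mode)}")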
To reproduce
import json

from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.pytorch import PyTorch
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# role, sess, and the pipeline parameters (instance_count, output_path,
# dataset_path, num_epochs, batch_size, learning_rate, ...) are defined earlier.
estimator = PyTorch(
    entry_point="train.py",
    source_dir="code",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1,  # instance_count,
    framework_version='1.8.0',
    py_version='py3',
    output_path=output_path,
    # checkpoint_local_path="/opt/ml/checkpoints",
    # checkpoint_s3_uri=checkpoint_path,
    max_run=4 * 24 * 60 * 60,  # 4 days
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'train Loss: (.*)'},
        {'Name': 'validation:error', 'Regex': 'val Loss: (.*)'}
    ],
    hyperparameters={
        'num_epochs': num_epochs.to_string(),
        'batch_size': batch_size.to_string(),
        'lr': learning_rate.to_string()
    },
    sagemaker_session=sess
)
step_train_model = TrainingStep(
    name="train-bodyscore",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=dataset_path,
        )
    },
)

model_saved_path = step_train_model.properties.ModelArtifacts.S3ModelArtifacts
inference_processor = PyTorchProcessor(
    framework_version="1.8.0",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1,  # instance_count,
    sagemaker_session=sess
)
inference_args = inference_processor.get_run_args(
    inputs=[
        ProcessingInput(
            input_name='models',
            source=model_saved_path,
            destination="/opt/ml/processing/models"
        ),
        ProcessingInput(
            input_name="input-images",
            source=dataset_path,
            destination="/opt/ml/processing/input"
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="inference-output",
            source="/opt/ml/processing/output"
        )
    ],
    code='inference.py',
    source_dir='code',
)
step_inference = ProcessingStep(
    name="inference-bodyscore",
    processor=inference_processor,
    # step_args=processor_args,
    inputs=inference_args.inputs,
    outputs=inference_args.outputs,
    job_arguments=inference_args.arguments,
    code=inference_args.code,
    # property_files=[evaluation_report]
)
pipeline = Pipeline(
    name="tlnk-bodyscore-pipeline",
    parameters=[
        instance_count,
        instance_type_inference,
        instance_type_processing,
        checkpoint_path,
        output_path,
        dataset_path,
        num_epochs,
        batch_size,
        learning_rate,
    ],
    steps=[
        # step_data_analysis,
        step_train_model,
        step_inference,
        # step_visualize,
        # step_evaluate,
    ],
    sagemaker_session=sess
)
definition = json.loads(pipeline.definition())
print(json.dumps(definition, indent=2))
pipeline.upsert(role_arn=role)
execution = pipeline.start()
# execution.wait()
Logs
7dc4or5zgg-algo-1-ink52 | INFO:__main__:calculated bodyscore: 3
7dc4or5zgg-algo-1-ink52 | permission for file is: 0o777
7dc4or5zgg-algo-1-ink52 exited with code 0
Aborting on container exit...
Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied: 'filename.json'
Pipeline execution xyzabc FAILED because step 'inference-bodyscore' failed.
I am not sure what kind of permission is being denied here. Is it the permission to copy the files locally somewhere, or something related to AWS (though I believe I have granted all the necessary permissions)?
I was using an older version of the SageMaker Python SDK. The issue has been fixed in the latest version: aws/sagemaker-python-sdk#2647
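For anyone hitting the same error: upgrading the SageMaker Python SDK to a release that includes the fix linked above should resolve it, e.g.:

pip install --upgrade sagemaker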