aws-samples/amazon-sagemaker-local-mode

Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied

Describe the bug
I am trying SageMaker local mode.
One step of the pipeline (inference) fails with the failure message PermissionError: [Errno 13] Permission denied: 'filename.json'. The step is expected to create a few JSON files. Note that the directory /opt/ml/processing/output was not created automatically and had to be created manually inside the script. After the step failed, I cross-checked and found that the JSON files had been uploaded to S3. I also confirmed that the JSON files have 777 permissions.
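
For reference, a minimal sketch of how a processing script can create the output directory itself and write a JSON result. The helper name `write_result` and the payload are hypothetical; only the `/opt/ml/processing/output` path and the permission printout come from this report.

```python
import json
import os
import stat
import tempfile


def write_result(output_dir, filename, payload):
    """Create output_dir if needed, write payload as JSON, return the file path."""
    # The directory was not created automatically in this report, so create it here.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(payload, f)
    # Same kind of check as the "permission for file is" line in the logs.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    print("permission for file is: ", oct(mode))
    return path


if __name__ == "__main__":
    # Inside the container this would be /opt/ml/processing/output;
    # a temporary directory is used so the sketch runs anywhere.
    out_dir = os.path.join(tempfile.mkdtemp(), "output")
    write_result(out_dir, "filename.json", {"bodyscore": 3})
```

Using `exist_ok=True` keeps the script safe to run whether or not the directory already exists.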

To reproduce

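import json

from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.pytorch import PyTorch
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

# role, sess, and the pipeline parameters used below are defined elsewhere (not shown in this report).
estimator = PyTorch(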
    entry_point="train.py",
    source_dir="code",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1, #instance_count,
    framework_version='1.8.0',
    py_version='py3',
    output_path=output_path,
    # checkpoint_local_path="/opt/ml/checkpoints",
    # checkpoint_s3_uri=checkpoint_path,
    max_run=4 * 24 * 60 * 60,  # 4 days, in seconds
    metric_definitions=[
       {'Name': 'train:error', 'Regex': 'train Loss: (.*)'},
       {'Name': 'validation:error', 'Regex': 'val Loss: (.*)'}
    ],
    hyperparameters={
        'num_epochs': num_epochs.to_string(),
        'batch_size': batch_size.to_string(),
        'lr': learning_rate.to_string()
    },
    sagemaker_session=sess
)

step_train_model = TrainingStep(
    name="train-bodyscore",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=dataset_path,
        )
    },
)

model_saved_path = step_train_model.properties.ModelArtifacts.S3ModelArtifacts

inference_processor = PyTorchProcessor(
    framework_version="1.8.0",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1, #instance_count,
    sagemaker_session=sess
)

inference_args = inference_processor.get_run_args(
    inputs=[
        ProcessingInput(
            input_name="models",
            source=model_saved_path,
            destination="/opt/ml/processing/models",
        ),
        ProcessingInput(
            input_name="input-images",
            source=dataset_path,
            destination="/opt/ml/processing/input",
        ),
    ],
    outputs=[
        ProcessingOutput(
            output_name="inference-output",
            source="/opt/ml/processing/output",
        )
    ],
    code="inference.py",
    source_dir="code",
)  # no trailing comma, so this is the run-args object itself, not a 1-tuple

step_inference = ProcessingStep(
    name="inference-bodyscore",
    processor=inference_processor,
    # step_args=processor_args,
    inputs=inference_args.inputs,
    outputs=inference_args.outputs,
    job_arguments=inference_args.arguments,
    code=inference_args.code,
    # property_files=[evaluation_report]
)

pipeline = Pipeline(
    name="tlnk-bodyscore-pipeline",
    parameters=[
        instance_count,
        instance_type_inference,
        instance_type_processing,
        checkpoint_path,
        output_path,
        dataset_path,
        num_epochs,
        batch_size,
        learning_rate,
    ],
    steps=[
        # step_data_analysis,
        step_train_model,
        step_inference,
        # step_visualize,
        # step_evaluate,
    ],
    sagemaker_session=sess
)

definition = json.loads(pipeline.definition())
print(json.dumps(definition, indent=2))
pipeline.upsert(role_arn=role)

execution = pipeline.start()
# execution.wait()

Logs

7dc4or5zgg-algo-1-ink52 | INFO:__main__:calculated bodyscore: 3
7dc4or5zgg-algo-1-ink52 | permission for file is:  0o777
7dc4or5zgg-algo-1-ink52 exited with code 0
Aborting on container exit...
Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied: 'filename.json'
Pipeline execution xyzabc FAILED because step 'inference-bodyscore' failed.

I am not sure what kind of permission is being denied here. Is it the permission to copy the files to some local path, or something related to AWS (though I believe I have granted all the required permissions)?
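
The logs show the container exiting with code 0 before the PermissionError is reported, which suggests (but does not prove) that the error is raised on the host while local files are being handled, not by S3 or IAM. A minimal sketch for checking this, assuming you can locate the host directory that local mode used for the step; that path is not shown in this report, so `root` is a placeholder and `describe_access` is a hypothetical helper:

```python
import os
import stat
import tempfile


def describe_access(root):
    """Return (path, owner_uid, mode, readable, writable) for each file under root."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            rows.append((
                path,
                st.st_uid,                      # uid that owns the file
                oct(stat.S_IMODE(st.st_mode)),  # permission bits, e.g. '0o644'
                os.access(path, os.R_OK),       # can the current user read it?
                os.access(path, os.W_OK),       # can the current user write it?
            ))
    return rows


if __name__ == "__main__":
    # Demo on a throwaway directory; point root at the real output directory instead.
    root = tempfile.mkdtemp()
    with open(os.path.join(root, "filename.json"), "w") as f:
        f.write("{}")
    print("current uid:", os.getuid())  # POSIX only
    for row in describe_access(root):
        print(row)
```

If a file is owned by a different uid than the current user (for example, written by root inside the container), that points to a local filesystem permission problem and not an AWS one. Note that the permissions of the parent directory also matter when a file is deleted or moved.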

I was using an older version of the sagemaker SDK. The issue has been fixed in the latest version: aws/sagemaker-python-sdk#2647
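
To check which SDK version is installed before upgrading, a small sketch. The helper name `installed_version` is made up for illustration, and this report does not say which release contains the fix, so no minimum version is checked here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("sagemaker version:", installed_version("sagemaker"))
    # To upgrade to the latest release: pip install --upgrade sagemaker
```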