**Failed MLPipeline**
franckess opened this issue · 5 comments
Hi @jessieweiyi
I tried to replicate your pipeline in an AWS environment; however, I hit a point of failure at the MLPipeline
step (see screenshots below).
Looking at the logs via CloudWatch, I can see this error message:

```
/miniconda3/bin/python _repack_model.py --dependencies --inference_script transform.py --model_archive s3://mlopsinfrastracturestack-sagemakerconstructsagema-gq7lhy69zuk9/PreprocessData-d7bd2a0ff50809ca886dd3b12220b78a/output/model --source_dir
Traceback (most recent call last):
  File "_repack_model.py", line 109, in <module>
    model_archive=args.model_archive,
  File "_repack_model.py", line 55, in repack
    shutil.copy2(model_path, local_path)
  File "/miniconda3/lib/python3.7/shutil.py", line 266, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/miniconda3/lib/python3.7/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/data/training/model'
2022-05-02 04:49:20,281 sagemaker-containers ERROR Reporting training FAILURE
```
What am I missing here?
Thank you for your help
Hi @franckess thank you for reporting this. I will have a look today or tomorrow.
@jessieweiyi thanks for the prompt reply.
FYI, I am using VS Code as my development tool and Bitbucket as my repo for CI/CD.
Thanks
Hi Jessie, thanks for putting this together. I'm working with Rene on it, and we both get the same result. It looks like a problem in the sklearn repack step.
Hi @rdkls , @franckess,
Thank you for the update.
I confirmed that I can reproduce the same error on my side. Working on triaging the issue.
That's exactly what we found while debugging the error message.
```python
model_data=Join(on='/', values=[
    step_process.properties.ProcessingOutputConfig.Outputs["model"].S3Output.S3Uri,
    "model.tar.gz",
]),
```
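For anyone hitting the same traceback: the processing step's `S3Output.S3Uri` resolves to an output *prefix*, while the repack step needs the full path to `model.tar.gz`; joining the archive name onto the prefix is what makes the fix work. Below is a minimal plain-Python sketch of that join, for illustration only. The helper name `model_data_uri` is hypothetical and not part of the sagemaker SDK; at pipeline build time you would use `sagemaker.workflow.functions.Join` as in the snippet above, since the property is only resolved at execution time.

```python
def model_data_uri(processing_output_s3_uri: str, archive: str = "model.tar.gz") -> str:
    """Append the archive file name to a processing step's output prefix.

    Mirrors what Join(on='/', values=[<S3Uri property>, "model.tar.gz"])
    produces once the pipeline property resolves to a concrete S3 prefix.
    """
    # Strip any trailing slash so we never emit a double separator.
    return "/".join([processing_output_s3_uri.rstrip("/"), archive])

# Hypothetical bucket/prefix, shaped like the one in the CloudWatch log:
prefix = "s3://my-bucket/PreprocessData-abc123/output/model"
print(model_data_uri(prefix))
# → s3://my-bucket/PreprocessData-abc123/output/model/model.tar.gz
```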
@jessieweiyi thanks for fixing the issue.
Have a good one!