validate_smddprun() fails with FileNotFoundError on AL2023
jimmyrigby94 opened this issue · 1 comments
Describe the bug
Training jobs that use a custom Docker image built on AL2023 fail because of a subprocess error that is not caught by any exception handler.
SageMaker Training Job Error
AlgorithmError: Framework Error:
Traceback (most recent call last):
  File "/usr/local/lib64/python3.9/site-packages/sagemaker_training/trainer.py", line 70, in train
    env = environment.Environment()
  File "/usr/local/lib64/python3.9/site-packages/sagemaker_training/environment.py", line 690, in __init__
    self._is_smddprun_installed = validate_smddprun()
  File "/usr/local/lib64/python3.9/site-packages/sagemaker_training/environment.py", line 369, in validate_smddprun
    output = subprocess.run(
  File "/usr/lib64/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'which'
[Errno 2] No such file or directory: 'which', exit code: 2
To reproduce
```
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
RUN yum install --assumeyes python3-pip python-devel gcc && \
    pip install setuptools && \
    pip install sagemaker-training
```
docker build .
docker run -it {container-id} /bin/bash
python3
```python
from sagemaker_training import environment
environment.validate_smddprun()
```
Output:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/sagemaker_training/environment.py", line 369, in validate_smddprun
    output = subprocess.run(
  File "/usr/lib64/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib64/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'which'
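The traceback reflects general `subprocess` behavior rather than anything SageMaker-specific: when the executable itself is missing, `Popen` raises `FileNotFoundError` before any process starts, so a handler that only catches `subprocess.CalledProcessError` never sees it. Here is a minimal sketch of the two failure modes. The helper `probe` and the command name `no-such-binary-xyz` are made up for illustration, and the latter is assumed not to exist on the machine.

```python
import subprocess
import sys


def probe(command):
    """Run a command and report which kind of failure (if any) occurred."""
    try:
        subprocess.run(command, capture_output=True, text=True, check=True)
        return "ok"
    except subprocess.CalledProcessError:
        # The executable was found and ran, but exited with a nonzero status.
        return "nonzero exit"
    except FileNotFoundError:
        # The executable could not be found, so no process was started.
        return "executable missing"


# Hypothetical command name, assumed not to be installed anywhere on PATH.
print(probe(["no-such-binary-xyz", "smddprun"]))

# A command that exists but fails: the current interpreter exiting with status 3.
print(probe([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

On an image without `which`, the lookup for `smddprun` falls into the "executable missing" case, which is the error shown in the output.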
Changing the `subprocess.run` call to pass `shell=True` resolves this error.
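Another option, which avoids depending on an external `which` binary at all, is Python's standard `shutil.which`, which searches `PATH` in-process. The following is a hypothetical sketch, not the implementation in `sagemaker_training`. The function name `is_executable_on_path` is made up for illustration.

```python
import shutil


def is_executable_on_path(name):
    """Return True if `name` resolves to an executable on PATH.

    Relies on shutil.which, so no external `which` binary is required.
    """
    return shutil.which(name) is not None


# Prints True or False depending on whether smddprun is installed in the image.
print(is_executable_on_path("smddprun"))
```

Because the lookup happens inside the Python process, a missing tool yields `False` instead of an exception.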
Expected behavior
The subprocess call should not cause the training job to fail on AL2023.
System information
See the Dockerfile in the reproduction steps above.
Should have realized this earlier, but explicitly installing `which` as a system dependency resolves the error. I had assumed `which` was installed by default in the base image.
FROM public.ecr.aws/amazonlinux/amazonlinux:2023
RUN yum install --assumeyes python3-pip python-devel gcc which && \
    pip install setuptools && \
    pip install sagemaker-training