SageMaker Local Mode does not inject default environment variables
Closed this issue · 9 comments
Describe the bug
Dear colleagues, I am trying to run BYOC (bring your own container) locally on an EC2 instance where we have installed VS Code Server. However, I am getting an error when using the argument parser to read the default environment variables of the training toolkit.
To reproduce
My Dockerfile:
FROM python:3.7 AS build
COPY ./code/requirements.txt .
RUN python3 -m pip install --upgrade pip && pip install -r ./requirements.txt
FROM gcr.io/distroless/python3-debian10
COPY --from=build /usr/local/lib/python3.7/site-packages/ /usr/lib/python3.7/.
COPY . /opt/ml/
WORKDIR /opt/ml/code
ENTRYPOINT ["python", "app.py"]
And my requirements:
pandas==1.1.5
numpy==1.19.2
boto3==1.17.28
awscli==1.19.39
joblib==1.0.1
sagemaker-training==3.9.2
The arguments in my entrypoint are as follows, per the documentation in the training toolkit: https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument("--parameters", default=os.environ['SM_HPS'])
    parser.add_argument("--data_folder", type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument("--output_folder", type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    args, _ = parser.parse_known_args()

    model_dir = args.model_dir
    parameters = eval(args.parameters)  # force conversion to dictionary
    data_folder = args.data_folder
    output_folder = args.output_folder

    run_training()
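For reference, a more defensive version of this parsing (just a sketch, assuming the conventional /opt/ml/model, /opt/ml/input/data/train and /opt/ml/output/data container paths as fallbacks) fails more gracefully when the SM_* variables are absent:

import argparse
import json
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Fall back to the conventional /opt/ml locations if the toolkit variables are not set.
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--parameters',
                        default=os.environ.get('SM_HPS', '{}'))
    parser.add_argument('--data_folder', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--output_folder', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data'))
    args, _ = parser.parse_known_args()

    # SM_HPS is JSON-encoded, so json.loads is safer than eval here.
    parameters = json.loads(args.parameters)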
My notebook launcher code:
my_session = boto3.session.Session(region_name=AWS_DEFAULT_REGION)
sagemaker_session = LocalSession(boto_session=my_session)
sagemaker_session.config = {'local': {'local_code': True}}
print("Execution ARN ROLE: "+ boto3.client('sts').get_caller_identity().get('Arn'))
execution_role = sagemaker_session.get_caller_identity_arn()
print("Sagemaker ARN ROLE: "+ execution_role)
print("Start training")
local_estimator = Estimator(image_uri='local_image:latest',
                            role=execution_role,
                            sagemaker_session=sagemaker_session,
                            instance_count=1,
                            hyperparameters=hyperparameters,
                            instance_type="local")
local_train = '../input/data/training/preprocessing.csv'
train_location = 'file://'+local_train
local_estimator.fit({'train':train_location}, logs=True)
Screenshots or logs
I am getting the following error:
Attaching to 9r2vybp4ek-algo-1-tbsfy
9r2vybp4ek-algo-1-tbsfy | Traceback (most recent call last):
9r2vybp4ek-algo-1-tbsfy | File "app.py", line 132, in <module>
9r2vybp4ek-algo-1-tbsfy | parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
9r2vybp4ek-algo-1-tbsfy | File "/usr/lib/python3.7/os.py", line 678, in __getitem__
9r2vybp4ek-algo-1-tbsfy | raise KeyError(key) from None
9r2vybp4ek-algo-1-tbsfy | KeyError: 'SM_MODEL_DIR'
9r2vybp4ek-algo-1-tbsfy exited with code 1
Aborting on container exit...
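To confirm what the container actually receives, a quick check can be dropped at the top of the entrypoint (a sketch; it just prints whatever SageMaker-related variables happen to be set):

import os

# Print every SM_* / SAGEMAKER_* variable that is actually present in the container's environment.
for key, value in sorted(os.environ.items()):
    if key.startswith(('SM_', 'SAGEMAKER_')):
        print(f'{key}={value}')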
System information
A description of your system. Please provide:
- SageMaker Python SDK version: sagemaker 2.42.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): Custom Container
- Python version: 3.7.10
- CPU or GPU: CPU
- Custom Docker image (Y/N): Y
Should be resolved with aws/sagemaker-python-sdk#3015?
I am no longer using AWS (I am currently using Azure). I have contacted my old team mates to check on this. Maybe it is useful to them.
BR
E
This was a bug in sagemaker-python-sdk and has been resolved with aws/sagemaker-python-sdk#3015
This issue was reported on python-sdk repository aws/sagemaker-python-sdk#2930
I am still encountering this bug, both in local and remote mode.
Passing hyperparameters to the Estimator object doesn't seem to affect anything.
My notebook:
# S3 prefix
prefix = "DEMO-scikit-byo-iris"
# Define IAM role
import boto3
import re
import os
import json
import numpy as np
import pandas as pd
import sagemaker as sage
from sagemaker import get_execution_role
role = get_execution_role()
sess = sage.Session()
# [ ... ] build and register container
WORK_DIRECTORY = "data"
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "hp1": "value1",
    "hp2": 300,
    "hp3": 0.001})
sess = sage.Session()
account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-my_proj:latest".format(account, region)
tree = sage.estimator.Estimator(
    image,
    role,
    1,
    "ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)
tree.fit(data_location)
My train file:
#!/usr/bin/env python
from __future__ import print_function
import json
import os
import pickle
import sys
import traceback
import argparse
import pandas as pd
from sklearn import tree
prefix = '/opt/ml/'
input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
channel_name='training'
training_path = os.path.join(input_path, channel_name)
# The function to execute the training.
def train(hp1, hp2, hp3):
    print('Starting the training.')
    print(hp1)
    print(hp2)
    print(hp3)
    try:
        # [ ... ] load, train, save the model
        print('Training complete.')
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--hp1', type=str, default="value0")
    parser.add_argument("--hp2", type=int, default=150)
    parser.add_argument("--hp3", type=float)
    args, _ = parser.parse_known_args()

    hp1 = args.hp1
    hp2 = args.hp2
    hp3 = args.hp3
    train(hp1, hp2, hp3)

    # A zero exit code causes the job to be marked as Succeeded.
    sys.exit(0)
This is the output I get from running the notebook:
Starting the training.
value0
150
None
Training complete.
As you can see, hyperparameters=hyperparameters did nothing to change the arguments; they remain at their default values (in the case of hp3, no default value was given, so it is None).
Any help would be appreciated.
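(Side note, in case it helps: with a plain BYOC image that does not route through the sagemaker-training toolkit, SageMaker delivers hyperparameters in /opt/ml/input/config/hyperparameters.json rather than as command-line arguments, so a train script like the one above can read them from there. A minimal sketch, assuming that standard path and the JSON encoding done by json_encode_hyperparameters in the notebook:)

import json
import os

param_path = '/opt/ml/input/config/hyperparameters.json'

if os.path.exists(param_path):
    with open(param_path, 'r') as f:
        raw = json.load(f)  # values arrive as strings
    # Undo the json.dumps applied in the notebook; fall back to the same defaults as argparse.
    hp1 = json.loads(raw.get('hp1', '"value0"'))
    hp2 = json.loads(raw.get('hp2', '150'))
    hp3 = json.loads(raw.get('hp3', 'null'))
    print(hp1, hp2, hp3)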
The issue reported here and the mentioned PR are not related. The PR allows passing arbitrary environment variables to a job, whereas the reporter complained about the usual SM_* variables not getting injected into the job/docker environment. So far, my experience is the same: documented env vars such as SM_MODEL_DIR do not seem to be defined in the Docker container's environment when the entrypoint is called.
@satishpasumarthi Could you consider re-opening this, as I am still experiencing it with bring-your-own-container? The usual SM_* variables are not injected properly (and are therefore not accessible from inside the training script). Using hyperparameters may work for small things, but for a big JSON-encoded dictionary like SM_TRAINING_ENV this workaround isn't practical.
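(For anyone hitting this in the meantime, one possible stopgap is to set the variables yourself on the estimator. A sketch only, assuming an SDK version recent enough to expose the environment argument referenced in the PR above, and the conventional /opt/ml paths:)

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='local_image:latest',
    role=execution_role,
    instance_count=1,
    instance_type='local',
    # Manually inject the variables the training toolkit would normally define.
    environment={
        'SM_MODEL_DIR': '/opt/ml/model',
        'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
        'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data',
    },
)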
We're also experiencing this with a BYOC container; we would love to have it solved!
Hello,
I am encountering this bug when using remote mode!
The default environment variables don't seem to be injected; is there any solution to this?
Thanks