aws/sagemaker-training-toolkit

SageMaker Local Mode does not Inject default environment variables

Closed this issue · 9 comments

edgBR commented

Describe the bug

Dear colleagues, I am trying to run BYOC (Bring Your Own Container) locally on an EC2 instance where we have installed VS Code Server. However, I am getting an error when using the argument parser to read the training toolkit's default environment variables.

To reproduce

My Dockerfile:

FROM python:3.7 AS build
COPY ./code/requirements.txt .
RUN python3 -m pip install --upgrade pip && pip install -r ./requirements.txt

FROM gcr.io/distroless/python3-debian10

COPY --from=build /usr/local/lib/python3.7/site-packages/  /usr/lib/python3.7/.
COPY . /opt/ml/
WORKDIR /opt/ml/code

ENTRYPOINT ["python", "app.py"]

And my requirements:

pandas==1.1.5
numpy==1.19.2
boto3==1.17.28
awscli==1.19.39
joblib==1.0.1
sagemaker-training==3.9.2

The arguments in my entrypoint are as follows, according to the documentation in the training toolkit: https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md

import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument("--parameters", default=os.environ['SM_HPS'])
    parser.add_argument("--data_folder", type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument("--output_folder", type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    
    args, _ = parser.parse_known_args()
    model_dir = args.model_dir
    parameters = eval(args.parameters)  # force conversion to dictionary
    data_folder = args.data_folder
    output_folder = args.output_folder

    run_training()
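As a side note, when the SM_* variables might be missing (as they are here under Local Mode), a more defensive version of the same parser can fall back to the standard SageMaker container paths via os.environ.get. This is only a sketch of a workaround, not how the toolkit behaves; the fallback paths below are the conventional /opt/ml locations that the SM_* variables normally point to:

import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Fall back to the conventional SageMaker container paths when the
    # SM_* variables are not injected (e.g. under Local Mode).
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--parameters',
                        default=os.environ.get('SM_HPS', '{}'))
    parser.add_argument('--data_folder', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--output_folder', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data'))

    args, _ = parser.parse_known_args()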

My notebook launcher code:

import boto3
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession

my_session = boto3.session.Session(region_name=AWS_DEFAULT_REGION)
sagemaker_session = LocalSession(boto_session=my_session)
sagemaker_session.config = {'local': {'local_code': True}}
print("Execution ARN ROLE: "+ boto3.client('sts').get_caller_identity().get('Arn'))

execution_role = sagemaker_session.get_caller_identity_arn()
print("Sagemaker ARN ROLE: "+ execution_role)

print("Start training")
local_estimator = Estimator(image_uri='local_image:latest',
                      role = execution_role,
                      sagemaker_session = sagemaker_session, 
                      instance_count=1,
                      hyperparameters=hyperparameters,
                      instance_type="local")

local_train = '../input/data/training/preprocessing.csv'
train_location = 'file://'+local_train
local_estimator.fit({'train':train_location}, logs=True)

Screenshots or logs

When launching the job, I get the following error:

Attaching to 9r2vybp4ek-algo-1-tbsfy
9r2vybp4ek-algo-1-tbsfy | Traceback (most recent call last):
9r2vybp4ek-algo-1-tbsfy |   File "app.py", line 132, in <module>
9r2vybp4ek-algo-1-tbsfy |     parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
9r2vybp4ek-algo-1-tbsfy |   File "/usr/lib/python3.7/os.py", line 678, in __getitem__
9r2vybp4ek-algo-1-tbsfy |     raise KeyError(key) from None
9r2vybp4ek-algo-1-tbsfy | KeyError: 'SM_MODEL_DIR'
9r2vybp4ek-algo-1-tbsfy exited with code 1
Aborting on container exit...

System information

A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.42.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Custom Container
  • Python version: 3.7.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Should be resolved with aws/sagemaker-python-sdk#3015?

edgBR commented

Hi @dmagas-auto1

I am no longer using AWS (I'm currently using Azure). I have contacted my old teammates to check on this. Maybe it is useful to them.

BR
E

This was a bug in sagemaker-python-sdk and has been resolved with aws/sagemaker-python-sdk#3015.
This issue was reported on the python-sdk repository: aws/sagemaker-python-sdk#2930.

I am still encountering this bug, both in local and remote mode.

Passing hyperparameters to the Estimator object doesn't seem to affect anything.

My notebook:

# S3 prefix
prefix = "DEMO-scikit-byo-iris"

# Define IAM role
import boto3
import re

import os
import json
import numpy as np
import pandas as pd
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

# [ ... ] build and register container

WORK_DIRECTORY = "data"
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "hp1": "value1",
    "hp2": 300,
    "hp3": 0.001})

sess = sage.Session()
account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-my_proj:latest".format(account, region)

tree = sage.estimator.Estimator(
    image,
    role,
    1,
    "ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

tree.fit(data_location)

My train file:

#!/usr/bin/env python

from __future__ import print_function

import json
import os
import pickle
import sys
import traceback
import argparse

import pandas as pd
from sklearn import tree

prefix = '/opt/ml/'

input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')

channel_name='training'
training_path = os.path.join(input_path, channel_name)

# The function to execute the training.
def train(hp1, hp2, hp3):
    print('Starting the training.')
    print(hp1)
    print(hp2)
    print(hp3)
    try:
        # [ ... ] load, train, save the model
        print('Training complete.')
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--hp1', type=str, default="value0")
    parser.add_argument("--hp2", type=int, default=150)
    parser.add_argument("--hp3", type=float)
    
    args, _ = parser.parse_known_args()
    
    hp1 = args.hp1
    hp2 = args.hp2
    hp3 = args.hp3
    
    train(hp1, hp2, hp3)

    # A zero exit code causes the job to be marked a Succeeded.
    sys.exit(0)

This is the output I get from running the notebook:

Starting the training.
value0
150
None
Training complete.

As you can see, hyperparameters=hyperparameters did nothing to change the arguments -- they remain at their default values (in the case of hp3, no default value was given, so it is None).

Any help would be appreciated.
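For what it's worth, in a BYOC image that does not install the sagemaker-training toolkit, hyperparameters are not converted into command-line arguments at all; SageMaker writes them, as strings, to /opt/ml/input/config/hyperparameters.json inside the container. A minimal sketch of reading them from there, assuming the json-encoded values produced by the notebook above:

import json
import os

HYPERPARAMETERS_PATH = '/opt/ml/input/config/hyperparameters.json'

def load_hyperparameters():
    # SageMaker stores every hyperparameter value as a string; values that were
    # json.dumps()-encoded on the notebook side need a second json.loads() here.
    if not os.path.exists(HYPERPARAMETERS_PATH):
        return {}
    with open(HYPERPARAMETERS_PATH) as f:
        raw = json.load(f)
    return {key: json.loads(value) for key, value in raw.items()}

hps = load_hyperparameters()
hp1 = hps.get('hp1', 'value0')
hp2 = hps.get('hp2', 150)
hp3 = hps.get('hp3')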

The issue reported here and the mentioned PR are not related. The PR allows passing arbitrary environment variables to a job, whereas the reporter complained about the usual SM_* variables not getting injected into the job/docker environment. So far, my experience is the same: the documented env vars, such as SM_MODEL_DIR, do not seem to be defined in the Docker container's environment when the entrypoint is called.

@satishpasumarthi Could you consider re-opening this, as I am still experiencing it with bring-your-own-container? The usual SM_* variables are not injected properly (and therefore not accessible from inside the training script). Using hyperparameters may work for small things, but for a big JSON-encoded dictionary like SM_TRAINING_ENV this workaround isn't practical.
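As a stopgap while this is open, the environment argument on the Estimator (the mechanism the PR above adds support for) can be used to set the variables explicitly. This is only a sketch, reusing the names from the launcher snippet at the top of the thread and assuming an SDK version that includes aws/sagemaker-python-sdk#3015:

from sagemaker.estimator import Estimator

# Explicitly inject the variables the training script expects. The values below
# mirror the conventional SageMaker container paths; treat this only as a
# temporary workaround until the SM_* variables are injected automatically.
local_estimator = Estimator(
    image_uri='local_image:latest',
    role=execution_role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type='local',
    hyperparameters=hyperparameters,
    environment={
        'SM_MODEL_DIR': '/opt/ml/model',
        'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
        'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data',
    },
)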

We're also experiencing this with a BYOC container; we would love to have it solved!

Hello,
I am encountering this bug when using remote mode!
The default environment variables don't seem to be injected; is there any solution to this?
Thanks