aws/sagemaker-training-toolkit

SageMaker Local Mode does not Inject default environment variables

Closed this issue · 9 comments

edgBR commented

Describe the bug

Dear colleagues, I am trying to run BYOC (Bring Your Own Container) locally on an EC2 instance where we have installed VS Code Server. However, I am getting an error when using the argument parser to read the training toolkit's default environment variables.

To reproduce

My Dockerfile:

FROM python:3.7 AS build
COPY ./code/requirements.txt .
RUN python3 -m pip install --upgrade pip && pip install -r ./requirements.txt

FROM gcr.io/distroless/python3-debian10

COPY --from=build /usr/local/lib/python3.7/site-packages/  /usr/lib/python3.7/.
COPY . /opt/ml/
WORKDIR /opt/ml/code

ENTRYPOINT ["python", "app.py"]

And my requirements:

pandas==1.1.5
numpy==1.19.2
boto3==1.17.28
awscli==1.19.39
joblib==1.0.1
sagemaker-training==3.9.2

The arguments in my entrypoint are as follows, according to the documentation in the training toolkit: https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md

import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument("--parameters", default=os.environ['SM_HPS'])
    parser.add_argument("--data_folder", type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument("--output_folder", type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    
    args, _ = parser.parse_known_args()
    model_dir = args.model_dir
    parameters = eval(args.parameters)  # force conversion to dictionary
    data_folder = args.data_folder
    output_folder = args.output_folder

    run_training()
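As a side note, when the SM_* variables might be missing (as they are here under Local Mode), a more defensive version of the same parser can fall back to the standard SageMaker container paths via os.environ.get. This is only a sketch of a workaround, not how the toolkit behaves; the fallback paths below are the conventional /opt/ml locations that the SM_* variables normally point to:

import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Fall back to the conventional SageMaker container paths when the
    # SM_* variables are not injected (e.g. under Local Mode).
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--parameters',
                        default=os.environ.get('SM_HPS', '{}'))
    parser.add_argument('--data_folder', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--output_folder', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data'))

    args, _ = parser.parse_known_args()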

My notebook launcher code:

import boto3
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession

my_session = boto3.session.Session(region_name=AWS_DEFAULT_REGION)
sagemaker_session = LocalSession(boto_session=my_session)
sagemaker_session.config = {'local': {'local_code': True}}
print("Execution ARN ROLE: "+ boto3.client('sts').get_caller_identity().get('Arn'))

execution_role = sagemaker_session.get_caller_identity_arn()
print("Sagemaker ARN ROLE: "+ execution_role)

print("Start training")
local_estimator = Estimator(image_uri='local_image:latest',
                      role = execution_role,
                      sagemaker_session = sagemaker_session, 
                      instance_count=1,
                      hyperparameters=hyperparameters,
                      instance_type="local")

local_train = '../input/data/training/preprocessing.csv'
train_location = 'file://'+local_train
local_estimator.fit({'train':train_location}, logs=True)

Screenshots or logs

When launching the job, I get the following error:

Attaching to 9r2vybp4ek-algo-1-tbsfy
9r2vybp4ek-algo-1-tbsfy | Traceback (most recent call last):
9r2vybp4ek-algo-1-tbsfy |   File "app.py", line 132, in <module>
9r2vybp4ek-algo-1-tbsfy |     parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
9r2vybp4ek-algo-1-tbsfy |   File "/usr/lib/python3.7/os.py", line 678, in __getitem__
9r2vybp4ek-algo-1-tbsfy |     raise KeyError(key) from None
9r2vybp4ek-algo-1-tbsfy | KeyError: 'SM_MODEL_DIR'
9r2vybp4ek-algo-1-tbsfy exited with code 1
Aborting on container exit...

System information

A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.42.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Custom Container
  • Python version: 3.7.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Should be resolved with aws/sagemaker-python-sdk#3015?

edgBR commented

Hi @dmagas-auto1

I am no longer using AWS (I'm currently using Azure). I have contacted my old teammates to check on this. Maybe it is useful to them.

BR
E

This was a bug in sagemaker-python-sdk and has been resolved with aws/sagemaker-python-sdk#3015.
This issue was reported on the python-sdk repository: aws/sagemaker-python-sdk#2930.

I am still encountering this bug, both in local and remote mode.

Passing hyperparameters to the Estimator object doesn't seem to affect anything.

My notebook:

# S3 prefix
prefix = "DEMO-scikit-byo-iris"

# Define IAM role
import boto3
import re

import os
import json
import numpy as np
import pandas as pd
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sess = sage.Session()

# [ ... ] build and register container

WORK_DIRECTORY = "data"
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "hp1": "value1",
    "hp2": 300,
    "hp3": 0.001})

sess = sage.Session()
account = sess.boto_session.client("sts").get_caller_identity()["Account"]
region = sess.boto_session.region_name
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-my_proj:latest".format(account, region)

tree = sage.estimator.Estimator(
    image,
    role,
    1,
    "ml.c4.2xlarge",
    output_path="s3://{}/output".format(sess.default_bucket()),
    sagemaker_session=sess,
    hyperparameters=hyperparameters
)

tree.fit(data_location)

My train file:

#!/usr/bin/env python

from __future__ import print_function

import json
import os
import pickle
import sys
import traceback
import argparse

import pandas as pd
from sklearn import tree

prefix = '/opt/ml/'

input_path = prefix + 'input/data'
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')

channel_name='training'
training_path = os.path.join(input_path, channel_name)

# The function to execute the training.
def train(hp1, hp2, hp3):
    print('Starting the training.')
    print(hp1)
    print(hp2)
    print(hp3)
    try:
        # [ ... ] load, train, save the model
        print('Training complete.')
    except Exception as e:
        # Write out an error file. This will be returned as the failureReason in the
        # DescribeTrainingJob result.
        trc = traceback.format_exc()
        with open(os.path.join(output_path, 'failure'), 'w') as s:
            s.write('Exception during training: ' + str(e) + '\n' + trc)
        # Printing this causes the exception to be in the training job logs, as well.
        print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
        # A non-zero exit code causes the training job to be marked as Failed.
        sys.exit(255)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--hp1', type=str, default="value0")
    parser.add_argument("--hp2", type=int, default=150)
    parser.add_argument("--hp3", type=float)
    
    args, _ = parser.parse_known_args()
    
    hp1 = args.hp1
    hp2 = args.hp2
    hp3 = args.hp3
    
    train(hp1, hp2, hp3)

    # A zero exit code causes the job to be marked a Succeeded.
    sys.exit(0)

This is the output I get from running the notebook:

Starting the training.
value0
150
None
Training complete.

As you can see, hyperparameters=hyperparameters did nothing to change the arguments -- they remain at their default values (in the case of hp3, no default value was given, so it is None).

Any help would be appreciated.
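For what it's worth, in a BYOC image that does not install the sagemaker-training toolkit, hyperparameters are not converted into command-line arguments at all; SageMaker writes them, as strings, to /opt/ml/input/config/hyperparameters.json inside the container. A minimal sketch of reading them from there, assuming the json-encoded values produced by the notebook above:

import json
import os

HYPERPARAMETERS_PATH = '/opt/ml/input/config/hyperparameters.json'

def load_hyperparameters():
    # SageMaker stores every hyperparameter value as a string; values that were
    # json.dumps()-encoded on the notebook side need a second json.loads() here.
    if not os.path.exists(HYPERPARAMETERS_PATH):
        return {}
    with open(HYPERPARAMETERS_PATH) as f:
        raw = json.load(f)
    return {key: json.loads(value) for key, value in raw.items()}

hps = load_hyperparameters()
hp1 = hps.get('hp1', 'value0')
hp2 = hps.get('hp2', 150)
hp3 = hps.get('hp3')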

The issue reported here and the mentioned PR are not related. The PR allows passing arbitrary environment variables to a job, whereas the reporter complained about the usual SM_* variables not getting injected into the job/docker environment. So far, my experience is the same: the documented env vars, such as SM_MODEL_DIR, do not seem to be defined in the Docker container's environment when the entrypoint is called.

@satishpasumarthi Could you consider re-opening this, as I am still experiencing it with bring-your-own-container? The usual SM_* variables are not injected properly (and therefore not accessible from inside the training script). Using hyperparameters may work for small things, but for a big JSON-encoded dictionary like SM_TRAINING_ENV this workaround isn't practical.
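As a stopgap while this is open, the environment argument on the Estimator (the mechanism the PR above adds support for) can be used to set the variables explicitly. This is only a sketch, reusing the names from the launcher snippet at the top of the thread and assuming an SDK version that includes aws/sagemaker-python-sdk#3015:

from sagemaker.estimator import Estimator

# Explicitly inject the variables the training script expects. The values below
# mirror the conventional SageMaker container paths; treat this only as a
# temporary workaround until the SM_* variables are injected automatically.
local_estimator = Estimator(
    image_uri='local_image:latest',
    role=execution_role,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type='local',
    hyperparameters=hyperparameters,
    environment={
        'SM_MODEL_DIR': '/opt/ml/model',
        'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
        'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data',
    },
)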

We're also experiencing this with a BYOC container; we would love to have it solved!

Hello,
I am encountering this bug when using remote mode!
The default environment variables don't seem to be injected; is there any solution to this?
Thanks