https://dashboard.eventengine.run/login
Amazon SageMaker has built-in algorithms that let you quickly get started on extracting value out of your data. However, for customers using frameworks or libraries not natively supported by Amazon SageMaker or customers that want to use custom training/inference code, it also offers the capability to package, train and serve the models using custom Docker images.
In this workshop, you will work on this advanced use-case of building, training and deploying ML models using custom built TensorFlow Docker containers.
The model we will develop will classify news articles into the appropriate news category. To train our model, we will be using the UCI News Dataset which contains a list of about 420K articles and their appropriate categories (labels). There are four categories: Business (b), Science & Technology (t), Entertainment (e) and Health & Medicine (m).
Before we dive into the mechanics of our deep learning model, let’s explore the dataset and see what information we can use to predict the category. For this, we will use a notebook within Amazon SageMaker that we will can also utilize later on as our development machine.
Follow these steps to launch a notebook, download and explore the dataset:
1. Open the Amazon SageMaker Console, select 'Notebook instances' on the left and then click on ‘Create notebook instance’ and give the notebook a name. For the instance type, I’m going to pick ‘ml.t3.medium’ since our example dataset is small and we don’t intend on using GPUs for training/inference. We're not planning to use Elastic Inference either, so you can leave the default of ‘none’.
For the IAM role, select ‘Create a new role’ and select the options shown below for the role configuration. We don't need access to specific S3 buckets, so you can select ‘None’.
Click ‘Create role’ to create a new role. In the ‘Git repositories’ section select the option to clone a public Git repository and use this URL: https://github.com/aws-samples/amazon-sagemaker-keras-text-classification.git
Hit ‘Create notebook instance’ to submit the request for a new notebook instance.
Note: It usually takes a few minutes for the notebook instance to become available. Once available, the status of the notebook instance will change from ‘Pending’ to ‘InService’. You can move on to the next step while notebook instance is still in 'Pending' state.
2. While the notebook instance comes up, let’s go ahead and add a managed IAM policy to give the notebook instance and Amazon SageMaker access to read and write images to the Elastic Container Repository (ECR) service so that we can push the Docker images from our notebook instance and Amazon SageMaker can retrieve them for training and inference.
From the Amazon SageMaker console, click on the name of the notebook instance you just created:
From the notebook instance details page, click on the new role that you just created.
This will open up a new tab showing the IAM role details. Here click on ‘Attach policies’ and then search for ‘AmazonEC2ContainerRegistryFullAccess’ policy, select it and then click on ‘Attach policy’.
Please make sure to check the checkbox next to the policy before hitting Attach policy
3. From the Amazon SageMaker console, click ‘Open Jupyter’ to navigate into the Jupyter notebook. Under ‘New’, select ‘Terminal’. This will open up a terminal session to your notebook instance.
4. Switch into the ‘data’ directory
cd SageMaker/amazon-sagemaker-keras-text-classification/data
5. Download and unzip the dataset
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip && unzip NewsAggregatorDataset.zip
6. Now lets also download and unzip the pre-trained glove embedding files (more on this in a bit):
wget http://nlp.stanford.edu/data/glove.6B.zip && unzip glove.6B.zip
Note: in case the above links don't work, please download and unzip the following file in the same directory:
https://danilop.s3-eu-west-1.amazonaws.com/reInvent-Workshop-Data-Backup.zip
7. Remove the unnecessary files
rm 2pageSessions.csv glove.6B.200d.txt glove.6B.50d.txt glove.6B.300d.txt glove.6B.zip readme.txt NewsAggregatorDataset.zip && rm -rf __MACOSX/
At this point, you should only see two files: ‘glove.6B.100d.txt’ (word embeddings) and ‘newsCorpora.csv’ (dataset) in the this data directory.
8. Go back to the Jupyter notebook web UI. You should be in the folder called ‘sagemaker_keras_text_classification’. Please launch the notebook within it with the same name. Make sure the kernel you are running is ‘conda_tensorflow_p36’.
If it’s not, you can switch it from ‘Kernel -> Change kernel’ menu:
9. Once you individually run the cells within this notebook (shift+enter) through ‘Step 1: Data Exploration’, you should see some sample data (Note: do not run all cells within the notebook – the example is designed to be followed one cell at a time):
Here we first import the necessary libraries and tools such as TensorFlow, pandas and numpy. An open-source high performance data analysis library, pandas is an essential tool used in almost every Python-based data science experiment. NumPy is another Python library that provides data structures to hold multi-dimensional array data and provides many utility functions to transform that data. TensorFlow is a widely used deep learning framework that also includes the higher-level deep learning Python library called Keras. We will be using Keras to build and iterate our text classification model.
Next we define the list of columns contained in this dataset (the format is usually described as part of the dataset as it is here). Finally, we use the ‘read_csv()’ method of the pandas library to read the dataset into memory and look at the first few lines using the ‘head()’ method.
Remember, our goal is to accurately predict the category of any news article. So, ‘Category’ is our label or target column. For this example, we will only use the information contained in the ‘Title’ to predict the category.
Since we are going to be using a custom built container for this workshop, we will need to create it. We will use this container for local testing. Once satisfied with local testing, we will push it up to Amazon Container Registery (ECR) where it can pulled from by Amazon SageMaker for training and deployment.
Instead of building a TensorFlow container from scratch, we are going to use the AWS Deep Learning (DL) Containers. AWS DL Containers are Docker images pre-installed with deep learning frameworks to make it easy to deploy custom machine learning (ML) environments quickly.
AWS DL Containers support TensorFlow, PyTorch, and Apache MXNet. We are going to use TensorFlow today. You can deploy AWS DL Containers on Amazon Sagemaker, Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes on Amazon EC2, Amazon Elastic Container Service (Amazon ECS). We are going to deploy using SageMaker for training and inference. The containers are available through Amazon Elastic Container Registry (Amazon ECR) and AWS Marketplace at no cost, you pay only for the resources that you use. In this workshop, we a re going to download the AWS DL Containers images via ECR.
For your reference, all available AWS DL Containers images are described in the documentation:
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
1. Change directory to be in the path where we are going to create the custom container:
cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/
2. Create a new Dockerfile using vim Dockerfile
, hit i
to insert and then paste the content below. In the line starting with FROM
, replace REGION
with the AWS region you are using today, for example us-east-1
. To replace a word with vim
, hit Escape and then cw
, finally type the new word.
# Build an image that can do training and inference in SageMaker
FROM 763104351884.dkr.ecr.REGION.amazonaws.com/tensorflow-training:1.14.0-cpu-py36-ubuntu16.04
RUN apt-get update && \
apt-get install -y nginx
RUN pip install gevent gunicorn flask
ENV PATH="/opt/program:${PATH}"
# Set up the program in the image
COPY sagemaker_keras_text_classification /opt/program
WORKDIR /opt/program
Hit Escape and then :wq
to save and exit vim.
We start from the base
image, add the code directory to our path, copy the code into that directory and finally set the WORKDIR to the same path so any subsequent RUN/ENTRYPOINT commands run by Amazon SageMaker will use this directory.
3. Build the custom image
Note: Slow down and read the below instruction very carefully. The next command(aws ecr get-login...) is a two step process. First, run the below command(aws ecr get-login...); Second, copy output of command(aws ecr get-login...) and run it as a command.
Run the following docker login command
Step 1:
aws ecr get-login --no-include-email --region <AWS REGION such as us-east-1> --registry-ids 763104351884
Step 2:
<copy and run the output emitted by above command without any changes>
Next, run following command
docker build -t sagemaker-keras-text-class:latest .
Once we are finished developing the training portion (in ‘container/train’), we can start testing locally so we can debug our code quickly. Local test scripts are found in the ‘container/local_test’ subfolder. Here we can run ‘train_local.sh’ which will, in turn, run a Docker container within which our training code will execute.
1. The local testing framework expects the training data to be in the ‘/container/local_test/test_dir/input/data/training’ folder so let’s copy over the contents of our ‘data’ folder there.
In the notebook instance terminal window, switch over to the ‘sagemaker-keras-text-classification/data’ directory
cd ~/SageMaker/amazon-sagemaker-keras-text-classification/data
and then run:
cp -a . ../container/local_test/test_dir/input/data/training/
2. Switch into the ‘local_test’ directory
cd ../container/local_test
3. Run the following command to run the training locally.
./train_local.sh sagemaker-keras-text-class:latest
Note: it might take anywhere from 2-3 minutes to complete for the local training to complete.
With an 80/20 split between the training and validation and a simple Feed Forward Neural Network, we get around 78-80% validation accuracy (val_acc) after two epochs – not a bad start!
Try reaching 84-85% validation accouracy before going to the next step. You can test different network architectures (using different hyperparameters) by editing ~/SageMaker/amazon-sagemaker-keras-text-classification/container/train
and create more docker containers (each with a unique name) that you can train local (you don't need to edit the Dockerfile).
Go back in the container
directory:
cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/
Edit the train
file to update your network architecture:
vim ./sagemaker_keras_text_classification/train
If the network architecture is too "small", then it may be incapable of "learn" from the traning data and accourancy cannot increase beyond a certain point. That is a case of underfitting. If the network architecture is too "complex", it can learn to fit to the training data so very well, but then the model is not capable of working on new data points, so the validation accourancy is much lower than the training accourancy.
Here's the part of the train
file where you can change the network architecture to have more units or add new layers:
# ------Architecture: MLP------------------------------
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(2, activation='relu')) # Try 2-32 units (dimensionality)
# model.add(tf.keras.layers.Dense(2, activation='relu')) # Try adding more layers uncommenting this line and changing the units
#------------------------------------------------------
Create a new container image with a name that reflects your changes:
docker build -t sagemaker-keras-text-class-2units-2layers:latest .
Test again your training locally, in the notebook instance, to see if your validation accouracy improved:
cd ../container/local_test
./train_local.sh sagemaker-keras-text-class-2units-2layers:latest
When running training locally multiple times, you should confirm when asked to remove some write-protected regular files. Those files are the model output of the previous training (the ‘news_breaker.h5’ and the ‘tokenizer.pickle’ files).
Note: for each test it might take anywhere from 2-3 minutes to complete for the local training to complete, we recommend you do 2-3 tests and them move forward with the best result you got.
We now have a saved model called ‘news_breaker.h5’ and the ‘tokenizer.pickle’ file within ‘sagemaker-keras-text-classification/container/local_test /test_dir/model’ – the local directory that we mapped to the ‘/opt/ml’ directory within the container.
It is also advisable to locally test and debug the interference Flask app so we don’t waste time debugging it when we deploy it to Amazon SageMaker.
4. Start local testing by running ‘serve_local.sh’ in the ‘local_test’ directory (we should be in it already after completing previous step):
./serve_local.sh sagemaker-keras-text-class:latest
This is a simple script that uses the ‘Docker run’ command to start the container and the Flask app that we defined previously in the serve
file.
5. Now open another terminal, move to the local_test
directory and run ‘predict.sh’. This script issues a request to the flask app using the test news headline in input.json
:
cd SageMaker/amazon-sagemaker-keras-text-classification/container/local_test && cat input.json && ./predict.sh input.json application/json
Great! Our model inference implementation responds and is correctly able to categorize this headline as a Health & Medicine story.
Now that we are done testing locally, we are ready to package up our code and submit to Amazon SageMaker for training or deployment (hosting) or both.
1. We should probably modify our training code to take advantage of the more powerful hardware. Let’s update the number of epochs in the ‘train’ script to 10
to see how that impacts the validation accuracy of our model while training on Amazon SageMaker. This file is located in 'sagemaker_keras_text_classification' directory. Navigate there by
cd ../sagemaker_keras_text_classification/
and edit the file named 'train'
history = model.fit(x_train, y_train,
epochs=10,
batch_size=32,
validation_data=(x_test, y_test))
2. Open the ‘sagemaker_keras_text_classification.ipynb’ notebook and follow the steps listed in Lab 4 to upload the data to S3, submit the training job and, finally, deploy the model for inference. The notebook contains explanations for each step and also shows how to test your inference endpoint.
This lab will demonstrate both SageMaker Parameter Server framework as well as SageMaker's Script mode.
A common pattern in distributed training is to use dedicated processes to collect gradients computed by “worker” processes, then aggregate them and distribute the updated gradients back to the workers. These processes are known as parameter servers. In general, they can be run either on their own machines or co-located on the same machines as the workers. In a parameter server cluster, each parameter server communicates with all workers (“all-to-all”). The Amazon SageMaker prebuilt TensorFlow container comes with a built-in option to use parameter servers for distributed training. The container runs a parameter server thread in each training instance, so there is a 1:1 ratio of parameter servers to workers. With this built-in option, gradient updates are made asynchronously (though some other versions of parameters servers use synchronous updates).
For this lab, we will be instantiating CPU compute nodes for simplicity and scalability.
Previously (as in Lab 2-4 of this workshop), in BringYourOwnContainer situation, a user had to make his/her training Python script a part of the container. Therefore, during the debug process, every Python script change required rebuilding the container. SageMaker's "script mode" allows one to build the container once and then debug and change a python script without rebuilding the container with every change. Instead, a user specifies script's "entry point" via entry_point='myscript.py' and script_mode=True parameter.
Script Mode requires a training script, which in this case is the sentiment.py file in the /distributed training subdirectory of the related distributed training example GitHub repository. Once a training script is ready, the next step is to set up an Amazon SageMaker TensorFlow Estimator object with the details of the training job. It is very similar to an Estimator for training on a single machine, except we specify a distributions parameter to enable starting a parameter server on each training instance.
distributions = {'parameter_server': {'enabled': True}}
hyperparameters = {'epochs': 10, 'batch_size': 128}
estimator = TensorFlow(
source_dir='tf-sentiment-script-mode',
entry_point='sentiment.py',
model_dir=model_dir,
train_instance_type=train_instance_type,
train_instance_count=instance_count,
hyperparameters=hyperparameters,
role=sagemaker.get_execution_role(),
base_job_name='tf-keras-sentiment',
framework_version='1.13',
py_version='py3',
distributions = distributions,
script_mode=True)
- Open the ‘sentiment-analysis.ipynb’ notebook located in "distributed training" directory and follow the flow.
Dataset: Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Glove Embeddings: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. [pdf] [bib]
This library is licensed under the Apache 2.0 License.