This repository implements a custom sagemaker-sdk extension for the HuggingFace libraries. The repository is split into 3 parts. First there is `docker/`, which contains the Dockerfiles and scripts to create the AWS DLCs for HuggingFace. Second there is `huggingface/`, which includes the custom `HuggingFace()` extension for the sagemaker-sdk. Lastly there is `examples/`, which contains multiple examples of how to use the `HuggingFace()` extension for the sagemaker-sdk.
This repository contains multiple examples of how to use the transformers and datasets libraries from HuggingFace with AWS SageMaker. All notebooks can be run locally or within AWS SageMaker Studio.
- Container Image List
- Script Arguments
- Build and Push Container Example
- How to use HuggingFace sagemaker-sdk extension
- Example Overview
You can build different HuggingFace Deep Learning Containers to use in AWS SageMaker.

For training:
- a GPU container based on the AWS DLC PyTorch1.6
- a CPU container based on the AWS DLC PyTorch1.6
- a test container that just checks if the parameters are passed correctly

For inference:
- a GPU container based on the AWS DLC PyTorch1.6
- a CPU container based on the AWS DLC PyTorch1.6
- a test container that just checks if the parameters are passed correctly
NOTE: Sadly, SageMaker doesn't support `public.ecr.aws` images, so we have to use containers from a private registry. To use a private registry you can use the `docker/build_push_private_ecr.sh` script to build and push the container to your private ECR registry.
type | device | base | python-version | transformers-version | datasets-version | public-URL |
---|---|---|---|---|---|---|
training | cpu | aws dlc pytorch1.6.0-cpu-py36-ubuntu16.04 | 3.6.10 | 4.1.1 | 1.1.3 | public.ecr.aws/t6m7g5n4/huggingface-training:0.0.1-cpu-transformers4.1.1-datasets1.1.3 |
training | gpu | aws dlc pytorch1.6.0-gpu-py36-cu110-ubuntu16.04 | 3.6.10 | 4.1.1 | 1.1.3 | public.ecr.aws/t6m7g5n4/huggingface-training:0.0.1-gpu-transformers4.1.1-datasets1.1.3-cu110 |
inference | cpu | aws dlc pytorch1.6.0-cpu-py36-ubuntu16.04 | 3.6.10 | 4.1.1 | 1.1.3 | |
inference | gpu | aws dlc pytorch1.6.0-gpu-py36-cu110-ubuntu16.04 | 3.6.10 | 4.1.1 | 1.1.3 | |
You can pass multiple named arguments to the script.
parameter | default | description |
---|---|---|
--image_type | training | The container image type, either training or inference |
--device | cpu | The container device, either cpu, gpu or test |
--account_id | 558105141721 | The aws account_id of the aws account/registry |
--profile | default | The aws profile which is going to be used. Pass ci for CI pipelines |
--transformers_version | 4.1.1 | The transformers version which will be installed in the container |
--datasets_version | 1.1.3 | The datasets version which will be installed in the container |
--version | 0.0.1 | The container version |
usage

./docker/build_push_private_ecr.sh --device gpu --image_type training --version 1.0.0
Since `public.ecr.aws` is currently not supported by SageMaker, you have to build the Docker image yourself and upload it to your private ECR registry.
GPU Container Training
cd docker && ./build_push_private_ecr.sh --device gpu --image_type training --profile hf-sm
GPU Container Inference
./docker/build_push_private_ecr.sh --device gpu --image_type inference
CPU Container Training
./docker/build_push_private_ecr.sh --device cpu --image_type training
CPU Container Inference
./docker/build_push_private_ecr.sh --device cpu --image_type inference
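To give an idea of how the extension is used, below is a minimal sketch of a training job launched with the custom `HuggingFace()` estimator from `huggingface/`. It assumes the estimator mirrors the interface of the built-in sagemaker-sdk framework estimators (e.g. `PyTorch`); the import path, entry point, role name, S3 URIs and hyperparameters are illustrative placeholders, not taken from the repository.

```python
# Minimal sketch, assuming the custom HuggingFace() estimator follows the
# interface of sagemaker's built-in framework estimators. The import path,
# entry point, role, image URI and hyperparameters are illustrative.
from huggingface.estimator import HuggingFace  # assumed module layout

huggingface_estimator = HuggingFace(
    entry_point="train.py",           # your fine-tuning script using the Trainer class
    source_dir="./scripts",
    base_job_name="huggingface-example",
    instance_type="ml.p3.2xlarge",    # GPU instance
    instance_count=1,
    role="sagemaker-execution-role",  # the SageMaker IAM role mentioned below
    # image built and pushed with docker/build_push_private_ecr.sh
    image_name="<account_id>.dkr.ecr.<region>.amazonaws.com/huggingface-training:0.0.1-gpu-transformers4.1.1-datasets1.1.3-cu110",
    hyperparameters={"epochs": 3, "train_batch_size": 32},
)

# Start the training job; the channels point to data prepared with the
# datasets library and uploaded to S3 (as in example 01).
huggingface_estimator.fit({
    "train": "s3://<bucket>/train",
    "test": "s3://<bucket>/test",
})
```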
Example structure

Each folder starting with `0X_...` contains a SageMaker example. Each example contains a Jupyter notebook `sagemaker-example.ipynb`, which is used to start a training job on AWS SageMaker or to preprocess data. As explained above, you are able to run these examples either on your local machine or in AWS SageMaker Studio.
example | description |
---|---|
01_basic_example_huggingface_extension | This example uses the custom HuggingFace sagemaker extension. The fine-tuning script uses the Trainer class. The dataset is processed in a Jupyter notebook with the datasets library and then uploaded to S3. |
02_spot_instances_with_huggingface_extension | The same example as 01_basic_example_huggingface_extension, but we use EC2 spot instances for training, which can reduce the training cost by up to 90% (see the sketch after this table). |
03_track_custom_metrics_huggingface_extension | The same example as 02_spot_instances_with_huggingface_extension, but we use custom metrics to track validation metrics in our training job and plot them in the notebook (see the sketch after this table). |
04_track_experiments_huggingface_extension | The same example as 02_spot_instances_with_huggingface_extension, but we use sagemaker-experiments to track logs and metrics from our training job and use it to compare hyperparameter tuning training jobs. |
05_upload_to_model_hub | The same example as 01_basic_example_huggingface_extension, but we upload the model at the end to the HuggingFace model hub. |
06_transformers_existing_training_scripts | This example uses an existing fine-tuning script from the transformers repository. |
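For examples 02 and 03, the relevant configuration happens on the estimator. The sketch below shows, under the same assumptions as the sketch above, how EC2 spot instances and custom metric tracking are typically wired up; the parameter names (`use_spot_instances`, `max_wait`, `max_run`, `checkpoint_s3_uri`, `metric_definitions`) follow the public sagemaker-sdk, while the regexes and values are illustrative, not taken from the notebooks.

```python
# Illustrative sketch of the spot-instance (example 02) and custom-metrics
# (example 03) settings; values and regexes are assumptions.
from huggingface.estimator import HuggingFace  # assumed module layout, see above

# SageMaker scrapes these regexes from the training logs and turns the
# matches into metrics you can query and plot in the notebook (example 03).
metric_definitions = [
    {"Name": "eval_loss", "Regex": "'eval_loss': ([0-9\\.]+)"},
    {"Name": "eval_accuracy", "Regex": "'eval_accuracy': ([0-9\\.]+)"},
]

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role="sagemaker-execution-role",
    # EC2 spot settings (example 02): can reduce training cost by up to 90%.
    use_spot_instances=True,
    max_run=3600,                                   # max training time in seconds
    max_wait=7200,                                  # max wait for spot capacity (>= max_run)
    checkpoint_s3_uri="s3://<bucket>/checkpoints",  # resume after a spot interruption
    metric_definitions=metric_definitions,          # example 03
)
```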
If you want to use an example on your local machine, you need:
- an AWS Account
- configured AWS credentials on your local machine
- an AWS SageMaker IAM role
If you don't have an AWS account you can create one here. To configure AWS credentials on your local machine you can take a look here. Lastly, to create an AWS SageMaker IAM role you can take a look here. Be aware that if you change the name of the role, you have to adjust it in the Jupyter notebooks. Now install the dependencies from the `requirements.txt` and you are good to go.
pip install -r requirements.txt
If you want to use an example in SageMaker Studio, open your SageMaker Studio and clone the GitHub repository. Afterwards install the dependencies from the `requirements.txt`.

- If you get an `UnknownServiceError` with `Unknown service: 'sagemaker-featurestore-runtime'`, run `pip install -r requirements.txt --upgrade` and restart your Jupyter runtime.