aws/sagemaker-training-toolkit

Training Job "Successful" despite failing due to 100% disk usage

david-waterworth opened this issue

Describe the bug
I ran a training job as part of a SageMaker pipeline. The model wrote checkpoints by default, and after epoch 2 of 10, disk utilisation reached 100%.

Despite the training script exiting abnormally, the training job, and hence the pipeline step, was reported as successful.

To reproduce
I used the HuggingFace estimator with the following parameters:

```python
instance_type="ml.g4dn.xlarge",
role=role,
transformers_version="4.28",
pytorch_version="2.0",
py_version="py310",
```
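
For completeness, the fragments above fit into an estimator roughly like the following (the import is the standard SageMaker SDK one; entry_point and source_dir are illustrative assumptions, not copied from the pipeline):

```python
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",       # assumption: name of the training script
    source_dir="src",             # assumption: directory containing requirements.txt
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)
```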

The model is a sentence-transformers model (installed via requirements.txt). I inadvertently enabled checkpointing (sketched below), hence the out-of-disk issue.
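
In sentence-transformers v2.x, checkpointing is controlled through arguments to fit(); each checkpoint is written to <checkpoint_path>/<global_step>, which matches the /opt/ml/checkpoints/242000 path in the logs below. A sketch of how it was likely enabled, and how checkpoint_save_total_limit can bound disk usage (the model name and step values are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any sentence-transformers model

model.fit(
    train_objectives=[(train_dataloader, train_loss)],  # assumption: defined elsewhere
    epochs=10,
    checkpoint_path="/opt/ml/checkpoints",
    checkpoint_save_steps=2000,     # illustrative; writes a checkpoint every 2000 steps
    checkpoint_save_total_limit=2,  # keeps only the 2 newest checkpoints, capping disk usage
)
```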

CloudWatch logs indicate abnormal termination (timestamp and log-stream prefix pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 trimmed for readability; the traceback is truncated in the captured output):

```
2023-11-07 11:49:54 - Save model to /opt/ml/checkpoints/242000
Epoch:  20%|██        | 2/10 [13:42:59<39:37:08, 17828.58s/it]
Iteration:  77%|███████▋  | 67255/87372 [3:48:53<1:08:41,  4.88it/s]
Iteration:  77%|███████▋  | 67255/87372 [3:48:54<1:08:28,  4.90it/s]
Epoch:  20%|██        | 2/10 [13:43:01<54:52:04, 24690.56s/it]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:441 in save   │
│                                                                              │
│    438 │                                                                     │
│    439 │   if _use_new_zipfile_serialization:                                │
│    440 │   │   with _open_zipfile_writer(f) as opened_zipfile:               │
│ ❱  441 │   │   │   _save(obj, opened_zipfile, pickle_module, pickle_protocol │
│    442 │   │   │   return                                                    │
│    443 │   else:                                                             │
│    444 │   │   with _open_file_like(f, 'wb') as opened_file:                 │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:668 in _save  │
│                                                                              │
│    665 │   │   │   storage = storage.cpu()                                   │
│    666 │   │   # Now that it is on the CPU we can directly copy it into the  │
│    667 │   │   num_bytes = storage.nbytes()                                  │
```
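
The crash happens inside torch.save while writing a checkpoint, which is consistent with the volume filling up. As a stopgap, each save in the training script could be guarded with a free-space check so the job fails fast with a clear error (a sketch; the path and threshold are assumptions):

```python
import shutil

def assert_free_space(path: str = "/opt/ml/checkpoints", min_free_gb: float = 5.0) -> None:
    """Raise before checkpointing if the volume is nearly full."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"Only {free_gb:.1f} GiB free under {path}; refusing to write a checkpoint"
        )
```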

The training job's metric charts show disk utilisation hitting 100%:

[Screenshot: training job disk utilisation chart reaching 100%]

But the training job status is "Completed"; the abnormal termination wasn't detected:

[Screenshot: training job status shown as Completed]

Expected behavior
SageMaker pipeline steps shouldn't report success unless the training script terminated normally; an abnormal exit like the one above should surface as a failed job and a failed pipeline step.
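
For what it's worth, the misreported status is also visible directly via the DescribeTrainingJob API (a sketch; the job name here is taken from the log-stream prefix and may differ from the actual training job name):

```python
import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_training_job(
    TrainingJobName="pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu"
)
print(desc["TrainingJobStatus"])            # reports "Completed" even though the script crashed
print(desc.get("FailureReason", "<none>"))  # absent, so the pipeline treats the step as successful
```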