Training Job "Successful" despite failing due to 100% disk usage
david-waterworth opened this issue · 0 comments
Describe the bug
I ran a training job as part of a SageMaker pipeline. The model wrote checkpoints by default, and after epoch 2 of 10 disk utilisation reached 100%.
Despite the abnormal exit from the training script, the training job, and hence the pipeline step, was reported as successful.
To reproduce
I used the HuggingFace estimator with the following parameters:
instance_type="ml.g4dn.xlarge",
role=role,
transformers_version="4.28",
pytorch_version="2.0",
py_version="py310",
The model is a sentence-transformers model (installed via requirements.txt). I inadvertently enabled checkpointing, hence the out-of-disk issue.
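The out-of-disk state likely came from something along these lines. This is only an illustrative sketch assuming the sentence-transformers v2 `fit()` API, with a placeholder model name and made-up batch/step values; only the `/opt/ml/checkpoints` path is taken from the CloudWatch log below:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

train_examples = [
    InputExample(texts=["a sentence", "a similar sentence"], label=0.9),
    InputExample(texts=["a sentence", "an unrelated sentence"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    checkpoint_path="/opt/ml/checkpoints",  # path seen in the log below
    checkpoint_save_steps=1000,             # illustrative value
    # no checkpoint_save_total_limit, so every checkpoint is kept and the
    # instance volume eventually fills up
)
```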
CloudWatch logs indicate abnormal termination, i.e.
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 2023-11-07 11:49:54 - Save model to /opt/ml/checkpoints/242000
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch: 20%|██ | 2/10 [13:42:59<39:37:08, 17828.58s/it]
2023-11-07T11:49:54.665000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 #015Iteration: 77%|███████▋ | 67255/87372 [3:48:53<1:08:41, 4.88it/s]#033[A
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Iteration: 77%|███████▋ | 67255/87372 [3:48:54<1:08:28, 4.90it/s]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 Epoch: 20%|██ | 2/10 [13:43:01<54:52:04, 24690.56s/it]
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:441 in save │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 438 │ │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 439 │ if _use_new_zipfile_serialization: │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 440 │ │ with _open_zipfile_writer(f) as opened_zipfile: │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ ❱ 441 │ │ │ _save(obj, opened_zipfile, pickle_module, pickle_protocol │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 442 │ │ │ return │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 443 │ else: │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 444 │ │ with _open_file_like(f, 'wb') as opened_file: │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ /opt/conda/lib/python3.10/site-packages/torch/serialization.py:668 in _save │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 665 │ │ │ storage = storage.cpu() │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 666 │ │ # Now that it is on the CPU we can directly copy it into the │
2023-11-07T11:49:56.666000+00:00 pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu/algo-1-1699307977 │ 667 │ │ num_bytes = storage.nbytes() │
The training job charts show disk utilisation hitting 100%.
But the training job status is "Completed"; the abnormal termination wasn't detected.
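For anyone reproducing this, the reported status can be confirmed programmatically. A sketch using boto3 `describe_training_job`, where the job name is inferred from the CloudWatch log stream prefix above:

```python
import boto3

sm = boto3.client("sagemaker")

# Job name inferred from the log stream prefix; adjust for your own run.
desc = sm.describe_training_job(
    TrainingJobName="pipelines-jz2u9wqwy37v-TrainModel-mtxngHJgFu"
)
print(desc["TrainingJobStatus"])   # reported as "Completed" despite the crash
print(desc.get("FailureReason"))   # nothing surfaced
```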
Expected behavior
SageMaker pipeline steps shouldn't report success unless the training script terminated normally.
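Until the status reporting is fixed, a possible mitigation (a sketch, not an official fix, and only relevant if the exception is being swallowed before the process exits) is to wrap the entry point so that any unhandled error writes `/opt/ml/output/failure` and exits non-zero, which is what SageMaker uses to mark a training job as Failed:

```python
# train.py -- defensive wrapper around the real training code (sketch)
import sys
import traceback

def train():
    # ... actual sentence-transformers training code ...
    pass

if __name__ == "__main__":
    try:
        train()
    except Exception:
        # A non-zero exit code is what makes SageMaker report the job as Failed;
        # the contents of /opt/ml/output/failure are surfaced as the FailureReason.
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)
```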