philschmid/sagemaker-huggingface-llama-2-samples

CUDA error encountered with both BS=2 and BS=3 for 7B and 13B.

alfredcs opened this issue · 1 comment

File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/conda/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-07-20 15:52:05,435 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Reporting training FAILURE
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
0%| | 0/2052 [00:00<?, ?it/s]"
Command "/opt/conda/bin/python3.10 run_clm.py --dataset_path /opt/ml/input/data/training --epochs 3 --hf_token [REDACTED] --lr 0.0002 --merge_weights True --model_id meta-llama/Llama-2-13b-hf --per_device_train_batch_size 2"
2023-07-20 15:52:05,435 sagemaker-training-toolkit ERROR Encountered exit_code 1

Looks like a hardware issue, can you retry?
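Since the suggestion is to treat this as a transient hardware fault and retry, here is a minimal sketch of how a caller might decide whether a SageMaker training failure is worth resubmitting. The function name and the marker list are illustrative assumptions, not part of any SDK; `CUBLAS_STATUS_NOT_INITIALIZED` in particular often indicates a GPU/driver or memory problem on the instance rather than a bug in the training script.

```python
# Hypothetical helper: classify a SageMaker FailureReason string as a
# transient CUDA/hardware error (worth retrying on a fresh instance)
# versus a likely bug in the user script. Markers are illustrative.

TRANSIENT_CUDA_MARKERS = (
    "CUBLAS_STATUS_NOT_INITIALIZED",
    "CUDA error",
    "CUDA out of memory",
)


def is_transient_cuda_failure(failure_reason: str) -> bool:
    """Return True if the failure message matches a known transient
    GPU/driver symptom, suggesting a retry may succeed."""
    return any(marker in failure_reason for marker in TRANSIENT_CUDA_MARKERS)


if __name__ == "__main__":
    msg = ("RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED "
           "when calling cublasCreate(handle)")
    print(is_transient_cuda_failure(msg))  # this message matches a marker
```

A retry loop around the estimator's `fit()` call could use such a check to resubmit the job a bounded number of times before giving up.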