Model fails when using the HuggingFace pipeline
saptarshi059 opened this issue · 4 comments
Hi,
So I've been trying to run very basic text-generation inference with this model using HuggingFace's pipeline API. However, it keeps crashing when trying to generate sequences with max_new_tokens = 10000:
```python
from transformers import pipeline

generator = pipeline('text-generation', model='facebook/galactica-125m', device=0)
generator('covid-19', renormalize_logits=True, do_sample=True,
          max_new_tokens=10000)[0]['generated_text']
```
I have updated my Transformers and PyTorch libraries. CUDA = 11.7 | torch = 1.14.0 (nightly; stable did not work either) | transformers = 4.25.1
GPUs = NVIDIA A100
Error:
```
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
```
Hi @saptarshi059, all the models were trained with a context window of 2048 tokens. Does the above code work if you set max_new_tokens=100?
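That is, the same call with the limit reduced to fit the context window:

```python
generator('covid-19', renormalize_logits=True, do_sample=True,
          max_new_tokens=100)[0]['generated_text']
```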
@mkardas This explains the errors I have been getting. I assume one is meant to program it to write longer outputs, perhaps by running it in a loop with a portion of the previous output as the next prompt.
Yes, this kind of moving-window approach should work, but it is not provided out of the box.
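A minimal sketch of what such a loop could look like (not part of the library; the chunk size of 1000 new tokens, the 2000-character tail, and the 10 iterations are arbitrary illustrative choices):

```python
from transformers import pipeline

generator = pipeline('text-generation', model='facebook/galactica-125m', device=0)

text = 'covid-19'
# Generate long output in chunks so that prompt + new tokens stay
# within the 2048-token context window.
for _ in range(10):  # 10 x 1000 new tokens ~ the 10000 originally requested
    # Crude character-based tail; trimming with the tokenizer would be more precise.
    tail = text[-2000:]
    out = generator(tail, renormalize_logits=True, do_sample=True,
                    max_new_tokens=1000)[0]['generated_text']
    # generated_text echoes the prompt, so keep only the newly generated part.
    text += out[len(tail):]

print(text)
```

The character-based window is just to keep the example short; using the model's tokenizer to trim the tail to a fixed token count would be the safer choice.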
@mkardas Oh I see. Thank you so much. Yes, it does work with max_new_tokens < 2048. I will try the moving window approach then.