Triton issues
prd-hung-trinh opened this issue · 1 comment
prd-hung-trinh commented
I tried to fine-tune https://huggingface.co/CATIE-AQ/FAT5-small-flan-en, but I got this error on an A10G (AWS EC2):
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Here are the training parameters:
{"epochs":2,"max_length":2048,"batch_size":16,"per_device_train_batch_size":2,"per_device_eval_batch_size": 2,"learning_rate":1e-5,"warmup_steps":150,"evaluation_steps": 100,"use_amp":false}
b-albar commented
Hi, the Triton kernel was designed for the A100 or higher. Other GPUs may not have enough shared memory, as is the case here: the kernel requests 163840 bytes, but the hardware limit reported on the A10G is 101376 bytes. It may be possible to adapt the kernel, but that could be tricky.
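To get a rough sense of how far the kernel is over budget, here is a minimal Python sketch that parses the error message quoted above and counts how many halvings of the footprint would be needed to fit. The helper names `parse_shared_memory_error` and `halvings_needed` are hypothetical, and the sketch assumes that halving a block size or `num_stages` roughly halves shared-memory use, which may not hold exactly for this kernel.

```python
import re

# Error text copied from the report above.
ERROR = (
    "out of resource: shared memory, Required: 163840, "
    "Hardware limit: 101376. Reducing block sizes or num_stages may help."
)


def parse_shared_memory_error(message):
    """Return (required_bytes, limit_bytes) parsed from the error text."""
    match = re.search(r"Required: (\d+), Hardware limit: (\d+)", message)
    if match is None:
        raise ValueError("not a shared-memory OutOfResources message")
    return int(match.group(1)), int(match.group(2))


def halvings_needed(required, limit):
    """Count how many times the footprint must be halved to fit the limit.

    Assumption (not verified for this kernel): each halving of a block
    size or of num_stages halves the shared memory the kernel requests.
    """
    count = 0
    while required > limit:
        required //= 2
        count += 1
    return count


required, limit = parse_shared_memory_error(ERROR)
print(f"required: {required / 1024} KiB, limit: {limit / 1024} KiB")
print(f"halvings needed: {halvings_needed(required, limit)}")
```

Under that assumption, a single halving (160 KiB down to 80 KiB) would fit under the 99 KiB limit, but whether the kernel stays correct and fast with smaller tiles would still need to be checked.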