replicate/replicate-python

setup timeout without meaningful logs

afpro opened this issue · 4 comments

My model keeps failing setup with the error 'model container failed to boot and complete setup within 600 seconds'.
I searched on Google but found no solution.
How can I find more information about what happened?

model version: dea48a520fc0954407bfb1dd9dd3d8d4eabdb675b2cd947d6aaf302485a714ce

Hi @afpro. It looks like your model was configured to run on a T4. If the model is indeed 13B (as the name implies), the 16GB VRAM available on that hardware may not be sufficient. That'd be my guess as to why it's failing during setup. Go to the model settings and try switching the hardware to an A40 or A100.
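A rough back-of-the-envelope check of that guess, as a sketch only. It assumes the weights are loaded in fp16 (2 bytes per parameter) and counts nothing but the weights, so activations, KV cache, and runtime overhead would come on top. The helper name `weight_memory_gib` is made up for illustration and is not part of any Replicate or Cog API.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GiB.

    bytes_per_param=2 assumes fp16/bf16 weights; a quantized model would need less.
    """
    return num_params * bytes_per_param / 1024**3


T4_VRAM_GIB = 16  # the T4 figure quoted above

needed = weight_memory_gib(13e9)  # 13B parameters, taken from the model name
print(f"13B fp16 weights: ~{needed:.1f} GiB; T4 has {T4_VRAM_GIB} GiB")
print("fits" if needed <= T4_VRAM_GIB else "does not fit")
```

If the model is quantized (for example 4-bit), pass a smaller effective bytes-per-parameter and the estimate changes a lot, so treat this as a sanity check rather than a diagnosis.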

I ran llama2-chat-70b on an A40 and got an 'out of memory' error. In that situation I get a Python exception stack trace, not a 'timeout'.

I've just given up.