tiangolo/uvicorn-gunicorn-fastapi-docker

Deploying HuggingFace model/pipeline using uvicorn-gunicorn-fastapi-docker on Google Cloud Run

GiorgioBarnabo opened this issue · 2 comments

Hi everybody,

I am pretty new to web app development and have some doubts about how to make the best use of this incredible Docker image.
In short, I have been trying to deploy a Hugging Face pipeline on Google Cloud Run using the uvicorn-gunicorn-fastapi-docker image. The model takes about 3.5GB of RAM, while a Cloud Run instance can have up to 16 vCPUs and 32GB of RAM. At deployment time, I also need to manually specify the maximum number of concurrent requests an instance handles before autoscaling kicks in.

How should I set the number of workers/threads for Gunicorn/Uvicorn, and how should I size the underlying Cloud Run instance? I noticed that every additional worker and/or thread needs another 3.5GB of RAM. Also, memory leaks during execution mean that a worker needs to be restarted every now and then.

My naive guess is that I should run as many workers as there are vCPUs and provision at least 3.5GB of RAM per worker. Is that correct? And how should I choose the maximum number of concurrent requests?
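
To make the question concrete, this is roughly the kind of deployment I have in mind (the service name, image path, and exact numbers below are placeholders, not my real setup):

    gcloud run deploy my-hf-service \
        --image gcr.io/my-project/my-hf-image \
        --cpu 4 \
        --memory 16Gi \
        --concurrency 4 \
        --max-instances 3

The idea would be 4 workers x 3.5GB = 14GB for the model copies plus some headroom, with concurrency matched to the number of workers, but I am not sure this reasoning is right.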

Right now, the Uvicorn command in my Dockerfile looks like this:

CMD uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4 --access-log --use-colors

Nonetheless, with this setting, after a while the RAM gets saturated and the service breaks down :(

Any help is more than welcome.

Thank you in advance. Best

ahron1 commented

If you use def for the FastAPI path operation function, each incoming request is handled by a thread from a threadpool. There is a single copy of the model in GPU memory.

If you use async def with N workers, the server forks N worker processes, and each request is handled by one of these N processes. Each of the N workers has its own copy of the model in the GPU; workers don't share memory or other resources.
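
A minimal sketch of the two styles I am describing (the pipeline task, endpoint paths, and function names are just for illustration):

    from fastapi import FastAPI
    from transformers import pipeline

    app = FastAPI()

    # Loaded once per worker process, at import time.
    # With N worker processes there are N copies of this object.
    classifier = pipeline("sentiment-analysis")

    # def endpoint: FastAPI runs it in a threadpool, so several
    # requests can be processed concurrently in the same process.
    @app.post("/predict-def")
    def predict_def(text: str):
        return classifier(text)

    # async def endpoint: runs on the event loop of the worker that
    # received the request; a blocking model call here blocks that
    # worker's loop until the call finishes.
    @app.post("/predict-async")
    async def predict_async(text: str):
        return classifier(text)

In both styles, the classifier above is loaded once per worker process, which is where the per-worker memory cost comes from.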

To decide the number of workers: N = number of CPU threads + 1.
You also need enough GPU memory to fit N copies of the model, with some headroom.

So if you are GPU-limited, that is the criterion for deciding the number of workers.
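
The same arithmetic applies whether the limit is GPU memory or instance RAM. A back-of-the-envelope sketch with the numbers from this thread (3.5GB per model copy, a 32GB instance) and an assumed 2GB of overhead:

    # Rough sizing sketch; the overhead figure is an assumption.
    model_size_gb = 3.5      # one copy of the pipeline in memory
    instance_ram_gb = 32.0   # largest instance size mentioned above
    overhead_gb = 2.0        # assumed headroom for OS, server, buffers

    max_workers = int((instance_ram_gb - overhead_gb) // model_size_gb)
    print(max_workers)  # 8 -> at most 8 workers fit in memory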

What I wrote above is based on what I observed in a few tests. It might well be incorrect.

ahron1 commented

I would also recommend using Gunicorn (as a process manager for Uvicorn workers) instead of plain Uvicorn to run the app.
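
Something along these lines (the module path and numbers are placeholders); --max-requests is also a crude mitigation for the memory leak you mentioned, since each worker gets recycled after handling that many requests:

    CMD gunicorn main:app \
        --worker-class uvicorn.workers.UvicornWorker \
        --workers 4 \
        --bind 0.0.0.0:8080 \
        --timeout 120 \
        --max-requests 1000 \
        --max-requests-jitter 100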