davefojtik/RunPod-Fooocus-API

startup speed

Closed this issue · 8 comments

Hi @davefojtik
I was trying to further optimize the speed, and I noticed that the Docker image takes 40+ seconds to start up on its first run on a 4090.

Usually, most of your requests are passed to workers that already have the app up and running, so you don't experience the 40+ second wait (which actually counts as part of the Execution Time, so you are paying for it). But when requests are infrequent, or when RunPod throttles your workers often (which has been happening a lot for me recently), you hit that 40+ second wait: the container has started and even uvicorn is up, but the app does literally nothing for 40+ seconds until it eventually starts generating.

At first I thought it might be natural, but I did some testing on
https://github.com/runpod-workers/worker-a1111
and the first startup on that endpoint is around 3 seconds.

Do you have any idea what is causing this?

Yes, I'm observing this too, and it's currently the most expensive flaw of this solution. I don't know if we can do anything about it, though, as this is how FlashBoot works.

FlashBoot is our optimization layer to manage deployment, tear-down, and scale-up activities in real time. The more popular an endpoint is, the more likely FlashBoot will help reduce cold-start.

our lowest cold-start was 563 milliseconds, and max was 42 seconds. Without FlashBoot, we would incur 42 second cold-starts

source: https://blog.runpod.io/introducing-flashboot-1-second-serverless-cold-start/

For me, it is stuck further down at "Clearing outputs..."

2024-01-23T06:34:38.100634873Z Preload pipeline
2024-01-23T06:34:38.100637182Z Total VRAM 45416 MB, total RAM 515616 MB
2024-01-23T06:34:38.100638751Z Set vram state to: NORMAL_VRAM
2024-01-23T06:34:38.100640176Z Device: cuda:0 NVIDIA A40 : native
2024-01-23T06:34:38.100641835Z VAE dtype: torch.bfloat16
2024-01-23T06:34:38.100643365Z Using pytorch cross attention
2024-01-23T06:34:38.100644755Z INFO: 127.0.0.1:42288 - "GET /v1/generation/text-to-image HTTP/1.1" 405 Method Not Allowed
2024-01-23T06:34:38.101212404Z Fooocus API Service is ready. Starting RunPod...
2024-01-23T06:34:38.101228219Z --- Starting Serverless Worker | Version 1.3.4 ---
2024-01-23T06:34:38.214952612Z {"requestId": "04d0b057-735b-499a-b5ec-ee3886a76781-u1", "message": "Started", "level": "INFO"}
2024-01-23T06:34:38.214977852Z Clearing outputs...

Hello @CyrusVorwald. Just to be sure - do you mean the logs are "stuck" there for a while, or is your endpoint not generating any outputs at all? If it's the first case, how long does it take? And does it get better when FlashBoot kicks in during frequent use?

Without FlashBoot, it takes 40+ seconds to generate an output, sometimes upwards of 60-100 seconds, every time. With FlashBoot, this happens on the first run, but subsequent runs take about 10 seconds.

For me, the bulk of the time on the first cold start is spent further down from the gap you mentioned. It only takes about 5 seconds to reach the "Clearing outputs" part of the log, then there is no further log output for about 40 seconds, and then it starts the job.

I have not added logs to the steps between when the handler function prints "Clearing outputs" and when the job starts, but there are many steps there. I am not sure what the issue is, because the time taken is counted as execution time, which FlashBoot should not impact.
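For what it's worth, here is a rough sketch of how timing logs could be wrapped around the handler steps to locate the gap. The step names in the commented usage are placeholders, not the actual functions of this repo:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(step_name):
    """Print how long a named handler step takes, so the silent gap can be located."""
    start = time.time()
    print(f"[timing] {step_name} started", flush=True)
    try:
        yield
    finally:
        print(f"[timing] {step_name} finished in {time.time() - start:.1f}s", flush=True)

# Hypothetical usage inside handler.py -- the step names are placeholders:
# with timed("clear outputs"):
#     clear_outputs()
# with timed("wait for Fooocus API"):
#     wait_for_service()
# with timed("generation request"):
#     result = send_request(job_input)
```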

The "gap" in the logs looks different now with the new version because I moved the outputs clearing from start.sh to handler.py. It was cached by FlashBoot there before and didn't actually clean the outputs during frequent use.
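For clarity, the outputs-clearing step in handler.py is roughly along these lines (an illustrative sketch; the exact path and logic in the repo may differ):

```python
import os
import shutil

OUTPUTS_DIR = "/workspace/outputs"  # assumed location of the Fooocus outputs folder

def clear_outputs():
    """Remove leftovers from previous jobs so each request starts with a clean folder."""
    print("Clearing outputs...", flush=True)
    if os.path.isdir(OUTPUTS_DIR):
        for entry in os.listdir(OUTPUTS_DIR):
            path = os.path.join(OUTPUTS_DIR, entry)
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)
```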

In reality, the "gap" is the pipeline loading. You can easily see this by adding the --preload-pipeline flag to start.sh, which makes the pipeline load before the HTTP server. In that case, the log prints

Service not ready yet. Retrying...
Service not ready yet. Retrying...

nonstop for ~40s, then it connects and generates the output in ~8s.
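For context, the readiness loop behind those lines is roughly the following (a sketch; the local URL, port, and polling interval are assumptions, not the exact values used in this repo):

```python
import time
import requests

LOCAL_URL = "http://127.0.0.1:8888/v1/generation/text-to-image"  # port is an assumption

def wait_for_service(url=LOCAL_URL, interval=0.5):
    """Poll the local Fooocus API until it responds. With --preload-pipeline the
    pipeline loads before the HTTP server, so this loop runs for the whole ~40 s."""
    while True:
        try:
            requests.get(url, timeout=2)  # any HTTP response (even 405) means the server is up
            return
        except requests.exceptions.RequestException:
            print("Service not ready yet. Retrying...", flush=True)
        time.sleep(interval)
```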

Anyway, to eliminate this we would have to find a way to speed up the loading of Fooocus itself. I have one thing in mind, but it requires me to rework most of this repo to even test it. I'll let you know here once I make some progress.

Good news. I made a new branch that makes the cold starts quicker and the cached ones even a bit faster. It's a Standalone version with models and all files already baked into the container image (like the A1111 image mentioned above), which eliminates such long pipeline loading. You can now choose between the network volume version, which lets you change things on the fly, and the faster, more effective Standalone version.

Give it a try and let me know if it's fast enough now, or if you have any ideas on how to make it even faster. Thanks for your contributions to this repo!

I was caching my model with the network volume before this update. I haven't tried this method, but I still run into this issue. What's different?

@CyrusVorwald Do you mean what's the difference between the network and standalone versions?

Standalone version:

  • has all the files and models baked into the Docker image itself, so they are stored and loaded locally on the serverless instance. That means it does not have to transfer the files from the network storage servers before loading them, resulting in faster execution times.
  • the downside is that if you want to change the contents (e.g. models), you have to rebuild and redeploy the whole image.
  • is ideal if you need the fastest and most effective endpoint with contents that stay predefined long-term.

Network version:

  • is slower, but allows you to change the contents on the fly and persist the changes, either by spinning up a pod or by adding logic to the code (e.g. allowing users to add their own content).
  • is ideal if you need a more dynamic and customizable endpoint.

The v0.3.30 update also included some minor changes, like setting PYTHONUNBUFFERED=1 in the environment to make console outputs in RunPod appear faster and correspond more closely to the actual task execution times. I would suggest everyone use the latest versions, as I see no advantage in staying on older ones.