dwyl/image-classifier

Chore: Get Help with Lowering Machine Start Time for `imgai` App on Fly.io

nelsonic opened this issue · 8 comments

Just visited https://imgai.fly.dev/ while watching the logs:
https://fly.io/apps/imgai/monitoring

2023-11-21T09:33:01.113 proxy[d891394fee9318] mad [info] Starting machine

2023-11-21T09:33:01.473 app[d891394fee9318] mad [info] [ 0.175894] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!

2023-11-21T09:33:01.595 app[d891394fee9318] mad [info] [ 0.213605] PCI: Fatal: No config space access function found

2023-11-21T09:33:01.802 app[d891394fee9318] mad [info] INFO Starting init (commit: 15238e9)...

2023-11-21T09:33:01.872 app[d891394fee9318] mad [info] INFO Mounting /dev/vdb at /app/.bumblebee w/ uid: 65534, gid: 65534 and chmod 0755

2023-11-21T09:33:01.874 app[d891394fee9318] mad [info] INFO Resized /app/.bumblebee to 3204448256 bytes

2023-11-21T09:33:01.875 app[d891394fee9318] mad [info] INFO Preparing to run: `/app/bin/server` as nobody

2023-11-21T09:33:01.881 app[d891394fee9318] mad [info] INFO [fly api proxy] listening at /.fly/api

2023-11-21T09:33:01.889 app[d891394fee9318] mad [info] 2023/11/21 09:33:01 listening on [fdaa:3:7f9d:a7b:1be:5453:bd12:2]:22 (DNS: [fdaa::3]:53)

2023-11-21T09:33:02.312 proxy[d891394fee9318] mad [info] machine started in 1.198962695s

2023-11-21T09:33:03.883 app[d891394fee9318] mad [info] WARN Reaped child process with pid: 406 and signal: SIGUSR1, core dumped? false

2023-11-21T09:33:06.832 app[d891394fee9318] mad [info] 09:33:06.831 [info] TfrtCpuClient created.

2023-11-21T09:33:07.871 proxy[d891394fee9318] mad [info] waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.558674632s so far)

2023-11-21T09:33:10.849 proxy[d891394fee9318] mad [error] failed to connect to machine: gave up after 15 attempts (in 8.536197985s)

2023-11-21T09:33:10.947 proxy[d891394fee9318] mad [error] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)

2023-11-21T09:33:21.403 proxy[d891394fee9318] mad [error] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)

2023-11-21T09:33:26.140 app[d891394fee9318] mad [info] 09:33:26.140 [info] Running AppWeb.Endpoint with cowboy 2.10.0 at :::8080 (http)

2023-11-21T09:33:26.140 app[d891394fee9318] mad [info] 09:33:26.140 [info] Access AppWeb.Endpoint at https://imgai.fly.dev

2023-11-21T09:33:30.408 app[d891394fee9318] mad [info] 09:33:30.406 request_id=F5mZl4fQ6dppPnMAADEB [info] GET /

2023-11-21T09:33:30.408 app[d891394fee9318] mad [info] 09:33:30.408 request_id=F5mZl4fQ6dppPnMAADEB [info] Sent 200 in 1ms

2023-11-21T09:33:30.675 app[d891394fee9318] mad [info] 09:33:30.674 [info] CONNECTED TO Phoenix.LiveView.Socket in 23µs

2023-11-21T09:33:30.675 app[d891394fee9318] mad [info] Transport: :websocket

2023-11-21T09:33:30.675 app[d891394fee9318] mad [info] Serializer: Phoenix.Socket.V2.JSONSerializer

2023-11-21T09:33:30.675 app[d891394fee9318] mad [info] Parameters: %{"_csrf_token" => "GUgqOm53Vx5HFQE-SDAkKQQCGTgEBxMqZ9PW42-W-OjQxeNnkwcpk5tB", "_live_referer" => "undefined", "_mounts" => "0", "_track_static" => %{"0" => "https://imgai.fly.dev/assets/app-25b5c2cfcdb3041cab0cdbc67bcf91d9.css?vsn=d", "1" => "https://imgai.fly.dev/assets/app-956037311f0a3945ed53c93e204e141c.js?vsn=d"}, "vsn" => "2.0.0"}

The two log entries that we care about are:

2023-11-21T09:33:01.113 - Starting machine
2023-11-21T09:33:26.140 - Access AppWeb.Endpoint at https://imgai.fly.dev

Takes 25 seconds to start.

Todo

  • Please open a thread on https://community.fly.io to the effect of: "How to Speed up Start Time for an LLM-based App on Fly.io?" 🙏
  • Put as much detail as you can about how you have configured the machine and link back to this repo. 🔗

Remember: the purpose isn't just getting an answer; it's helping others understand the problem and solve it in the future.
The obvious side benefit is getting feedback on the repo/project before posting to HN. 😉

I can post a thread on the forums after I'm finished "completing" the application.

Though I doubt there's anything I can realistically do. I've followed the official advice from Fly.io's blog posts. If the app is taking 25 seconds to boot up, it's because it's loading the model (which is fairly large) into memory while waking the process up. I don't think I have any control over that.

@LuchoTurtle Yeah, don't worry, I don't expect miracles in terms of boot time. 💭
Just to not leave any stone unturned in terms of our quest ... 🦸
It's a good habit to get into in terms of getting help from the wider community. 💬
(you've done it a few times in the past, I know. 👍 just good to continue doing ...)
If you cross-post it to https://elixirforum.com it'll get even more eyes on it. 👀

Definitely mention all the blog posts / docs you followed. 👌

Created the question in https://community.fly.io/t/how-to-speed-up-start-up-time-for-a-bumblebee-app-on-fly-io/17155.

Closing this issue for now. If there are any updates (and if there's anything I can do to speed this up), I'll re-open it depending on the responses to the thread.

ndrean commented

@LuchoTurtle Maybe look at https://fly.io/docs/flyctl/volumes-fork/. I can't test this.

@ndrean I'm already using a volume. All of the models are persisted there and I know they are being loaded from there (it's logged). The point of this issue is that even though the model is loaded from storage, the app still takes a while to boot up.
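For context, persisting the models on a Fly volume is just a `mounts` entry in `fly.toml`. The volume name below is an assumption, but the destination matches the `/app/.bumblebee` mount point visible in the logs above:

```toml
# fly.toml (fragment) — volume name "models" is illustrative;
# the destination matches the mount path in the startup logs.
[mounts]
  source = "models"
  destination = "/app/.bumblebee"
```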

It's understandable, given that the model is fairly large; it takes roughly 25 seconds to load it into memory on the CPU. I was wondering if there's anything I could do programmatically, short of scaling up the machine. :P

ndrean commented

Ok ok, forgive me @LuchoTurtle if I didn't take the time to look more closely at your work, and forgive me if the following just paraphrases your code. My first idea was not to embed the model in the image (so the machine start-up stays short), but rather to ssh into the machine and trigger the download into a volume "by hand", once and for all, and reference it with `{:local, path}` (it felt simpler than `:cache_dir`). Furthermore, I used a GenServer with a `handle_continue` to defer the load. However, I did not pursue this because I still needed a more powerful Fly machine than the free tier, and Fly images are constrained in size. But this was more than a month ago, so things could have changed. I just focused on making this work, and it does, but sadly only locally.
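For anyone reading along, the `handle_continue` trick described above looks roughly like this. This is a hedged sketch, not the repo's actual code: `ImgAI.ModelServer` and `load_fun` are made-up names, and `load_fun` stands in for a real `Bumblebee.load_model/2` call.

```elixir
defmodule ImgAI.ModelServer do
  use GenServer

  # Hypothetical GenServer that defers the slow model load so the
  # supervisor (and the Phoenix endpoint) can start immediately.
  # In the real app, `load_fun` would be something like:
  #   fn -> Bumblebee.load_model({:local, "/app/.bumblebee/resnet-50"}) end

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Return immediately; the heavy work is scheduled via {:continue, ...}
    # and runs after init/1 has unblocked the supervisor.
    {:ok, %{model: nil, load_fun: Keyword.fetch!(opts, :load_fun)}, {:continue, :load_model}}
  end

  @impl true
  def handle_continue(:load_model, state) do
    # handle_continue is guaranteed to run before any other message,
    # so callers never see a half-initialised server.
    {:noreply, %{state | model: state.load_fun.()}}
  end

  @impl true
  def handle_call(:model, _from, state), do: {:reply, state.model, state}
end

# Usage: start_link returns at once; the model is loaded just after.
{:ok, _pid} = ImgAI.ModelServer.start_link(load_fun: fn -> :fake_model end)
IO.inspect(GenServer.call(ImgAI.ModelServer, :model))
# => :fake_model
```

Note this only moves the 25-second load off the supervision critical path; the model is still not usable until `handle_continue` finishes.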

The way this app is implemented, it first looks for the model at a local path on the volume. If nothing is there, it downloads the model. It is not downloaded in the Dockerfile.

Therefore, downloading the model doesn't need to be triggered manually: it happens automatically, exactly once, the first time the app starts and finds no model on the volume. So no need to ssh into the machine :D
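A minimal sketch of that "load locally, download only if missing" flow; module and function names are illustrative, not the repo's actual API, and the injected `download_fun` stands in for the real Bumblebee download:

```elixir
defmodule ImgAI.ModelFetcher do
  # Returns {:cached, path} if the model directory already exists on the
  # volume; otherwise runs `download_fun` once and returns {:downloaded, path}.
  def ensure_model(cache_dir, repo_id, download_fun) do
    path = Path.join(cache_dir, repo_id)

    if File.dir?(path) do
      # Model already on the volume: no network access needed.
      {:cached, path}
    else
      # First boot: create the directory and fetch the model into it.
      File.mkdir_p!(path)
      download_fun.(path)
      {:downloaded, path}
    end
  end
end

# Usage with a throwaway directory and a no-op "download":
dir = Path.join(System.tmp_dir!(), "bumblebee-demo-#{System.unique_integer([:positive])}")
{:downloaded, _} = ImgAI.ModelFetcher.ensure_model(dir, "resnet-50", fn _ -> :ok end)
{:cached, _} = ImgAI.ModelFetcher.ensure_model(dir, "resnet-50", fn _ -> :ok end)
```

The second call hits the cached branch, which is why only the very first boot of a fresh volume pays the download cost.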

I've written about my approach in https://github.com/dwyl/image-classifier/blob/main/deployment.md#5-a-better-model-management

ndrean commented

Great! Simpler indeed. Will look into it.