superfly/litefs

Proxy Server not starting up?

Closed this issue · 9 comments

I tried deploying today and got an error. I upgraded to the latest version and received the same error. It appears to be due to the proxy server not starting up. Here are relevant logs from my server during the startup:

Pulling container image registry.fly.io/kcd:deployment-01HK8RHG7YCC6VMEHH0HVGGJ6B
Successfully prepared image registry.fly.io/kcd:deployment-01HK8RHG7YCC6VMEHH0HVGGJ6B (3.142341191s)
Configuring firecracker
[    0.053673] PCI: Fatal: No config space access function found
 INFO Starting init (commit: 15238e9)...
 INFO Preparing to run: `docker-entrypoint.sh node ./other/sentry-create-release` as root
 INFO [fly api proxy] listening at /.fly/api
2024/01/03 22:54:48 listening on [fdaa:0:23df:a7b:15c:75cd:f4be:2]:22 (DNS: [fdaa::3]:53)
> Found 262 files
> Analyzing 262 sources
> Rewriting sources
> Adding source map references
> Bundled 262 files for upload
> Bundle ID: 1fea6fe0-72f9-51f5-8ae5-a841dc3156ca
> Nothing to upload, all files are on the server
> File upload complete (processing pending on server)
> Organization: kent-c-dodds-tech-llc
> Project: kcd-node
> Release: 70189f2167cd62a6c58d5ae8c76839402092e0e3
> Dist: None
> Upload type: artifact bundle
Source Map Upload Report
  Minified Scripts
    #build/_shared/chunk-26EPNH6W.js (sourcemap at /build/_shared/chunk-26EPNH6W.js.map)
    ... etc ...
  Source Maps
    #build/_assets/app-3QAQV6F7.css.map
    ... etc ...
 INFO Main child exited normally with code: 0
 INFO Starting clean up.
 WARN hallpass exited, pid: 307, status: signal: 15 (SIGTERM)
2024/01/03 22:54:54 listening on [fdaa:0:23df:a7b:15c:75cd:f4be:2]:22 (DNS: [fdaa::3]:53)
[    7.334953] reboot: Restarting system
machine restart policy set to 'no', not restarting

That's not entirely helpful. Here are the deploy logs:

deployment-01HK8RHG7YCC6VMEHH0HVGGJ6B: digest: sha256:b149635ea304b10706b218e4bd6dd1fd146fb40b554f77f95f30a05e6eb04202 size: 4299
--> Pushing image done
image: registry.fly.io/kcd:deployment-01HK8RHG7YCC6VMEHH0HVGGJ6B
image size: 1.5 GB

Watch your deployment at https://fly.io/apps/kcd/monitoring

Running kcd release_command: node ./other/sentry-create-release
> Created release_command machine 4d891404a70638
> Waiting for 4d891404a70638 to have state: started
> Machine 4d891404a70638 has state: started
> Waiting for 4d891404a70638 to have state: destroyed
> Machine 4d891404a70638 has state: destroyed
> Waiting for 4d891404a70638 to get exit event
✔ release_command 4d891404a70638 completed successfully
Updating existing machines in 'kcd' with rolling strategy
> [01/10] Updating 148ed669b9e389 [app]
> [01/10] Updating 148ed669b9e389 [app]
> [01/10] Waiting for 148ed669b9e389 [app] to have state: started
> [01/10] Machine 148ed669b9e389 [app] has state: started
> [01/10] Checking that 148ed669b9e389 [app] is up and running
> [01/10] Waiting for 148ed669b9e389 [app] to become healthy: 0/3


WARNING The app is not listening on the expected address and will not be reachable by fly-proxy.
You can fix this by configuring your app to listen on the following addresses:
✖ [01/10] Machine 148ed669b9e389 [app] update failed: timeout reached waiting for health checks to pass for machine 148ed669b9e389: failed to get VM 148ed669b9e389: Get "https://api.machines.dev/v1/apps/kcd/machines/148ed669b9e389": net/http: request canceled
  - 0.0.0.0:8080
Found these processes inside the machine with open listening sockets:
  PROCESS        | ADDRESSES                             
-----------------*---------------------------------------
  litefs mount   | [::]:20202                            
  /.fly/hallpass | [fdaa:0:23df:a7b:176:[260](https://github.com/kentcdodds/kentcdodds.com/actions/runs/7403195013/job/20142787532#step:6:261)5:e89b:2]:22  

Error: timeout reached waiting for health checks to pass for machine 148ed669b9e389: failed to get VM 148ed669b9e389: Get "https://api.machines.dev/v1/apps/kcd/machines/148ed669b9e389": net/http: request canceled
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Error: Process completed with exit code 1.

That's a bit more helpful. From that it appears that no server is listening on port 8080 as configured in my fly.toml, which is set as the INTERNAL_PORT environment variable in my Dockerfile which is referenced as the proxy.addr in my litefs.yml.

I'm not sure what's changed from the last successful deploy. I haven't changed anything with LiteFS, Dockerfile, or fly.toml configuration (other than try to upgrade litefs after the first failure).

It does look like litefs mount has mounted to port 20202, but I believe that's for the consul service. I think we should see another entry in the processes for litefs mount listening on port 8080, and that's why I think this is a litefs issue.

Does the deploy work if you disable the proxy from litefs.yml? You may need to temporarily disable the health check as well.

Giving this a shot now.

Well, I have good news and bad news. The bad news is that disabling the proxy didn't seem to fix anything. The good news is that this is probably not the fault of LiteFS's proxy and it's my problem not yours 😅

Good grief, when I decided to overscope my personal website to match the constraints, requirements, and use cases of bigger companies with teams of full time developers, I didn't consider the fact that it would take more maintenance than a single person would devote to their side project 🤡 Why do I do this to myself?

That said, can you explain why the proxy server isn't listed in the available ports? Does the proxy server wait for the internal server to start up before it starts?

That said, can you explain why the proxy server isn't listed in the available ports? Does the proxy server wait for the internal server to start up before it starts?

The proxy server doesn't start up until the LiteFS node has sync'd with the primary or has become primary so it may be delayed when starting up.

Ah, and it doesn't become primary until the exec is successful?

Ok, so I shuffled some things around and I'm right back where I started except now it's working 🤷‍♂️

Feel free to watch in case it's helpful: https://www.loom.com/share/87c43ce85c8a4d58a76d0963b38eebe5

I'm not sure whether this is related, but this started happening yesterday as well, so maybe I'll start having deployment issues again? https://www.loom.com/share/462a794cbd4444928d0c028fb4558ddd

I'm going to scale down to a single region again and see whether that fixes anything.

To circle back, scaling to 1 and restarting the server fixed the last issue. I've scaled up to one additional region and I've enabled tracing (https://fly.io/docs/litefs/config/#trace-log) and we'll keep an eye on it. @benbjohnson thinks this is either LiteFS or my application code putting a lock on the db, though the "SQLite error or database missing" error is strange.