imjasonh/gcping

Migrate to Cloud Run

imjasonh opened this issue · 5 comments

I've been meaning to migrate to Cloud Run for a while, and I think the time might be coming sooner rather than later.

Why Cloud Run?

  1. Cost: Running 20-some VMs, even f1-micro VMs, with static IP addresses and a global load balancer, incurs a cost in the $150/month range. These VMs are nearly always idle. Cloud Run should reduce VM costs greatly, even using minimum instances to shorten initial request latency.

  2. Reproducibility: Though the VMs are simply COS VMs running a container, they occasionally become unreachable and break the site. Restarting the VMs fixes them, and I could just set up a cron job to restart everything every hour or whatever, but ultimately I'd love to not worry about it. A better option would be a more declarative config bundle that can be applied at any time to update and refresh the site. The setup.sh script has improved a lot in the last ~4 years, but it's still overly imperative and occasionally requires hand edits to add a new region (in particular around static IP addresses). Relying on Cloud Run and making everything driven by a declarative config bundle should make the site more reproducible and able to be run by anybody who wants to run their own copy.

  3. HTTPS: It's almost 2021, and running an HTTP site is frankly embarrassing. #22 has some ideas about how to support HTTPS, but wouldn't it be nice to just rely on a platform that can only serve HTTPS?

Breaking Changes

  1. As part of this migration, absolute latency values will increase -- Cloud Run apps are slightly slower to reach than raw GCE VMs, on the order of a hundred ms or so (TODO: more precise data). The site can still be useful as a relative latency dashboard ("which region is closest to me"), so long as we compare apples to apples. The page should also make its intentions and limitations clearer, to avoid confusion.

  2. Some users have taken a dependency on http://www.gcping.com/config.js as a source of IP addresses expected to be up in a number of regions. I intend to make a new config.js available pointing to regional Cloud Run service addresses, but they won't be HTTP IP addresses, they'll be HTTPS host names. I don't know what everyone's depending on, and this might break people. (see https://www.hyrumslaw.com/)

  3. Cloud Run is not yet available in every GCP region, but I believe it intends to be. As new regions come online GCPing should auto-update to include them.
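To illustrate the "relative latency dashboard" point in item 1, here is a minimal Python sketch (not the project's code). It ranks regions by median latency and shows that a roughly constant overhead, such as the hundred-ms figure mentioned above, leaves the ordering unchanged. The sample values and the `rank_regions` helper are made up for illustration.

```python
from statistics import median


def rank_regions(samples_ms):
    """Return (region, median_ms) pairs sorted from fastest to slowest."""
    medians = {region: median(vals) for region, vals in samples_ms.items() if vals}
    return sorted(medians.items(), key=lambda item: item[1])


# Made-up measurements in milliseconds, for illustration only.
samples = {
    "us-east1": [112.0, 108.0, 130.0],
    "europe-west1": [35.0, 41.0, 38.0],
    "asia-south1": [210.0, 190.0, 205.0],
}
ranking = rank_regions(samples)

# Assumption: the extra Cloud Run overhead is about the same for every region.
shifted = rank_regions({r: [v + 100.0 for v in vals] for r, vals in samples.items()})

print([region for region, _ in ranking])
print([region for region, _ in shifted])  # same order, larger absolute values
```

If the overhead differed a lot between regions, the ordering could change, which is why comparing apples to apples matters.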

Plan

I plan to mostly start fresh from the image ko-built from ./cmd/ping and the frontend JS in index.html. The rest can be configs to deploy Cloud Run Services for every available region, and generate config.js which plugs into the frontend JS. I'll fork a branch from the latest GCE VM version in case people are interested in using that.

We'll still need a global IP and load balancer, with a serverless NEG to route to Cloud Run backends.

I'd like to update the frontend to actually use XHR instead of <img onload> (now that lowering absolute latency isn't a goal, readability/maintainability can take over), which would let the global row show which backend it reached.
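As a rough sketch of why reading the response matters: an `<img onload>` probe can only tell you that the request finished, while a request whose body you can read can also report which backend answered. The Python below models only that timing logic, not the browser code. It assumes that `/ping` returns the region name as its body; `timed_ping`, `fake_fetch`, and the example URL are hypothetical.

```python
import time


def timed_ping(fetch, url):
    """Call fetch(url) and return (elapsed_ms, response body with whitespace stripped)."""
    start = time.perf_counter()
    body = fetch(url)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, body.strip()


def fake_fetch(url):
    # Stand-in for a real HTTP GET: pretends the global endpoint
    # was served by a backend in us-central1.
    time.sleep(0.01)
    return "us-central1\n"


elapsed_ms, backend = timed_ping(fake_fetch, "https://example.com/ping")
print(f"global -> {backend} in {elapsed_ms:.1f} ms")
```

In the real frontend the same idea would apply to an XHR or `fetch` call: time the request, then read the body to label the global row.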

More pricing breakdown:
[Two billing-report screenshots, taken 2020-11-22]

  • Largest single cost is external IP addresses, followed by global load balancing
  • VM costs are the bulk of the total charges, ~$4-6/month/region
  • PD storage costs ~$.50/month/region
  • Network egress for ping responses is ~$.35/month/region
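As a back-of-the-envelope check on the per-region figures in the bullets above, this Python sketch adds them up. It assumes exactly 20 regions ("20-some VMs" in the original) and does not cover the static IP and global load balancer charges, which are not broken out numerically here.

```python
REGIONS = 20  # assumption: "20-some VMs", one per region

# Per-region monthly costs in USD, taken from the bullets above.
vm_low, vm_high = 4.00, 6.00
pd_storage = 0.50
egress = 0.35

low = REGIONS * (vm_low + pd_storage + egress)
high = REGIONS * (vm_high + pd_storage + egress)
print(f"per-region costs: ${low:.2f} to ${high:.2f} per month")
```

Under these assumptions the per-region items come to roughly $97 to $137 per month, with the rest of the approximately $150/month total going to static IPs and the load balancer.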

Cloud Run should be significantly cheaper since we don't need to manage individual IP addresses. We'll still pay for the global load-balanced IP and its associated managed SSL cert, but that should end up being the main remaining cost. If it's too much, I might evaluate dropping the global LB entirely.

The migration is now complete.

http://gcping.com forwards to https://gcping.com using HSTS
http://www.gcping.com forwards to https://gcping.com

Each regional service and the global LB endpoint serves the static HTML+JS frontend on /, and region-specific ping responses on /ping.
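The routing described above can be sketched with Python's standard `http.server`. This is only an illustration, not the actual service (which is built from `./cmd/ping`). It assumes that `/ping` replies with the region name as plain text; `make_handler`, `make_server`, and the placeholder HTML are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

INDEX_HTML = b"<!doctype html><title>gcping sketch</title>"  # placeholder frontend


def make_handler(region):
    """Build a request handler class bound to one region name."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/ping":
                # Assumption: the ping response body is just the region name.
                self._send(200, "text/plain", region.encode())
            elif self.path == "/":
                self._send(200, "text/html", INDEX_HTML)
            else:
                self._send(404, "text/plain", b"not found")

        def _send(self, status, content_type, body):
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return Handler


def make_server(region, port=0):
    """Create (but do not start) a local server; port 0 asks the OS for a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), make_handler(region))
```

To try it locally, something like `make_server("asia-south1", 8080).serve_forever()` should work. On Cloud Run the container would instead listen on all interfaces, on the port given in the `PORT` environment variable.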

The HTML+JS frontend requests https://$region.gcping.com/ping (still using <img>) and https://global.gcping.com/ping.

I'll begin tearing down the GCE VM infrastructure and GCS buckets now.

New site loads so much faster than before. Awesome. I noticed my latency to the GLB went up from 15ms to 28ms, but at the same time pings to regions in Asia went way down and are much more stable.

Not sure whether the latency increase is related to these changes, though. It could be a network anomaly on my end, so take it with a grain of salt until more testing is done.

I ended up getting rid of each https://${region}.gcping.com subdomain, since it required 20-some-odd load balancers and managed SSL certs, and ultimately they don't really add anything besides a little internal vanity. And it costs some $5/day to run that many load balancers, so it's not really worth it.

Instead, I've resurrected config.js which maps the region name to the Cloud Run URL (e.g., https://asia-south1-bmlfzs4h6a-el.a.run.app) which should remain stable across deployments.
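For anyone curious what generating such a file might look like, here is a minimal Python sketch. The real config.js format is not shown in this thread, so the `GCPING_REGIONS` variable name and the `render_config_js` helper are assumptions; only the asia-south1 URL comes from the comment above.

```python
import json


def render_config_js(region_urls):
    """Render a JS snippet that assigns a region -> URL map to a constant."""
    payload = json.dumps(region_urls, indent=2, sort_keys=True)
    # Hypothetical variable name; the actual config.js may use a different shape.
    return f"const GCPING_REGIONS = {payload};\n"


config_js = render_config_js({
    "asia-south1": "https://asia-south1-bmlfzs4h6a-el.a.run.app",
})
print(config_js)
```

Because the map is plain JSON inside a JS assignment, the frontend could iterate over it directly, and other consumers could strip the prefix and parse it as JSON.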