derhuerst/db-rest

problems with availability of hosted version

HerrLevin opened this issue · 3 comments

Over the last 24h, the hosted version at v5.db.transport.rest has kept going offline. Every request (except to the docs) doesn't return anything; it doesn't even time out. My self-hosted version works as expected.
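To show what I mean, here's a minimal probe sketch (TypeScript, Node 18+; the query value and the 10s timeout are arbitrary) that aborts instead of hanging forever:

```ts
// Probe the hosted API with an explicit timeout, since requests seem to
// hang forever instead of failing fast. (Sketch; query and 10s are arbitrary.)
const controller = new AbortController()
const timer = setTimeout(() => controller.abort(), 10_000)

try {
	const res = await fetch('https://v5.db.transport.rest/locations?query=hbf', {
		signal: controller.signal,
	})
	console.log(res.status, await res.json())
} catch (err) {
	console.error('request failed or timed out:', err)
} finally {
	clearTimeout(timer)
}
```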

There were two overlapping issues:

Because the direkt.bahn.guru postings on Twitter and Reddit went a bit viral, there has been, and still is, a lot of traffic on the stations search API (v5.db.transport.rest/locations?query=…), so response times are higher than usual. This is due to several factors:
- A not very powerful server (Scaleway DEV1-S, 2 vCPUs, 2GB RAM) running all 4 v5.{bvg,db,hvv,vbb}.transport.rest APIs. The CPU is being exhausted briefly every few seconds.
- Something causes high latencies in the Caddy v1 load balancer -> db-rest -> Redis cache -> db-rest -> Caddy chain, and I haven't investigated what or why yet (see the timing sketch after this list). On my laptop, the response time (for the same query, served from Redis) is in the low milliseconds, but on the server it's usually 25-60ms. 🤔
- If a request is not served from Redis, it uses Deutsche Bahn's HAFAS API, which is awfully slow for basic stations search requests.
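To illustrate the latency point from the list above, here's a rough timing sketch (TypeScript, Node 18+; the query value is just an example). After the first request, identical queries should be served from Redis, so the later timings roughly measure the Caddy -> db-rest -> Redis round trip:

```ts
// Time repeated identical /locations queries. After the first one, responses
// should come from the Redis cache, so later timings roughly measure the
// Caddy -> db-rest -> Redis round trip. (Sketch; query value is an example.)
const url = 'https://v5.db.transport.rest/locations?query=frankfurt'

for (let i = 0; i < 5; i++) {
	const t0 = performance.now()
	const res = await fetch(url)
	await res.arrayBuffer() // wait for the full body to arrive
	console.log(`request ${i}: HTTP ${res.status}, ${(performance.now() - t0).toFixed(1)}ms`)
}
```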

Until yesterday, I had the Caddy load balancer configured to do active health checks against db-rest, which in turn checks if HAFAS works by querying departures at some station on the next Monday. For a reason I don't know, the HAFAS API responds with errors every now and then; this causes the health check to fail, which in turn causes Caddy to take the entire v5.db.transport.rest API offline, responding with 504. I have turned this off temporarily, because with this setup, the benefits (better stability of v5.db.transport.rest itself) are not worth the disadvantages (the API frequently being completely down due to random HAFAS failures).
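For context, the health check works roughly like the following sketch. It is not the actual db-rest code: the hafas-client usage is simplified, and the station ID '8011160' (Berlin Hbf) and the 10:00 time are assumptions for illustration. The point is that a single flaky HAFAS response makes the whole check fail:

```ts
// Rough sketch of the db-rest health check logic: query departures at some
// station for the next Monday; any HAFAS error fails the check.
// Assumptions: hafas-client v5 API; station ID '8011160' (Berlin Hbf) and
// the 10:00 departure time are picked just for illustration.
import createClient from 'hafas-client'
import dbProfile from 'hafas-client/p/db/index.js'

const hafas = createClient(dbProfile, 'db-rest-health-check-example')

const nextMonday = (): Date => {
	const d = new Date()
	d.setDate(d.getDate() + (((8 - d.getDay()) % 7) || 7)) // always a *future* Monday
	d.setHours(10, 0, 0, 0)
	return d
}

const healthCheck = async (): Promise<boolean> => {
	try {
		const deps = await hafas.departures('8011160', {when: nextMonday(), duration: 10})
		return Array.isArray(deps) && deps.length > 0
	} catch {
		// A single flaky HAFAS response takes the whole API "down" via Caddy.
		return false
	}
}

console.log('HAFAS healthy:', await healthCheck())
```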


504s should not appear often anymore, but the overall response time is still high. I'm reluctant to move the APIs to a more powerful server though, because

  • the traffic will drop in the coming days.
  • I'm currently experimenting with moving v5.{bvg,db,hvv,vbb}.transport.rest to a Kubernetes cluster run by @juliuste. I'll check whether that improves the behaviour under load.

Regardless, if you want to volunteer running a db-rest instance, you're very much welcome to do so!

This is the status page BTW: https://stats.uptimerobot.com/57wNLs39M/784879516

Not sure if the reported 10h of 404 downtime was actually an issue. 🤔

I'll close this. Please re-open if you have more questions related to this.