Update to use delegated-routing for querying storetheindex
BigLep opened this issue · 19 comments
Done Criteria
Updated 2022-08-11 to capture the latest state:
- Hydras in production across the whole fleet query storetheindex using reframe rather than the storetheindex provider that was added in #158
- The custom storetheindex code in libp2p/hydra-booster is removed, and that removal is deployed to production.
- Hydra dashboards have metrics for their calls to storetheindex. We can answer these questions:
- Number of calls Hydra made to STI (regardless of whether they succeeded)
- Number of calls for which Hydra got a 2xx (success) response from STI (regardless of whether STI has providers for the given CID)
- Number of calls that failed on the server (e.g., 5xx due to a server issue)
- Number of calls where the client timed out (and thus didn't get a server response)
- Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers.
- Latency of each request, broken out by status code. (A sketch of how these metrics could be recorded follows this list.)
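For illustration, here is a minimal sketch of how these counters and distributions could be recorded with OpenCensus (which hydra-booster uses for its metrics); the measure names, bucket boundaries, and the "2xx" tag value are assumptions, not the names used on the production dashboard:

```go
package metrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measure and tag names; the real hydra-booster metrics may differ.
var (
	keyStatus = tag.MustNewKey("status")

	stiRequests  = stats.Int64("sti_requests_total", "Calls made to storetheindex", stats.UnitDimensionless)
	stiLatencyMs = stats.Float64("sti_request_latency_ms", "Latency of storetheindex calls", stats.UnitMilliseconds)
	stiProviders = stats.Int64("sti_providers_per_response", "Providers returned per 2xx response", stats.UnitDimensionless)
)

// Views aggregate the raw measurements; the Distribution buckets are what let a
// dashboard report p90/p99 rather than only averages.
var Views = []*view.View{
	{Measure: stiRequests, Aggregation: view.Count(), TagKeys: []tag.Key{keyStatus}},
	{Measure: stiLatencyMs, Aggregation: view.Distribution(10, 50, 100, 250, 500, 1000, 3000), TagKeys: []tag.Key{keyStatus}},
	{Measure: stiProviders, Aggregation: view.Distribution(0, 1, 2, 5, 10, 20, 50, 100)},
}

// RecordResponse records one completed call to STI.
func RecordResponse(ctx context.Context, status string, latencyMs float64, numProviders int) {
	ctx, _ = tag.New(ctx, tag.Upsert(keyStatus, status))
	stats.Record(ctx, stiRequests.M(1), stiLatencyMs.M(latencyMs))
	if status == "2xx" {
		stats.Record(ctx, stiProviders.M(int64(numProviders)))
	}
}
```

At startup the views would be registered with view.Register(Views...) and exposed through the existing Prometheus exporter.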
Why Important
Provides the first production validation of delegated routing, giving us the confidence to add it to Kubo as part of ipfs/kubo#8775.
Notes
- We will use the ipld/edelweiss-generated version of ipfs/go-delegated-routing, which is happening in ipfs/go-delegated-routing#11.
- This depends on storetheindex exposing a delegated-routing endpoint, which is happening in ipni/storetheindex#251.
- https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test-.* is the Hydra dashboard that should be updated.
- How storetheindex should respond when it has 0 results for a given CID (and the corresponding status code) is an open spec item being clarified in ipfs/specs#308.
- This is "Stage 0" in https://www.notion.so/pl-strflt/Indexer-IPFS-Reframe-Q3-Plan-77d0f3fa193b438784d2d85096e315e7
- We don't need to include/deploy the latest "read"-related functionality in the reframe spec, including HTTP caching. That will happen separately when ipfs/go-delegated-routing#27 completes.
Estimation Notes
2022-08-19 estimates of work remaining:
@BigLep All tasks are done here, specifically:
- Hydra has metrics for Reframe path
- STI has metrics for Reframe path
- All old STI code is removed from Hydra
- Both Hydra and STI use the newest version of Delegated Routing and Edelweiss
- We've verified that Hydra and STI talk to each other
The next steps would be:
- deployment of STI to production (@willscott), then
- deployment of Hydra to production (@petar)
Thanks @petar. Let's track the storetheindex production deployment in ipni/storetheindex#251. Hydra production deployment will be tracked here.
To be clear, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Good stuff - almost there!
@petar : following up, has https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters?orgId=1&var-flight=hydra-test- been updated to use the go-delegated-routing metrics rather than the existing sti-provider metrics?
Yes. The delegated routing code replaces the STI code and uses the same metric names. So the dashboard should work unchanged.
@petar : last thing for closing this out. Has the custom storetheindex code in libp2p/hydra-booster been removed?
Yes.
@thattommyhall I believe you pinged that we're still using the older protocol for all but the test instance - and that's what I see on the dashboard as well.
I didn't see the removal of the HTTP indexer code go by on GitHub, but noting it here because it means we probably should coordinate a broader deployment of reframe before we end up with a deployment that doesn't support the current setup.
Discussion about this effort is currently happening in the #reframe channel: https://filecoinproject.slack.com/archives/C03RQFURZLM
Per 2022-08-12 verbal conversations, @guseggert is going to drive this effort to close and will consult with @petar as needed.
So the logs originally showed the timeouts were due to a timeout while reading the response body:
2022-08-14T09:37:46.803Z ERROR service/client/delegatedrouting proto/proto_edelweiss.go:1234 client received error response (context deadline exceeded (Client.Timeout or context cancellation while reading body))
This morning I deployed 37dda22 to the test flight, which publishes more detailed error metrics for Reframe and also upgrades to go-libp2p@v0.21 and Go 1.18. After deployment, the timeouts basically disappeared, and it's been baking for a few hours and the traffic is at similar levels now without timeouts. This leads me to believe that the issue is probably that the Hydra node was taking too long to read the response body due to some environmental issue (overload from some other work it was doing). Something in the go-libp2p or Go upgrades might have also alleviated the bottleneck, e.g. libp2p resource manager. We see similar timeouts in prod with the non-Reframe StoreTheIndex client, although not nearly at the same rate, but the test flight could have gotten unlucky and been placed into a hot partition, so it could still be the same issue.
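One detail worth noting about the error above: Go's http.Client.Timeout covers the whole exchange, including reading the response body, so an overloaded client that cannot drain the body fast enough cancels its own request. A minimal sketch of that behavior (the 3-second timeout is illustrative, not the production value, and the URL is a placeholder):

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// fetchAll shows where a client-side timeout can fire: http.Client.Timeout spans
// connection, headers, and body, so a slow read of the body surfaces as
// "context cancellation while reading body" even though the server responded.
func fetchAll(url string) ([]byte, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // the timeout can still fire here, mid-read
}
```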
My next step is to get some metrics into the dashboard on libp2p Resource Manager to see if it's throttling anything, and understand the impact of that on the network, see if we need to tweak limits, etc. I'm guessing that RM is throttling b/c the AddProvider rate is much lower, while the STI rate is the same.
Enumerating the options to mitigate overloading:
- Add a rate limiter to cap the rate of AddProvider DHT calls (a minimal sketch follows this list)
- We should do this regardless of the other options, as this is the only way we can prevent nodes from being overloaded when traffic patterns change
- Some calls will start failing for the other nodes that are calling AddProvider; what's the impact of this?
- This might already be happening with the libp2p upgrade and libp2p resource manager
- Reduce the number of heads that each node runs
- This will increase the overall cost as we'll need to scale up the fleet to accommodate
- Traffic pattern changes in the network could still cause overloads
- Do some analysis on the nature of the calls to see if caching could alleviate the load
- Are there hot CIDs/addrs that we could shed w/ caching?
- Traffic pattern changes in the network could still cause overloads
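A minimal sketch of the rate-limiter option, using golang.org/x/time/rate; the wrapper, its integration point, and the numbers passed in are assumptions, not code from hydra-booster:

```go
package main

import (
	"errors"

	"golang.org/x/time/rate"
)

// addProviderLimiter caps how many AddProvider DHT calls a head will accept.
type addProviderLimiter struct {
	lim *rate.Limiter
}

// newAddProviderLimiter takes a sustained per-second rate and a burst size;
// the real values would have to be tuned to what the fleet can sustain.
func newAddProviderLimiter(perSecond float64, burst int) *addProviderLimiter {
	return &addProviderLimiter{lim: rate.NewLimiter(rate.Limit(perSecond), burst)}
}

// Allow returns an error when the budget is exhausted, so the caller can reject
// the incoming call outright instead of queueing it and falling further behind.
func (l *addProviderLimiter) Allow() error {
	if !l.lim.Allow() {
		return errors.New("AddProvider rate limit exceeded")
	}
	return nil
}
```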
I've integrated Resource Manager, added RM metrics, added them to the dashboard, and tweaked the RM limits to be low enough to throttle spikes but to generally allow most traffic. The test node is now operating at the same capacity as before, but with minimal timeouts. There are still occasional timeouts (about 0.3% of reqs). These are timeouts reading response headers, so this may be a server-side thing, although I will increase the client-side timeout to 3 seconds to allow for e.g. GC to run w/o causing timeouts.
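For reference, wiring the libp2p resource manager into a host looks roughly like the sketch below in recent go-libp2p releases; the import paths and exact API differ between versions (this work was on go-libp2p v0.21), and the limits shown are just the scaled defaults rather than the values tuned for the Hydra fleet:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

// newThrottledHost builds a host whose resource manager throttles inbound work
// once the configured limits are hit, instead of letting the node be overloaded.
func newThrottledHost() (host.Host, error) {
	limits := rcmgr.DefaultLimits           // start from the library defaults
	libp2p.SetDefaultServiceLimits(&limits) // fill in per-service defaults

	limiter := rcmgr.NewFixedLimiter(limits.AutoScale())
	rm, err := rcmgr.NewResourceManager(limiter)
	if err != nil {
		return nil, err
	}
	return libp2p.New(libp2p.ResourceManager(rm))
}
```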
I'm working on this branch: https://github.com/libp2p/hydra-booster/tree/feat/reframe-metrics
I'll get a PR worked up, and continue to let this bake today. If it looks okay tomorrow morning, I'll roll it out to the rest of the fleet.
2022-08-19 conversation:
- PR is out with updated libp2p, go version, metrics, etc: #177
- We've deployed to the test flight.
- Planning to deploy to production 2022-08-22
@guseggert : other thoughts from looking at this afterwards:
- I worry that it isn't going to be clear for anyone looking at "StoreTheIndex Requests / Sec" what "Net", "NetTimeout", and "Other" mean. Can we maybe add an "info panel" (assuming something like that exists) with an explainer note and a link to canonical information?
- Please handle ipfs/specs#308 and ensure go-delegated-routing is doing the right thing.
- For the latency metrics, do we have other values besides the average? For example, I'm curious what the p99/p100 is for "success".
- Did we do this from the done criteria: "Distribution of 2xx response payload sizes (in terms of number of records). For each 2xx response, we should accumulate a metric for the number of providers in the response. This allows us to say that the p90 of responses have X providers."
Update: last week I deployed Reframe to the full Hydra fleet, but almost all reqs started timing out, so I rolled it back. I've been debugging w/ @masih in between traveling.
Yesterday there was an STI event that caused the HTTP endpoint to behave a lot like the Reframe timeouts, so I'm working with @masih to understand the root cause. If it doesn't rule out Reframe, then I'll wait for the root-cause fix and redeploy to see if it also works for Reframe; if it doesn't, then we'll need to do some request tracing through the infrastructure to see where exactly the timeouts are occurring. This might require adding request IDs to the reqs, passing those through LBs, proxies, etc., and adding them to log messages on the STI side.
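If we do end up needing that tracing, a hypothetical X-Request-Id middleware on the server side would be enough to correlate a client log line with the LB, proxy, and STI logs; the header name and log format here are assumptions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/google/uuid"
)

// withRequestID tags each request with an X-Request-Id so the same identifier
// can appear in client logs, LB/proxy access logs, and server logs.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-Id")
		if id == "" {
			id = uuid.NewString() // generate one if the client/LB didn't set it
		}
		w.Header().Set("X-Request-Id", id)
		log.Printf("request_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}
```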
All fixes for the storetheindex outage yesterday are now deployed. At this time it is unclear whether those fixes would also resolve the timeouts observed when reframe was deployed. We can try the deployment and see, provided that wouldn't be too disruptive to users in case it doesn't help.
Thanks for the updates guys. I'll keep following - let me know if anything is needed.
- Extensive update of metrics (added resource manager metrics, length of STI responses, etc.)
- Pushed through the edelweiss change to allow "cacheable" methods, which switches FindProviders from POST to GET so that it can be cached by a CDN (see the sketch after this list)
- Plumbed that through go-delegated-routing
- Deployed to the Hydras
- Updated dashboard
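To illustrate why the POST-to-GET switch matters (this is a generic sketch, not the storetheindex endpoint): CDNs typically cache only GET responses, keyed by URL, so a cacheable GET plus a Cache-Control header lets the edge absorb repeated lookups for hot CIDs. The path, TTL, and payload below are placeholders:

```go
package main

import (
	"log"
	"net/http"
)

// findProvidersHandler serves a GET response with a Cache-Control header so an
// edge cache can answer repeated lookups without hitting the origin.
func findProvidersHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Cache-Control", "public, max-age=300") // illustrative TTL
	w.Header().Set("Content-Type", "application/json")
	w.Write([]byte(`{"Providers":[]}`)) // placeholder payload
}

func main() {
	http.HandleFunc("/providers", findProvidersHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```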
This is now deployed and operational, so closing.