storj-archived/core

Publish/offer message propagation timeouts

Closed this issue · 9 comments

storj-lib@v6.4.2
kad-quasar@v1.2.3

Issue: There are timeouts when sending PUBLISH messages with contracts into the network and receiving OFFER messages. The bridge will return { "error": "ESOCKETTIMEDOUT" } from the landlord/renters on a request to PUT /frames/:frame_id. The issue has been tracked down to the RabbitMQ <-> Renter <-> Farmer path, and the RabbitMQ message backlog can overflow with messages, indicating that the bottleneck is between Renter <-> Farmer.

Hypothesis: Taking a look at https://github.com/kadtools/kad-quasar/blob/610bc77e3679f08bc9798b36e04e592456f73fdb/lib/quasar.js#L79, if there are timeouts when relaying to the closest 3 neighbors, the message will not propagate fully, which could leave farmers with available space never receiving the message and therefore unable to respond with an OFFER.
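To illustrate the suspected failure mode, here is a simplified sketch (not the actual kad-quasar code; the function and logger names are illustrative) of what can happen when a relay RPC times out and nothing retries or falls back to another neighbor:

```js
// Simplified illustration of the suspected failure mode: the publication is
// relayed to only the ALPHA (3) closest neighbors, and a timed-out RPC is
// merely logged, so the branch of the propagation tree behind that contact
// never sees the PUBLISH message.
const ALPHA = 3;

function relayPublication(closestNeighbors, message, sendRpc, logger) {
  closestNeighbors.slice(0, ALPHA).forEach((contact) => {
    sendRpc(contact, message, (err) => {
      if (err) {
        // No retry and no alternate neighbor: every farmer reachable only
        // through this contact is silently skipped.
        logger.error('failed to publish message to contact, reason: ' + err.message);
      }
    });
  });
}
```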

Solution A: Verify that the message is sent to neighbors, and if it is not, send it to another neighbor. However, waiting for a timeout before sending to another neighbor may be counterproductive, as the subsequent OFFER would be delayed and the originating request/response would time out.

Solution B: Seed publish messages from multiple locations as a workaround for any non-relaying neighbors, increasing the chances of well-propagated messages.

Solution C: Keep a table of contacts with available storage space and a reliability rating, and send requests directly to the nodeIDs that are the best fit; this would greatly reduce the number of messages that nodes need to relay.
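A minimal sketch of what Solution C could look like (the class, field names, and selection criteria here are assumptions for illustration, not an existing storj-lib API):

```js
// Hypothetical contact table for Solution C: track free space and a
// reliability rating per nodeID, then pick the best-fit contacts to receive
// the contract directly instead of relying on relayed PUBLISH messages.
class FarmerTable {
  constructor() {
    this.entries = new Map(); // nodeID -> { contact, freeSpace, reliability }
  }

  update(nodeID, contact, freeSpace, reliability) {
    this.entries.set(nodeID, { contact, freeSpace, reliability });
  }

  bestFit(shardSize, limit) {
    return [...this.entries.values()]
      .filter((entry) => entry.freeSpace >= shardSize)
      .sort((a, b) => b.reliability - a.reliability)
      .slice(0, limit)
      .map((entry) => entry.contact);
  }
}
```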

Took a dive into the farmer logs and found the following issues that could contribute to timeouts for storage offers.

  1. Before a farmer sends an OFFER (after receiving/sending three PUBLISH messages), it performs a network walk for a node with several FIND_NODE messages, during which a bunch of contacts get updated. This could delay sending the OFFER.

  2. Some of the offers were rejected because the contract was no longer open (this happened fairly frequently):

{"level":"warn","message":"Contract no longer open to offers","timestamp":"2017-05-24T17:45:42.226Z"}
{"level":"warn","message":"Contract no longer open to offers","timestamp":"2017-05-24T18:50:07.173Z"}
{"level":"warn","message":"Contract no longer open to offers","timestamp":"2017-05-24T18:50:34.330Z"}
{"level":"warn","message":"Contract no longer open to offers","timestamp":"2017-05-24T18:52:48.988Z"}
{"level":"warn","message":"Contract no longer open to offers","timestamp":"2017-05-24T19:04:25.796Z"}

  3. Some publish messages fail to send to all three closest neighbors (not very frequent):

{"level":"error","message":"failed to publish message to contact {\"userAgent\":\"6.3.2\",\"protocol\":\"1.1.0\",\"address\":\"149.202.221.31\",\"port\":4003,\"nodeID\":\"ec9fd58419c9667fc7f50785ee14c39796a1880f\",\"lastSeen\":1495648378954} with topic \"0e01\", reason: \"RPC with ID `ef5b13ce658f7a0b3190f8afd76875d9b3ae41ad` timed out\"","timestamp":"2017-05-24T17:56:12.514Z"}
  4. There were cases where bloom filter updates to neighbors failed; if these are not correctly propagated, the publication of messages may not work correctly:

{"level":"warn","message":"failed to update neighbor with bloom filter","timestamp":"2017-05-24T18:57:59.658Z"}

  5. There were also cases where the farmer was unable to send an offer because it was unable to find the renter (not very frequent):

{"level":"warn","message":"could not locate renter for offer","timestamp":"2017-05-24T19:00:44.779Z"}

For numbers 3 and 4: depending on at what level and rate the PUBLISH messages fail, it could be the difference between reaching 12 vs. 700 farmers, which was one of our original thoughts on why there could be issues; the same applies to the bloom filter updates.

For number 1: Including the originating contact in the PUBLISH message (I don't think it's already there) could help farmers respond with an OFFER faster.
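For example, the publication contents could carry the originating renter's contact so the farmer can address the OFFER without a FIND_NODE walk first (a hypothetical payload shape, not the current wire format):

```js
// Hypothetical PUBLISH contents that include the originating renter contact,
// so a farmer can send its OFFER back directly instead of first walking the
// network with FIND_NODE to locate the renter.
const publication = {
  topic: '0e01', // contract topic, as seen in the logs above
  contents: {
    contract: { /* signed contract fields omitted for brevity */ },
    contact: {
      address: 'renter.example.com', // hypothetical renter address
      port: 4000,
      nodeID: 'aaaa000000000000000000000000000000000000' // hypothetical nodeID
    }
  }
};
```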

As a general scaling improvement: We've also talked about having the bridge select a specific 30 or so farmers and send them a PUBLISH message directly with a TTL of zero. This would mean far fewer messages would need to be relayed. We would need some metrics for the selection, including capacity and uptime. There would also need to be some discovery on the farmer side so the bridge is aware of the farmer.
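Roughly, the direct-publish idea might look like this (a sketch under assumptions: publishTo and the metric fields are hypothetical, and ttl: 0 just expresses that the message should not be relayed):

```js
// Hypothetical direct publish from the bridge to ~30 pre-selected farmers,
// ranked by capacity and uptime, with a TTL of zero so nothing is relayed.
const SELECTION_SIZE = 30;

function publishContractDirectly(knownFarmers, contract, publishTo) {
  knownFarmers
    .filter((farmer) => farmer.capacity >= contract.shardSize)
    .sort((a, b) => b.uptime - a.uptime)
    .slice(0, SELECTION_SIZE)
    .forEach((farmer) => {
      publishTo(farmer.contact, contract, { ttl: 0 }, (err) => {
        if (err) {
          // A failed direct send can be retried without multiplying relay
          // traffic across the network.
        }
      });
    });
}
```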

44203 commented

FYI kad-quasar@2.x.x will callback with errors so implementors can employ retry logic on things like failed bloom filter exchange, publication failures, etc.
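As an illustration of the kind of retry logic that enables (a hypothetical sketch: quasarPublish below stands in for whatever callback-based publish API 2.x exposes and is not a confirmed signature):

```js
// Hypothetical retry wrapper around a callback-based publish. The point is
// only that an error callback makes simple retry logic possible; the actual
// kad-quasar@2.x API may differ.
function publishWithRetry(quasarPublish, topic, contents, retriesLeft, done) {
  quasarPublish(topic, contents, (err) => {
    if (err && retriesLeft > 0) {
      return publishWithRetry(quasarPublish, topic, contents, retriesLeft - 1, done);
    }
    done(err);
  });
}
```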

Until then, another possible interim solution would be to increase the ALPHA used by quasar (currently 3) and decrease the TTL on publications, which would provide wider but shallower message propagation overall (given that all nodes are subscribed to the same topics right now, that might be a sufficient solution until topic diversity increases).

Perhaps update ALPHA to 8 and decrease TTL to 1. Unless my math is wrong, that should end up as 8^2 (64), which is a lot less chatty than 3^6 (729) and might give us a more even distribution. Plus, given that we only queue 24 mirrors at a time, it seems reasonable to only attempt to reach 3x that.

8^2 + 8 = 72
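A quick way to sanity-check these fan-out figures (a rough estimate assuming ideal propagation with no duplicate deliveries, and that a publication makes the initial send plus TTL relay hops; these assumptions are mine, not taken from kad-quasar):

```js
// Rough reach estimate: sum of alpha^hop over (ttl + 1) hops, i.e. the
// initial send plus `ttl` relays, assuming every hop fans out to `alpha`
// fresh contacts with no overlap.
function estimateReach(alpha, ttl) {
  let total = 0;
  for (let hop = 1; hop <= ttl + 1; hop++) {
    total += Math.pow(alpha, hop);
  }
  return total;
}

console.log(estimateReach(8, 1)); // 72, matching 8^2 + 8 above
```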

I would expect that 4 of my neighbours will receive the contract twice. In the end, only 40 OFFERs.

In the last stress test my farmer was unable to send any OFFERs because a single farmer is running 1000 nodes and creates mirrors at less than 10 KByte/s. I am sure my neighbors have to deal with him as well. That will reduce the total number of OFFERs as well.

I don't think the bridge will get a full mirror queue.

> FYI kad-quasar@2.x.x will callback with errors so implementors can employ retry logic on things like failed bloom filter exchange, publication failures, etc.

Unfortunately, by the time there is a timeout, retrying will be too late to respond, as the API request will likely time out before a retry can be made.

Increasing the width via a larger ALPHA may help guard against those types of failures (3 ^ 6 = 729 | 5 ^ 4 = 625); however, we would need to verify that farmers further away in distance would also have a fair opportunity to respond with an OFFER.

Perhaps things could be flipped around so that farmers broadcast a message via quasar to bridge/renter subscribers with details of the space they have available. The bridges/renters would then keep track of known farmers and send messages to them directly. This way the propagation of those messages can take longer and include retries to make sure they are propagated fully.
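A minimal sketch of the flipped model (the topic name, payload fields, and quasarSubscribe are assumptions for illustration):

```js
// Hypothetical bridge-side subscriber: farmers periodically announce their
// available capacity, and the bridge keeps a table of known farmers so it
// can later contact the best candidates directly (with retries as needed).
const knownFarmers = new Map(); // nodeID -> { contact, freeSpace, seenAt }

function trackCapacityAnnouncements(quasarSubscribe) {
  quasarSubscribe('capacity-announcement', (announcement) => {
    knownFarmers.set(announcement.contact.nodeID, {
      contact: announcement.contact,
      freeSpace: announcement.freeSpace,
      seenAt: Date.now()
    });
  });
}
```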

44203 commented

Implemented in storjd as capacity announcements/CLAIM RPC

👋 Hey! Thanks for this contribution. Apologies for the delay in responding!

We've decided to rearchitect Storj, so that we can scale better. You can read more about this decision here. This means that we are entirely focused on v3 at the moment, in the storj/storj repository. Our white paper for v3 is coming very, very soon - follow along on the blog and in our Rocketchat.

As this repository is part of the v2 network, we're no longer maintaining this repository. I am going to close this for now. If you have any questions, I encourage you to jump on Rocketchat and ask them there. Thanks!