lukechampine/us

PseudoKV host cooldown

MeijeSibbel opened this issue · 3 comments

After asking Junpei to set upload retries in PseudoKV to infinite, I spent most of the day running upload tests with aws s3 sync . /path on thousands of tiny files. I noticed that PseudoKV can easily get stuck on bad hosts indefinitely, and in the process upload many GBs of garbage data (see the screenshots below). For this upload run we selected 20 hosts out of 40 (20 extra hosts).

In our gateway every time a chunk upload fails we get the following error:

2020-08-14T16:13:56.143Z        WARN  storage/bucket.go:253     failed to upload the content    {"path": "/kvs-5/fileaabb", "request_id": "ab3c59fab06672a377cdabf06bf3a9bcd97323b9", "bucket": "kvs-5", "key": "fileaabb/null/1", "contractSetName": "aeb4b744-681d-4cf0-b239-6ec47582d771", "try": 1, "error": "failed to get the corresponding metafile: shard not fund: open /root/.config/storewise/gateway/metafiles/kvs-5/a74982c8-67a1-4dae-a366-f082c85bc5d0/STANDARD/metafiles/aeb4b744-681d-4cf0-b239-6ec47582d771/shard/0: no such file or directory"}

After doing some research into how siad handles unreachable hosts in its filesystem, it appears to use an exponential cooldown mechanism: a host that returns an error is disabled for a certain duration, and if the host fails again the disable time increases exponentially. I don't think decaying values are necessary, because this is mostly a stateless process: when the gateway is restarted, the disabled list starts over. This is fair because it's the metadata-server's responsibility to handle bad hosts and migration long-term. That said, it would make sense to communicate the failure rate (cooldown-height?) so that the gateway/metadata-server knows which hosts have the highest priority for migration. Moreover, providing this type of information allows the gateway to determine when it is best to replace the contract set with a new one and send the active set to migration.

Screenshot 1:

image

Screenshot 2:

A few GB later and no new files have been uploaded:

image

Logfile from our gateway:

log.txt

Hopefully, this commit 68b2c1b fixes the above missing shard error.

Edit: Although the main issue is addressed in the above commit, it might still be a good idea to add a cooldown period for unresponsive hosts to make uploads more efficient.

Migrated to #126.