PseudoKV host cooldown
MeijeSibbel opened this issue · 3 comments
After asking Junpei to set upload retries in PseudoKV to infinite and spending most of the day running upload tests with aws s3 sync . /path on thousands of tiny files, I noticed that PseudoKV can easily get stuck and choke on bad hosts indefinitely, uploading many GB of garbage data in the process (see screenshots below). For this upload run we select 20 hosts out of 40 (20 extra hosts).
In our gateway every time a chunk upload fails we get the following error:
2020-08-14T16:13:56.143Z WARN storage/bucket.go:253 failed to upload the content {"path": "/kvs-5/fileaabb", "request_id": "ab3c59fab06672a377cdabf06bf3a9bcd97323b9", "bucket": "kvs-5", "key": "fileaabb/null/1", "contractSetName": "aeb4b744-681d-4cf0-b239-6ec47582d771", "try": 1, "error": "failed to get the corresponding metafile: shard not fund: open /root/.config/storewise/gateway/metafiles/kvs-5/a74982c8-67a1-4dae-a366-f082c85bc5d0/STANDARD/metafiles/aeb4b744-681d-4cf0-b239-6ec47582d771/shard/0: no such file or directory"}
After doing some research into how siad handles unreachable hosts in its filesystem, it appears to use an exponential cooldown mechanism: a host that returns an error is disabled for a certain duration, and if the host fails again the disable time increases exponentially. I don't think decaying values are necessary because this is mostly a stateless process: when the gateway is restarted, the disabled list starts over. This is fair because it's the metadata-server's responsibility to handle bad hosts and migration long-term. That said, it would make sense to communicate the failure rate (cooldown-height?) so that the gateway/metadata-server knows which hosts have the highest priority for migration. Moreover, providing this type of information allows the gateway to determine when it is best to replace the contract set with a new one and send the active set to migration.
Screenshot 1;
Screenshot 2;
A few GB later and no new files have been uploaded;
Logfile from our gateway;
Edit: Although the main issue is addressed in the above commit, it might still be a good idea to add a cooldown period for unresponsive hosts to make uploads more efficient.
Migrated to #126.