After asking Junpei to set the upload retries in PseudoKV to infinite and spending most of the day running upload tests with `aws s3 sync . /path` over thousands of tiny files, I noticed that PseudoKV can easily get stuck and choke on bad hosts indefinitely, and in the process upload many GB of garbage data (see screenshots below). For this upload run we selected 20 hosts out of 40 (20 extra hosts).
In our gateway, every time a chunk upload fails we get the following error:

```
2020-08-14T16:13:56.143Z WARN storage/bucket.go:253 failed to upload the content {"path": "/kvs-5/fileaabb", "request_id": "ab3c59fab06672a377cdabf06bf3a9bcd97323b9", "bucket": "kvs-5", "key": "fileaabb/null/1", "contractSetName": "aeb4b744-681d-4cf0-b239-6ec47582d771", "try": 1, "error": "failed to get the corresponding metafile: shard not fund: open /root/.config/storewise/gateway/metafiles/kvs-5/a74982c8-67a1-4dae-a366-f082c85bc5d0/STANDARD/metafiles/aeb4b744-681d-4cf0-b239-6ec47582d771/shard/0: no such file or directory"}
```
After doing some research into how siad handles unreachable hosts in its filesystem, it appears to use an exponential cooldown mechanism: a host that returns an error gets disabled for a certain duration, and if the host fails again the disable time increases exponentially. I don't think decaying values are necessary because this is mostly a stateless process: when the gateway is restarted, the disabled list starts over. This is fair because it's the metadata-server's responsibility to handle bad hosts and migration long-term. That said, it would make sense to communicate the failure rate (cooldown-height?) so that the gateway/metadata-server knows which hosts have the highest priority for migration. Moreover, providing this type of information allows the gateway to determine when it is best to replace the contract set with a new one and send the active set to migration.
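As a rough sketch of what such a cooldown tracker could look like in the gateway (all names and parameters below are hypothetical and not taken from siad or our code), something along these lines in Go would cover both the exponentially growing disable time and a failure count we could report to the metadata-server:

```go
package hostcooldown

import (
	"sync"
	"time"
)

// Tracker keeps an in-memory cooldown per host. Because nothing is
// persisted, restarting the gateway clears the list, matching the
// stateless behaviour described above.
type Tracker struct {
	mu    sync.Mutex
	base  time.Duration // cooldown after the first failure
	max   time.Duration // upper bound on the cooldown
	hosts map[string]*entry
}

type entry struct {
	failures int       // consecutive failures, cleared on success
	until    time.Time // host is skipped until this time
}

func NewTracker(base, max time.Duration) *Tracker {
	return &Tracker{base: base, max: max, hosts: make(map[string]*entry)}
}

// Fail records a failed upload and doubles the cooldown for every
// consecutive failure: base, 2*base, 4*base, ... capped at max.
func (t *Tracker) Fail(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.hosts[host]
	if !ok {
		e = &entry{}
		t.hosts[host] = e
	}
	e.failures++
	d := t.base << uint(e.failures-1)
	if d > t.max || d <= 0 { // also guards against shift overflow
		d = t.max
	}
	e.until = time.Now().Add(d)
}

// Succeed clears the host's cooldown after a successful upload.
func (t *Tracker) Succeed(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.hosts, host)
}

// Usable reports whether the host may be picked for the next chunk.
func (t *Tracker) Usable(host string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.hosts[host]
	return !ok || time.Now().After(e.until)
}

// Failures exposes the consecutive-failure count so the gateway or
// metadata-server can prioritise the worst hosts for migration.
func (t *Tracker) Failures(host string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.hosts[host]; ok {
		return e.failures
	}
	return 0
}
```

The upload path would then call Usable before picking a host for a chunk, Fail on every failed shard upload, and Succeed once a shard lands; keeping the state purely in memory preserves the stateless behaviour described above.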
Screenshot 1:

Screenshot 2:

A few GB later and no new files have been uploaded:
Logfile from our gateway: log.txt

Edit: Although the main issue is addressed in the above commit, it might still be a good idea to add a cooldown period for unresponsive hosts to make uploads more efficient.