PseudoKV host cooldown #104

Closed
MeijeSibbel opened this issue Aug 14, 2020 · 3 comments

MeijeSibbel commented Aug 14, 2020

After asking Junpei to set the upload retries in PseudoKV to infinite, I spent most of the day running upload tests with aws s3 sync . /path on thousands of tiny files. I noticed that PseudoKV can easily get stuck on bad hosts indefinitely, and in the process it uploads many GBs of garbage data (see the screenshots below). For this upload run we selected 20 hosts out of 40 (20 extra hosts).

In our gateway every time a chunk upload fails we get the following error:

2020-08-14T16:13:56.143Z        WARN  storage/bucket.go:253     failed to upload the content    {"path": "/kvs-5/fileaabb", "request_id": "ab3c59fab06672a377cdabf06bf3a9bcd97323b9", "bucket": "kvs-5", "key": "fileaabb/null/1", "contractSetName": "aeb4b744-681d-4cf0-b239-6ec47582d771", "try": 1, "error": "failed to get the corresponding metafile: shard not fund: open /root/.config/storewise/gateway/metafiles/kvs-5/a74982c8-67a1-4dae-a366-f082c85bc5d0/STANDARD/metafiles/aeb4b744-681d-4cf0-b239-6ec47582d771/shard/0: no such file or directory"}

After doing some research into how siad handles unreachable hosts in its filesystem, it appears to use an exponential cooldown mechanism: a host that returns an error is disabled for a certain duration, and if the host fails again, the disable time increases exponentially. I don't think decaying values are necessary, because this is mostly a stateless process: when the gateway is restarted, the disabled list starts over. This is fair because handling bad hosts and migration in the long term is the metadata-server's responsibility. That said, it would make sense to communicate the failure rate (cooldown height?) so that the gateway/metadata-server knows which hosts have the highest priority for migration. Moreover, this information would let the gateway determine when it is best to replace the contract set with a new one and send the active set to migration.

Screenshot 1:

image

Screenshot 2:

A few GB later and no new files have been uploaded:

image

Logfile from our gateway:

log.txt

@jkawamoto
Contributor

Hopefully, this commit 68b2c1b fixes the above missing shard error.

@MeijeSibbel MeijeSibbel changed the title PseudoKV host error handling PseudoKV host cooldown Aug 17, 2020
@MeijeSibbel
Author

Edit: Although the main issue is addressed in the above commit, it might still be a good idea to add a cooldown period for unresponsive hosts to make uploads more efficient.

@MeijeSibbel
Author

Migrated to #126.
