After asking Junpei to set the upload retries in PseudoKV to infinite and spending most of the day running upload tests with `aws s3 sync . /path` over thousands of tiny files, I noticed that PseudoKV can easily get stuck and choke on bad hosts indefinitely, and in the process upload many GB of garbage data (see screenshots below). For this upload run we selected 20 hosts out of 40 (20 extra hosts).
In our gateway, every time a chunk upload fails we get the following error:

```
2020-08-14T16:13:56.143Z WARN storage/bucket.go:253 failed to upload the content {"path": "/kvs-5/fileaabb", "request_id": "ab3c59fab06672a377cdabf06bf3a9bcd97323b9", "bucket": "kvs-5", "key": "fileaabb/null/1", "contractSetName": "aeb4b744-681d-4cf0-b239-6ec47582d771", "try": 1, "error": "failed to get the corresponding metafile: shard not fund: open /root/.config/storewise/gateway/metafiles/kvs-5/a74982c8-67a1-4dae-a366-f082c85bc5d0/STANDARD/metafiles/aeb4b744-681d-4cf0-b239-6ec47582d771/shard/0: no such file or directory"}
```
After doing some research into how siad handles unreachable hosts in its filesystem, it appears to use an exponential cooldown mechanism: a host that returns an error gets disabled for a certain duration, and if the host fails again the disable time increases exponentially. I don't think decaying values are necessary because this is mostly a stateless process: when the gateway is restarted, the disabled list starts over. This is fair because it's the metadata-server's responsibility to handle bad hosts and migration long-term. That said, it would make sense to communicate the failure rate (cooldown-height?) so that the gateway/metadata-server knows which hosts have the highest priority for migration. Moreover, providing this type of information allows the gateway to determine when it is best to replace the contract set with a new one and send the active set to migration.
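As a rough sketch of what such a cooldown tracker could look like in the gateway (all names and parameters below are hypothetical and not taken from siad or our code), something along these lines in Go would cover both the exponentially growing disable time and a failure count we could report to the metadata-server:

```go
package hostcooldown

import (
	"sync"
	"time"
)

// Tracker keeps an in-memory cooldown per host. Because nothing is
// persisted, restarting the gateway clears the list, matching the
// stateless behaviour described above.
type Tracker struct {
	mu    sync.Mutex
	base  time.Duration // cooldown after the first failure
	max   time.Duration // upper bound on the cooldown
	hosts map[string]*entry
}

type entry struct {
	failures int       // consecutive failures, cleared on success
	until    time.Time // host is skipped until this time
}

func NewTracker(base, max time.Duration) *Tracker {
	return &Tracker{base: base, max: max, hosts: make(map[string]*entry)}
}

// Fail records a failed upload and doubles the cooldown for every
// consecutive failure: base, 2*base, 4*base, ... capped at max.
func (t *Tracker) Fail(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.hosts[host]
	if !ok {
		e = &entry{}
		t.hosts[host] = e
	}
	e.failures++
	d := t.base << uint(e.failures-1)
	if d > t.max || d <= 0 { // also guards against shift overflow
		d = t.max
	}
	e.until = time.Now().Add(d)
}

// Succeed clears the host's cooldown after a successful upload.
func (t *Tracker) Succeed(host string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.hosts, host)
}

// Usable reports whether the host may be picked for the next chunk.
func (t *Tracker) Usable(host string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	e, ok := t.hosts[host]
	return !ok || time.Now().After(e.until)
}

// Failures exposes the consecutive-failure count so the gateway or
// metadata-server can prioritise the worst hosts for migration.
func (t *Tracker) Failures(host string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if e, ok := t.hosts[host]; ok {
		return e.failures
	}
	return 0
}
```

The upload path would then call Usable before picking a host for a chunk, Fail on every failed shard upload, and Succeed once a shard lands; keeping the state purely in memory preserves the stateless behaviour described above.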
Screenshot 1:

Screenshot 2:

A few GB later and no new files have been uploaded:
Logfile from our gateway: log.txt

Edit: Although the main issue is addressed in the above commit, it might still be a good idea to add a cooldown period for unresponsive hosts to make uploads more efficient.