
Implement Hedging (Retry Slow Parts) for Upload #57

Merged: 47 commits merged into main on Oct 8, 2024

Conversation

waahm7 (Contributor) commented Oct 1, 2024

Description of changes:

  • Implements hedging (retrying slow parts) for upload. Based on my experiments, I didn't see any difference for download, so this PR only adds it for upload. We can easily add it to the download stack in the future if needed.
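Hedging here means tracking the latencies of completed part uploads and, once an in-flight request has been outstanding longer than a high percentile of recent completions, issuing a duplicate request and taking whichever finishes first. The PR uses tower's hedge middleware for this; the sketch below is only an illustrative, std-only rendition of the bookkeeping (function names are mine, not the crate's):

```rust
/// Illustrative sketch of hedging bookkeeping; the PR itself relies on
/// tower's hedge middleware rather than this hand-rolled version.
fn percentile(latencies_ms: &mut Vec<u64>, p: f64) -> u64 {
    latencies_ms.sort_unstable();
    let rank = ((p / 100.0) * (latencies_ms.len() - 1) as f64).round() as usize;
    latencies_ms[rank]
}

/// A part that has been in flight longer than the p95 of completed
/// parts is a candidate for a hedged (duplicate) request.
fn should_hedge(elapsed_ms: u64, completed: &mut Vec<u64>, p: f64) -> bool {
    !completed.is_empty() && elapsed_ms > percentile(completed, p)
}

fn main() {
    let mut completed = vec![100, 110, 105, 120, 900]; // one slow outlier
    assert!(should_hedge(950, &mut completed, 95.0));   // slower than p95: hedge
    assert!(!should_hedge(130, &mut completed, 95.0));  // typical latency: don't
}
```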

Here are the benchmarks with and without hedging for Upload:

60GB Upload with 250 concurrency:
With hedging:
Run:1 Secs:16.674105 Gb/s:30.909969
Run:2 Secs:16.062010 Gb/s:32.087895
Run:3 Secs:16.787110 Gb/s:30.701895
Run:4 Secs:15.197997 Gb/s:33.912106
Run:5 Secs:16.218516 Gb/s:31.778250
Run:6 Secs:16.623848 Gb/s:31.003417
Run:7 Secs:16.910915 Gb/s:30.477126
Run:8 Secs:15.788846 Gb/s:32.643048
Run:9 Secs:15.869860 Gb/s:32.476409
Run:10 Secs:16.757620 Gb/s:30.755923
Without Hedging:
Run:1 Secs:17.601898 Gb/s:29.280711
Run:2 Secs:16.084260 Gb/s:32.043505
Run:3 Secs:17.486960 Gb/s:29.473167
Run:4 Secs:17.730397 Gb/s:29.068502
Run:5 Secs:17.425133 Gb/s:29.577741
Run:6 Secs:16.279045 Gb/s:31.660093
Run:7 Secs:16.883302 Gb/s:30.526971
Run:8 Secs:17.831837 Gb/s:28.903140
Run:9 Secs:17.302441 Gb/s:29.787478
Run:10 Secs:17.429089 Gb/s:29.571028

30 GB with 250 concurrency:
With Hedging:
Run:1 Secs:9.298489 Gb/s:27.713968
Run:2 Secs:9.620810 Gb/s:26.785483
Run:3 Secs:8.474216 Gb/s:30.409661
Run:4 Secs:8.710808 Gb/s:29.583712
Run:5 Secs:8.909990 Gb/s:28.922372
Run:6 Secs:8.086036 Gb/s:31.869516
Run:7 Secs:7.887725 Gb/s:32.670770
Run:8 Secs:9.721999 Gb/s:26.506692
Run:9 Secs:8.053054 Gb/s:32.000038
Run:10 Secs:8.440625 Gb/s:30.530680

Without Hedging:
Run:1 Secs:11.358797 Gb/s:22.687090
Run:2 Secs:10.803229 Gb/s:23.853797
Run:3 Secs:11.495138 Gb/s:22.418003
Run:4 Secs:9.150468 Gb/s:28.162280
Run:5 Secs:10.896210 Gb/s:23.650245
Run:6 Secs:10.624905 Gb/s:24.254150
Run:7 Secs:10.698735 Gb/s:24.086777
Run:8 Secs:9.340982 Gb/s:27.587897
Run:9 Secs:11.398385 Gb/s:22.608294
Run:10 Secs:10.350546 Gb/s:24.897047
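Averaging the runs above, hedging improves mean throughput from roughly 30.0 to 31.7 Gb/s (about 6%) for the 60 GB case and from roughly 24.4 to 29.7 Gb/s (about 22%) for the 30 GB case. A quick check for the 60 GB numbers:

```rust
/// Mean of a slice of throughput samples.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn main() {
    // Gb/s figures copied from the 60 GB benchmark runs above.
    let hedged = [30.909969, 32.087895, 30.701895, 33.912106, 31.778250,
                  31.003417, 30.477126, 32.643048, 32.476409, 30.755923];
    let plain = [29.280711, 32.043505, 29.473167, 29.068502, 29.577741,
                 31.660093, 30.526971, 28.903140, 29.787478, 29.571028];
    let gain_pct = (mean(&hedged) / mean(&plain) - 1.0) * 100.0;
    println!("60 GB: {:.2} vs {:.2} Gb/s (+{:.1}%)",
             mean(&hedged), mean(&plain), gain_pct);
    assert!(gain_pct > 5.0 && gain_pct < 7.0);
}
```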

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

# Conflicts:
#	aws-s3-transfer-manager/Cargo.toml
#	aws-s3-transfer-manager/src/middleware.rs
#	aws-s3-transfer-manager/src/operation/upload/service.rs
waahm7 requested a review from a team as a code owner on October 1, 2024 at 21:43.
```rust
pub(crate) fn new(policy: P) -> Self {
    Self {
        policy,
        latency_percentile: 95.0,
```
Contributor:
Probably define these as constants, and maybe add some comments on how/why they were chosen.

Contributor Author:
Thanks, I have moved it to constants and added docs.
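As a hypothetical sketch of what that change can look like (constant names and doc text here are illustrative, not copied from the PR):

```rust
/// Requests slower than this latency percentile are hedged.
/// A high percentile like 95.0 means only the slowest ~5% of parts
/// trigger a duplicate request, limiting extra traffic.
const LATENCY_PERCENTILE: f32 = 95.0;

/// Minimum number of completed requests before hedging kicks in,
/// so the percentile estimate is based on enough samples.
const MIN_DATA_POINTS: u64 = 20;

fn main() {
    assert!(LATENCY_PERCENTILE > 0.0 && LATENCY_PERCENTILE < 100.0);
    assert!(MIN_DATA_POINTS > 0);
}
```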

(Resolved thread on aws-s3-transfer-manager/src/middleware/hedge.rs)
```rust
let svc = ServiceBuilder::new()
    .layer(concurrency_limit)
    // FIXME - This setting will need to be globalized.
    .buffer(ctx.handle.num_workers())
```
Contributor:
Question: Given that this doesn't actually place requests on the wire yet, is there a reasonable constant that would make sense here?

Contributor Author:
The buffer layer docs mention that:

```rust
/// When [`Buffer`]'s implementation of [`poll_ready`] returns [`Poll::Ready`], it reserves a
/// slot in the channel for the forthcoming [`call`]. However, if this call doesn't arrive,
/// this reserved slot may be held up for a long time. As a result, it's advisable to set
/// `bound` to be at least the maximum number of concurrent requests the [`Buffer`] will see.
/// If you do not, all the slots in the buffer may be held up by futures that have just called
/// [`poll_ready`] but will not issue a [`call`], which prevents other senders from issuing new
/// requests.
```

So I thought that the maximum number of concurrent requests was a good value here.
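The practical consequence: if the bound is smaller than the number of concurrent senders, all slots can be held at once and additional senders stall. A std-only illustration with a bounded channel (tower's Buffer is channel-backed internally, though its real API differs from this):

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // A bound of 2, like a Buffer sized below its number of senders.
    let (tx, rx) = sync_channel::<u32>(2);
    tx.try_send(1).unwrap();
    tx.try_send(2).unwrap();
    // Both slots are now held; a third sender cannot make progress.
    assert!(tx.try_send(3).is_err());
    // Draining one request frees a slot again.
    assert_eq!(rx.recv().unwrap(), 1);
    assert!(tx.try_send(3).is_ok());
}
```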

```rust
svc.map_err(|err| {
    let e = err
        .downcast::<error::Error>()
        .unwrap_or_else(|err| Box::new(error::Error::new(error::ErrorKind::RuntimeError, err)));
```
Contributor:

Question: Do you need to box this?

Contributor Author:
I am not sure how to get rid of the box here: downcast returns a Box<Error>, and the Box::into_inner API is unstable, so I had to box the other branch to give both branches the same type.
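The pattern in question, reduced to std only (HedgeError below is a hypothetical stand-in for the crate's error::Error): downcast either yields a Box of the concrete type directly, or hands back the original boxed error, which must be wrapped and re-boxed so both arms have the same type.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for the crate's error::Error type.
#[derive(Debug)]
struct HedgeError(String);

impl fmt::Display for HedgeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}
impl Error for HedgeError {}

/// Both branches must produce Box<HedgeError>; since Box::into_inner
/// is unstable, the fallback branch wraps and re-boxes.
fn normalize(err: Box<dyn Error + Send + Sync>) -> Box<HedgeError> {
    err.downcast::<HedgeError>()
        .unwrap_or_else(|other| Box::new(HedgeError(other.to_string())))
}

fn main() {
    // Already the concrete type: downcast succeeds.
    let known: Box<dyn Error + Send + Sync> = Box::new(HedgeError("slow part".into()));
    assert_eq!(normalize(known).0, "slow part");

    // A foreign error: downcast fails, so it gets wrapped and re-boxed.
    let unknown: Box<dyn Error + Send + Sync> = "timeout".into();
    assert_eq!(normalize(unknown).0, "timeout");
}
```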

```rust
#[derive(Debug, Clone)]
pub(crate) struct UploadPartPolicy;

impl Policy<UploadPartRequest> for UploadPartPolicy {
```
Contributor:
Trivial: you could probably just make this the default policy, since downloads would look the same, and move it into the hedge module directly. If we ever need to differentiate we could, but then you could just do something like:

```rust
let svc = ServiceBuilder::new()
    ...
    .layer(hedge::Builder::default().into_layer())
```

Contributor Author:
Thanks, updated it to the default policy.
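A sketch of what a default, request-agnostic policy can look like. This mirrors the shape of a hedging policy (decide whether a request is retryable and how to clone it for the hedged attempt) but the trait and type names here are illustrative, not tower's exact API:

```rust
/// Illustrative mirror of a hedging policy trait: decide whether a
/// request is safe to retry and how to clone it for the hedged attempt.
trait Policy<Request> {
    /// Return a clone to send as the hedged request, or None to skip hedging.
    fn clone_request(&self, req: &Request) -> Option<Request>;
    /// Whether this request is safe to retry at all (e.g. idempotent).
    fn can_retry(&self, req: &Request) -> bool;
}

/// Default policy: every request is retryable and cloneable, which is
/// what both upload-part and download-part requests would need.
#[derive(Debug, Clone, Default)]
struct DefaultPolicy;

impl<R: Clone> Policy<R> for DefaultPolicy {
    fn clone_request(&self, req: &R) -> Option<R> {
        Some(req.clone())
    }
    fn can_retry(&self, _req: &R) -> bool {
        true
    }
}

fn main() {
    let policy = DefaultPolicy;
    assert!(policy.can_retry(&"part-1"));
    assert_eq!(policy.clone_request(&"part-1"), Some("part-1"));
}
```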

```rust
ServiceBuilder::new().layer(concurrency_limit).service(svc)

let svc = ServiceBuilder::new()
    .layer(concurrency_limit)
```
graebm (Contributor) commented Oct 7, 2024:
The extra "hedging" request should count extra towards the concurrency limit. If this is hard to do, let's do it in a followup task.

Even better would be if hedged requests were lower priority and only got a ticket when there's no other work to do, since they really only help when a workload is nearly done and all that's left is the slow requests. Otherwise they're just stealing a ticket from another request that would be more useful.

Contributor Author:
I have looked into this, and it seems like the hedged requests can bypass the concurrency_limit layer; reorganizing the layers had no effect. I have added a TODO to fix this for now, since I want to get this merged and see if there is any difference in benchmarks.

Another interesting thing I noticed is that we are hedging very few requests: for a 30GB upload with 125 concurrency, we only hedged about 5-15 requests. We can revisit the numbers once we have more tracing/metrics in place.

waahm7 merged commit 31550a0 into main on Oct 8, 2024 (14 checks passed).
waahm7 deleted the waqar/upload-perf-buffer branch on October 8, 2024 at 21:11.