
naive mpu #2

Merged 11 commits into main on Jul 18, 2024
Conversation

@aajtodd (Contributor) commented Jul 17, 2024

Description of changes:

This PR completes a naive / strawman implementation of multipart upload.

Main components:

  • Uploader - implements multipart upload
    • Takes the input stream and calculates the part size using the size hint, then creates N workers that simply loop until no more parts are available from the part reader. Returns an UploadHandle to the caller after the MPU is started.
  • UploadHandle - returned to the caller to control an upload (e.g. pause/resume/progress/wait for completion). Currently CompleteMultipartUpload is invoked by "joining" the handle. Joining involves waiting for all of the parts to be uploaded and then completing the call.
  • ConcurrencySetting/TargetPartSize - new common settings, introduced and refactored here, for expressing concurrency and part size with an "auto" mode as the default. Right now these just default to hard-coded constants, but in the future this leaves us room to introduce whatever logic we want.
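The "auto" settings described above could be sketched roughly as follows. This is an illustrative sketch only: the names follow the PR, but the defaults and the `resolve` methods are placeholders standing in for whatever logic `Auto` grows later, not the crate's actual values.

```rust
// Sketch of the "auto" settings described above; defaults are
// illustrative placeholders, not the crate's actual values.
#[derive(Debug, Clone, Copy, Default)]
pub enum ConcurrencySetting {
    /// Choose concurrency based on the execution environment.
    #[default]
    Auto,
    /// Use exactly this many workers.
    Explicit(usize),
}

#[derive(Debug, Clone, Copy, Default)]
pub enum TargetPartSize {
    /// Choose an optimal target part size automatically.
    #[default]
    Auto,
    /// Explicitly configured part size in bytes.
    Explicit(u64),
}

// Hard-coded defaults standing in for future "auto" logic.
const DEFAULT_CONCURRENCY: usize = 8;
const DEFAULT_PART_SIZE_BYTES: u64 = 8 * 1024 * 1024;

impl ConcurrencySetting {
    pub fn resolve(self) -> usize {
        match self {
            Self::Auto => DEFAULT_CONCURRENCY,
            Self::Explicit(n) => n,
        }
    }
}

impl TargetPartSize {
    pub fn resolve(self) -> u64 {
        match self {
            Self::Auto => DEFAULT_PART_SIZE_BYTES,
            Self::Explicit(n) => n,
        }
    }
}
```

Keeping resolution behind a single `resolve` call is what leaves room to swap the hard-coded constants for environment-aware logic without touching callers.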

Future/Questions

  1. Checksums are ignored right now; we need a design for how we want checksums to work for both upload and download. I've tried to place TODO/FIXME comments where appropriate as it relates to this.
  2. Errors definitely need reworking at this point. In particular, I've placed TODO/FIXME comments around completing an upload, as we may end up with one or more part failures and possibly a failure related to aborting the upload as well. How this gets exposed to the user needs some thought. Most likely we want to move toward defining an UploadError struct with different error "kinds" that capture various states/composite failures.
  3. Single part upload (content length < min part size) is not implemented yet.
  4. More tests around various scenarios (e.g. failed part, failed abort while processing a failed part, etc.). I found our experimental mocks package a bit tedious to set up; I'm wondering whether this is still the direction to go, or whether we have criteria for when to leverage it vs. other testing strategies.
  5. Questions remain about when various upload request/response fields get set when splitting vs. not splitting the request. Some of the fields only apply to one case or the other.
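The composite UploadError idea from item 2 could look something like the sketch below. Everything here is hypothetical: the kind names, fields, and builder methods are illustrative only, not a proposed final API.

```rust
// Hypothetical sketch of a composite UploadError with error "kinds";
// names and fields are illustrative only.
#[derive(Debug)]
pub enum UploadErrorKind {
    /// One or more individual part uploads failed.
    PartFailure,
    /// The CompleteMultipartUpload call failed.
    CompleteMultipartUpload,
    /// Aborting the upload after an earlier failure also failed.
    AbortMultipartUpload,
}

#[derive(Debug)]
pub struct UploadError {
    kind: UploadErrorKind,
    /// Underlying causes; may hold several errors when multiple parts fail.
    sources: Vec<Box<dyn std::error::Error + Send + Sync>>,
}

impl UploadError {
    pub fn new(kind: UploadErrorKind) -> Self {
        Self { kind, sources: Vec::new() }
    }

    /// Attach an underlying cause (can be called multiple times).
    pub fn with_source(mut self, err: Box<dyn std::error::Error + Send + Sync>) -> Self {
        self.sources.push(err);
        self
    }

    pub fn kind(&self) -> &UploadErrorKind {
        &self.kind
    }

    pub fn sources(&self) -> &[Box<dyn std::error::Error + Send + Sync>] {
        &self.sources
    }
}
```

Holding a `Vec` of sources rather than a single cause is one way to represent the "part failure plus failed abort" composite case called out above.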

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aajtodd aajtodd requested a review from a team as a code owner July 17, 2024 16:01
aws-s3-transfer-manager/src/types.rs
Comment on lines 8 to 14

pub enum TargetPartSize {
    /// Automatically configure an optimal target part size based on the execution environment.
    #[default]
    Auto,

    /// Explicitly configured part size.
    Explicit(u64),
}
Contributor
debatable: Seems weird to have an "Explicit" option that might actually get adjusted?

Maybe name it like:

  • PartSize::Auto
  • PartSize::Target(u64): gets adjusted if necessary

and if some future person ever really wants to get explicit we could add:

  • PartSize::Explicit(u64): never adjusted, causes some configuration error if it's too large or too small
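The proposed naming could be sketched roughly as below. This is only an illustration of the reviewer's suggestion, not crate code; the 5 MiB minimum and 10,000-part maximum are S3's documented multipart limits, while the `resolve` helper and its adjustment logic are assumptions.

```rust
// Sketch of the variants proposed above; illustrative only.
const MIN_PART_SIZE_BYTES: u64 = 5 * 1024 * 1024; // S3 minimum part size
const MAX_PARTS: u64 = 10_000; // S3 maximum part count

#[derive(Debug, Clone, Copy, Default)]
pub enum PartSize {
    /// Choose a part size based on the execution environment.
    #[default]
    Auto,
    /// Target part size; adjusted if necessary (below the 5 MiB minimum,
    /// or when the object would otherwise exceed 10,000 parts).
    Target(u64),
}

impl PartSize {
    /// Resolve a usable part size for an object of `content_length` bytes.
    pub fn resolve(self, content_length: u64) -> u64 {
        let requested = match self {
            PartSize::Auto => MIN_PART_SIZE_BYTES,
            PartSize::Target(sz) => sz,
        };
        // round up to the S3 minimum part size
        let sz = requested.max(MIN_PART_SIZE_BYTES);
        // grow the part size if it would produce more than 10,000 parts
        let min_for_count = (content_length + MAX_PARTS - 1) / MAX_PARTS;
        sz.max(min_for_count)
    }
}
```

Under this shape, a future `PartSize::Explicit(u64)` variant would skip the adjustments and instead fail configuration validation when out of range.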

@aajtodd (Contributor Author) commented Jul 18, 2024
I can see your point but we are only adjusting for min part size IIRC. Is this distinction going to be worth it? Do we think we'd adjust it anywhere else?

edit: Right we adjust if it's more than 10k parts as well 🤔

@aajtodd (Contributor Author)

Also the name of the type is TargetPartSize already which conveys this IMO.

aws-s3-transfer-manager/src/upload.rs
@graebm (Contributor) left a comment

I am super annoying. We can address the naming crap another day when we get a grip on all our settings

///
/// The minimum part size is 5 MiB; any part size less than that will be rounded up.
/// Default is [PartSize::Auto]
pub fn multipart_threshold_part_size(self, multipart_threshold_part_size: PartSize) -> Self {
Contributor
more annoying name suggestion: multipart_threshold_bytes or just multipart_threshold

  • having "part" in the name twice is weird, other than we're re-using the PartSize enum. It seems annoying to make another enum just for this setting, but ... is that how it's done?
  • seems nice to use the term "bytes", as the SEP does? I see us clarifying it in some places like MIN_PART_SIZE_BYTES

Contributor Author

I don't think we need another enum here but we can live with it for a bit and see what we think. Changed to multipart_threshold(...) as suggested.

The _bytes suffix is usually used wherever we have raw u64 values. We could probably make some simple abstraction like StorageUnit or ByteSize that has conversions between bytes/MB/MiB/etc.; it may be useful for throughput calculations as well. Leaving for now though.
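A ByteSize abstraction along those lines could be as simple as a newtype over u64. This is a hypothetical sketch of the idea floated above, not an API proposal; the constructor and accessor names are assumptions.

```rust
// Hypothetical ByteSize newtype along the lines mentioned; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ByteSize(u64);

impl ByteSize {
    pub const fn bytes(n: u64) -> Self {
        ByteSize(n)
    }
    pub const fn kib(n: u64) -> Self {
        ByteSize(n * 1024)
    }
    pub const fn mib(n: u64) -> Self {
        ByteSize(n * 1024 * 1024)
    }
    pub const fn as_u64(self) -> u64 {
        self.0
    }
    /// Whole mebibytes, truncating any remainder.
    pub const fn as_mib(self) -> u64 {
        self.0 / (1024 * 1024)
    }
}
```

Deriving `Ord` on the raw byte count means sizes compare correctly regardless of the unit they were constructed from, which is most of the value over passing bare `u64`s around.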

///
/// [`multipart_threshold_part_size`]: method@Self::multipart_threshold_part_size
/// [`PutObject`]: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
pub fn target_part_size(self, part_size: PartSize) -> Self {
@graebm (Contributor) commented Jul 18, 2024

oh god I'm annoying: just call this part_size() now? part_size_bytes?

@aajtodd aajtodd merged commit fc227cc into main Jul 18, 2024
14 of 15 checks passed
@aajtodd aajtodd deleted the atodd/naive-mpu branch July 18, 2024 18:51
Ok(mut completed_parts) => {
    all_parts.append(&mut completed_parts);
}
// TODO(aws-sdk-rust#1159, design) - do we want to return first error or collect all errors?
Contributor

A good design question. I suspect both have legit use cases, so it's conceivable we end up giving customers an option for which error report mode they like to use.
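The two reporting modes could be exposed as a customer-facing option along these lines. A hypothetical sketch only: the policy name, the join helper, and the use of strings in place of real part/error types are all assumptions for illustration.

```rust
// Sketch of the two error-reporting modes discussed; names are hypothetical,
// with strings standing in for real part/error types.
#[derive(Debug, Clone, Copy)]
pub enum FailurePolicy {
    /// Report only the first error encountered.
    FirstError,
    /// Collect every part failure before reporting.
    CollectAll,
}

/// Join per-part results according to the chosen policy.
pub fn join_parts(
    results: Vec<Result<u32, String>>,
    policy: FailurePolicy,
) -> Result<Vec<u32>, Vec<String>> {
    let mut parts = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        match result {
            Ok(part) => parts.push(part),
            Err(e) => {
                errors.push(e);
                if matches!(policy, FailurePolicy::FirstError) {
                    return Err(errors);
                }
            }
        }
    }
    if errors.is_empty() {
        Ok(parts)
    } else {
        Err(errors)
    }
}
```

Fail-fast keeps latency low on hopeless uploads, while collect-all gives better diagnostics for retry logic, which is why both plausibly deserve to exist behind a setting.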

/// * `handle` - The upload handle
/// * `stream` - The content to upload
/// * `content_length` - The upper bound on the content length
async fn try_start_mpu_upload(


Late reviewing this, but should the mpu in this be mp? Otherwise I would read it as try_start_multipart_upload_upload
