
Retrofit of upload logic #178

Merged (3 commits merged into master on Mar 4, 2016)
Conversation

@kofalt (Contributor) commented Mar 1, 2016

This constitutes a significant overhaul of file upload processing into logic that can be deduplicated, made consistent, and eventually optimized.

Two of four upload scenarios are covered by this change: targeted (files uploaded to a single container), and a new mode, packfile (files bundled into a single zip archive). Connector and engine uploads have not been modified, but clear groundwork is laid for that transition.

Relates to #159.
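
For orientation, here is a minimal sketch of the shape this unification implies. Strategy.targeted and Placer.check appear elsewhere in this review; the remaining names (process_file_field, finalize, and the constructor signature) are illustrative assumptions, not necessarily the identifiers used in this PR.

class Strategy(object):
    # Upload strategies; targeted and packfile are the two wired up in this PR.
    targeted = 'targeted'   # files go to a single existing container
    packfile = 'packfile'   # files are bundled into a single zip archive


class Placer(object):
    """Per-strategy object that decides where uploaded files end up."""

    def __init__(self, container, metadata):
        # Construction only records state; validation happens in check().
        self.container = container
        self.metadata = metadata

    def check(self):
        """Validate the request before any file is processed."""
        raise NotImplementedError()

    def process_file_field(self, field):
        """Handle one uploaded file."""
        raise NotImplementedError()

    def finalize(self):
        """Return the response payload once all files are placed."""
        raise NotImplementedError()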

@kofalt force-pushed the upload-unification branch 2 times, most recently from 1cccadb to c79eda2 on March 1, 2016 at 23:08
@gsfr (Member) commented Mar 2, 2016

Ran out of time today. Will review tomorrow.

MUST send metadata about the files     |   | X |   | X
Creates a packfile from uploaded files |   |   |   | X
"""
Member (commented on the diff):

"connector" is a Flywheel marketing term. The correct scitran terminology is "reaper". Please change globally.

Contributor (Author) replied:

Fixed.

@gsfr (Member) commented Mar 2, 2016

Looks good overall. Little confused by the two commits.

I know we've talked about various optimization opportunities, but how usable is the current implementation? Is it usable for a realistic dataset on a realistic storage system (not an idle SSD)?

@rentzso: Please also take a look. If you see obvious de-dup opportunities, especially with your existing code, please note them in another ticket.

Maybe we could all get together on Friday to discuss next steps. Optimization, de-dup, schema unification.

@kofalt (Contributor, Author) commented Mar 2, 2016

> Little confused by the two commits.

Yeah, that was meant to be squashed. I'll leave it until merge to avoid disturbing the review threads.

> I know we've talked about various optimization opportunities, but how usable is the current implementation? Is it usable for a realistic dataset on a realistic storage system (not an idle SSD)?

Not a clue; I think the easiest way to test is going to be in the browser, or alternatively by generating a very long curl command. This version doesn't allow repeated form fields (accessing a form field name that was repeated returns a list, and I don't handle that case), so it's not as simple as file=@a, file=@b, etc. I've passed that URL to Dan for testing.
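
For reference, a test upload with distinct field names could be built in Python with requests; a sketch follows, in which the URL, field names, and file names are placeholders rather than the actual API route.

import requests

# Placeholder endpoint; substitute the real upload route and authentication.
url = 'https://localhost:8443/api/upload/packfile'

# One distinct field name per file, since repeated field names are not handled.
files = [
    ('file1', open('scan_a.nii.gz', 'rb')),
    ('file2', open('scan_b.nii.gz', 'rb')),
]

resp = requests.post(url, files=files, verify=False)
print(resp.status_code, resp.text)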

@gsfr (Member) commented Mar 2, 2016

@dpuccetti: Please let us know what you find?

if strategy == Strategy.targeted and len(file_fields) > 1:
    raise Exception("Targeted uploads can only send one file")

for field in file_fields:
Contributor (commented on the diff):

Are we processing all the file information in the metadata?
It should be possible in the engine to send metadata about a file without sending the binary.

Contributor (Author) replied:

That's a great point.

The two use cases hooked up in this PR, Targeted and Packfile, will work as designed: they will always send files. When I hook up Engine (ref #159), the metadata processing should probably live in EnginePlacer.check(), so that metadata is processed even if no files were uploaded.

Summary: no change required now; this will be kept in mind for #159.
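
As a rough sketch of that future direction, building on the Placer sketch above (EnginePlacer is referenced in this thread, but the body and the process_metadata helper below are assumptions for illustration only):

class EnginePlacer(Placer):

    def check(self):
        # Process file-level metadata up front, so an engine upload that
        # carries metadata but no binaries is still handled (ref #159).
        if self.metadata is not None:
            self.process_metadata(self.metadata)

    def process_metadata(self, metadata):
        # Hypothetical helper: validate and stage per-file metadata here.
        pass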

Contributor replied:

Sounds good. Just thinking that a more appropriate name for Placer.check could be Placer.initialize.

@kofalt force-pushed the upload-unification branch from 0fe75d1 to 26b83a1 on March 2, 2016 at 23:01
@kofalt (Contributor, Author) commented Mar 2, 2016

All comments addressed; rebased and squashed. Ready for merge.

@rentzso (Contributor) commented Mar 2, 2016

LGTM
Please consider renaming Placer.check to Placer.initialize

@gsfr (Member) commented Mar 3, 2016

Shouldn't the check() code simply go into __init__()?

@kofalt (Contributor, Author) commented Mar 3, 2016

I chose not to do that because I need control over when that function is called, which may be important for a few things in #159. It also reduces the likelihood that I'll have to break the interface while Renzo is working on things.

I'll add a point at the end of #159 to consider changing the naming.
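
A sketch of that two-phase setup, with construction kept cheap and check() invoked explicitly by the upload handler (TargetedPlacer, process_file_field, and finalize are assumed names from the sketch above, not necessarily the real identifiers):

# Construction only records state; the handler decides when validation runs.
placer = TargetedPlacer(container, metadata)

# The handler can inspect the request and pick a strategy before this point.
placer.check()

for field in file_fields:
    placer.process_file_field(field)

response = placer.finalize()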

@kofalt self-assigned this on Mar 3, 2016
}

# Get or create a session based on the hierarchy and provided labels.
s = {
Contributor (commented on the diff):

Just noticing that the packfile placer is not setting the group on the session. This is a required field according to the data model.

Contributor (Author) replied:

Wait... okay, help me reconcile that with the schemas.
The input session schema has no group key, but the mongo session schema does?

Are you asking that I grab the group ID from the project object and place it on the session object? Should I also do that to the acquisition?

Contributor replied:

Yeah, you need to get the group id from the project object. The input session schema doesn't include the group because you only need the parent project to create a session inside it. I don't remember the reason why we need to store the group id in the session (Cc @gsfr) but it is a required field.

The acquisition doesn't need any other key.

Member replied:

> I don't remember the reason why we need to store the group id in the session (Cc @gsfr) but it is a required field.

The field is needed to easily filter the sessions list by group.

@kofalt (Contributor, Author) commented Mar 4, 2016

@rentzso Check out 4fdbb93, which adds group id to the upserted session.
If that looks good to you then I'll merge :)
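
The change is presumably along these lines (a sketch only, not the actual contents of 4fdbb93; the surrounding field names are assumptions):

# Copy the required group id from the parent project onto the session
# document before upserting it; the other fields here are placeholders.
s = {
    'project': project['_id'],
    'group':   project['group'],   # required by the mongo session schema
    'label':   session_label,
    # ... remaining session fields ...
}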

@rentzso (Contributor) commented Mar 4, 2016

LGTM!

kofalt added a commit that referenced this pull request on Mar 4, 2016
@kofalt merged commit 4402735 into master on Mar 4, 2016
@kofalt deleted the upload-unification branch on March 4, 2016 at 19:28