-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allowing Project Paths to be passed to scrunch.dataset.fork() method #463
Changes from 5 commits
c84a7e8
8d318df
5ed329e
c9f0995
0f4ca00
de721c2
b507aeb
ea2d8ec
45f356f
bd23e59
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,11 +15,6 @@ | |
pd = None | ||
|
||
|
||
if sys.version_info.major == 3: | ||
from collections.abc import Mapping | ||
else: | ||
from collections import Mapping | ||
|
||
import six | ||
|
||
import pycrunch | ||
|
@@ -51,8 +46,10 @@ | |
|
||
if six.PY2: # pragma: no cover | ||
from urlparse import urljoin | ||
from collections import Mapping | ||
else: | ||
from urllib.parse import urljoin | ||
from collections.abc import Mapping | ||
|
||
|
||
_MR_TYPE = 'multiple_response' | ||
|
@@ -2387,23 +2384,23 @@ def fork(self, description=None, name=None, is_published=False, | |
:param preserve_owner: bool, default=True | ||
If True, the owner of the fork will be the same as the parent | ||
dataset otherwise the owner will be the current user in the | ||
session and the Dataset will be set under `Persona Project` | ||
session and the Dataset will be set under `Personal Project`. | ||
:kwargs owner: str (optional) | ||
When preserve_owner=False, the owner of fork will be the set | ||
according to the passed Project URL or Folder path. | ||
|
||
:returns _fork: scrunch.datasets.BaseDataset | ||
""" | ||
from scrunch.mutable_dataset import MutableDataset | ||
nforks = len(self.resource.forks.index) | ||
|
||
if name is None: | ||
nforks = len(self.resource.forks.index) | ||
dsname = self.resource.body.name | ||
if six.PY2: | ||
name = "FORK #{} of {}".format( | ||
nforks + 1, | ||
self.resource.body.name.encode("ascii", "ignore")) | ||
else: | ||
name = "FORK #{} of {}".format( | ||
nforks + 1, | ||
self.resource.body.name) | ||
if description is None: | ||
description = self.resource.body.description | ||
dsname = dsname.encode("ascii", "ignore") | ||
name = "FORK #{} of {}".format(nforks + 1, dsname) | ||
|
||
description = description or self.resource.body.description | ||
|
||
body = dict( | ||
name=name, | ||
|
@@ -2412,14 +2409,22 @@ def fork(self, description=None, name=None, is_published=False, | |
**kwargs | ||
) | ||
|
||
owner = kwargs.get("owner") | ||
if preserve_owner: | ||
body['owner'] = self.resource.body.owner | ||
elif owner: | ||
body["owner"] = ( | ||
owner if owner.startswith("http") else get_project(owner).url | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These should be encouraged to be Crunch objects rather than strings. For projects don't allow strings, urls or paths. Because you'll have problems validating here. Have this be a Project instance. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. question: @jjdelc Double-checking on this - Isn't get_project() validating the string for us? It will return a
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, you're right. I missed that part of the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks @jjdelc - so a green light on allowing path strings? |
||
) | ||
|
||
# not returning a dataset | ||
payload = shoji_entity_wrapper(body) | ||
_fork = self.resource.forks.create(payload).refresh() | ||
# return a MutableDataset always | ||
fork_ds = MutableDataset(_fork) # Fork has same editor as current user | ||
return fork_ds | ||
try: | ||
_fork = self.resource.forks.create(payload).refresh() | ||
except TaskProgressTimeoutError as exc: | ||
_fork = exc.entity.wait_progress(exc.response).refresh() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you get a timeout, you should continue to raise, this is an invalid call with the exception. If it timed out it quite likely failed, and the code flow shouldn't continue bc you won't have a fork to carry on with. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. question: @jjdelc Do you refer here to catching the timeout inside the fork() method directly? Because it is common practice for users to catch the timeouts on forks with a progress tracking against a larger timeout. We simply do not want to have this inside the method here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What I mean is that if pycrunch raised a timeout, there's no recovery. Calling There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep - got it now, @jjdelc! Make sense. @shaikh-ma I think if we would want control the timeout AND also return the fork when it completed, we would need to do this:
I think this should work (at least I think that's what we do in other scenarios)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jjdelc Looks like handling the timeout seems to work in our example, as demonstrated by @shaikh-ma - what do you think? I guess this is the last open point, then we should be good? The test has also been added. Appreciate your time (when you can give it some 👀)! Thanks! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that. On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation. This is equivalent of saying It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together |
||
|
||
return MutableDataset(_fork) | ||
|
||
def replace_values(self, variables, filter=None, literal_subvar=False, timeout=60): | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they choose to preserve owner, what they mean is that it should be in the same place as the current project. The API already returns the project URL for the
owner
attribute. So this may also beself.resource.project
.And we can slowly stop the conversaion about owners, because that concept does not exist in the API anymore and will bring confusion to users that individuals have ownership of datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: @jjdelc I am a bit confused here - We want to keep the
preserve_owner
kwarg and add a new one,project
, explicitly to the signature by your suggestion; then basically fall back toself.resource.project
(simple code change to abandon the oldowner
property) ifproject
is not given? Just making sure we get this right!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, really at this point owner and project are equivalent. There is no concept of an individual owner in Crunch.
It looks like
preserve_owner
is the default behavior when no owner is passed in, if preserve_owner is False, and no owner is passed, then there's no explicit code path for this.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for confirming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjdelc , the
self.resource.project.self
fails when forking a dataset into the personal project., i.e., when the preserve_owner = True for a dataset in the personal project.Hence, I've kept it unchanged for now i.e,
body['owner'] = self.resource.body.owner
.Additionally, I've added the
project
parameter as suggested and also have added the test intest_dataset.py
:Also, I've added a few validations,