Allowing Project Paths to be passed to scrunch.dataset.fork() method #463
The intention of this ticket is correct; it is a good direction to move away from dataset "owners" and think of datasets as living inside a project.
But I don't think this implementation reflects that.
What I suggest is using a new argument called project and having it receive a Project instance (or maybe a project ID).
That way you can call dataset.fork(project=destination_project)
And I think that the default behavior should be:
def fork(self, ..., project=None, ...):
    project = project or self.project
This way, new forks without any location information would be created in the same project as the current dataset (since the user must be an editor on the dataset to issue a fork, they should also have edit permissions on the project to place the new fork).
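A minimal sketch of that default, for illustration only. The Project and Dataset classes here are hypothetical stand-ins rather than the scrunch classes, the URLs are made up, and fork() only builds the payload instead of calling the API:

```python
class Project:
    """Hypothetical stand-in for a project; it only carries a URL."""

    def __init__(self, url):
        self.url = url


class Dataset:
    """Hypothetical stand-in for a dataset that lives inside a project."""

    def __init__(self, name, project):
        self.name = name
        self.project = project

    def fork(self, name=None, project=None):
        # Fall back to the current dataset's project when none is given.
        project = project or self.project
        # Only build the payload here; a real client would send it to the API.
        return {
            "name": name or "FORK of %s" % self.name,
            "owner": project.url,
        }


home = Project("https://example.invalid/api/projects/abc/")
other = Project("https://example.invalid/api/projects/xyz/")
ds = Dataset("ds name", home)

same_place = ds.fork()            # lands in the dataset's own project
moved = ds.fork(project=other)    # lands in the explicitly given project
```

With this shape, callers never pass URLs or owners; the location is always expressed as a project object.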
Bonus points: remove the owner stuff completely. It is not an official use case anymore; is anybody using it?
This sounds like a great idea @jjdelc! 🙂 Initially, @alextanski and I had a similar thought to add a new parameter "project". But we had a few concerns about it then:
Hence, the current changes were aimed to not change the current behaviour of the As you've now suggested to remove the "owner" completely, just a few questions:
Thanks
You make a good point about backwards compatibility, and I think it should be easy to handle inside the method. For now, you may need to keep the
My other point is not to require the user to send URLs, but to allow a Project() instance to be received. I would like to support arbitrary paths too, but you risk that the path may not exist and then have to take care of creating it and such; best to have that responsibility outside of this
Also, add a test checking your payload.
scrunch/datasets.py
Outdated
if preserve_owner:
    body['owner'] = self.resource.body.owner
elif owner:
    body["owner"] = (
        owner if owner.startswith("http") else get_project(owner).url
These should be encouraged to be Crunch objects rather than strings. For projects, don't allow strings, URLs, or paths, because you'll have problems validating them here.
Have this be a Project instance.
question: @jjdelc Double-checking on this - Isn't get_project() validating the string for us? It will return a Project instance if the path can be resolved, and raise an InvalidPathError if it cannot:
InvalidPathError Traceback (most recent call last)
Cell In[8], line 7
4 ygdc_crunch.Crunch()
6 name_pr = "TechOps|Alex 3"
----> 7 pr = scrunch.get_project(name_pr)
File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:206, in get_project(project, connection)
203 _project = Project(ret)
205 if sub_project:
--> 206 _project = _project.get(sub_project)
208 return _project
File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:552, in Project.get(self, path)
550 for p_name in Path(path).get_parts():
551 try:
--> 552 node = node.get_child(p_name)
553 except KeyError:
554 raise InvalidPathError('Project not found %s' % p_name)
File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:571, in Project.get_child(self, name)
568 return Project(tup.entity)
569 return self.root.dataset[name]
--> 571 raise InvalidPathError('Invalid path: %s' % name)
InvalidPathError: Invalid path: Alex 3
Ah, you're right. I missed that part of the if; then yes, this is correct.
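For illustration, the branch under discussion can be sketched as below. This is an assumption-laden stand-in, not the scrunch code: the project lookup is a plain dictionary, get_project_url and resolve_owner are hypothetical helper names, and the path "TechOps|Alex 2" and both URLs are made up.

```python
class InvalidPathError(ValueError):
    """Stand-in for the error shown in the traceback above."""


# Hypothetical project tree: path string -> project URL.
_PROJECTS = {
    "TechOps|Alex 2": "https://example.invalid/api/projects/p2/",
}


def get_project_url(path):
    """Hypothetical lookup that fails loudly on an unknown path."""
    try:
        return _PROJECTS[path]
    except KeyError:
        raise InvalidPathError("Invalid path: %s" % path)


def resolve_owner(owner):
    """Return a project URL for either a URL or a project path."""
    # URLs pass through untouched; anything else is treated as a path.
    if owner.startswith("http"):
        return owner
    return get_project_url(owner)


from_url = resolve_owner("https://example.invalid/api/projects/p9/")
from_path = resolve_owner("TechOps|Alex 2")
```

The validation concern is covered only for the path branch: a string that starts with "http" is passed along unchecked.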
Thanks @jjdelc - so a green light on allowing path strings?
scrunch/datasets.py
Outdated
@@ -2412,14 +2409,22 @@ def fork(self, description=None, name=None, is_published=False,
            **kwargs
        )

        owner = kwargs.get("owner")
        if preserve_owner:
            body['owner'] = self.resource.body.owner
If they choose to preserve owner, what they mean is that the fork should be in the same place as the current project. The API already returns the project URL for the owner attribute, so this may also be self.resource.project.
And we can slowly stop the conversation about owners, because that concept does not exist in the API anymore and it will confuse users into thinking that individuals have ownership of datasets.
question: @jjdelc I am a bit confused here - We want to keep the preserve_owner kwarg and add a new one, project, explicitly to the signature by your suggestion; then basically fall back to self.resource.project (simple code change to abandon the old owner property) if project is not given? Just making sure we get this right!
Yes, really at this point owner and project are equivalent. There is no concept of an individual owner in Crunch.
It looks like preserve_owner is the default behavior when no owner is passed in; if preserve_owner is False and no owner is passed, then there's no explicit code path for this.
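To make the code paths concrete, here is a small sketch that mirrors the if/elif order of the diff. The function name fork_owner and the URLs are hypothetical, and returning None is only this sketch's way of saying that no branch sets body['owner']:

```python
def fork_owner(current_owner, preserve_owner=False, owner=None):
    """Return the value for body['owner'], or None when no branch sets it."""
    if preserve_owner:
        # Same place as the dataset being forked.
        return current_owner
    elif owner:
        # Explicit destination.
        return owner
    # Neither flag nor owner: the diff has no explicit branch for this case.
    return None


current = "https://example.invalid/api/projects/abc/"
target = "https://example.invalid/api/projects/xyz/"

# Keys are (preserve_owner, owner) combinations.
cases = {
    (True, None): fork_owner(current, preserve_owner=True),
    (True, target): fork_owner(current, preserve_owner=True, owner=target),
    (False, target): fork_owner(current, owner=target),
    (False, None): fork_owner(current),
}
```

Note that with this branch order preserve_owner wins over an explicit owner, and the last case leaves the location to whatever the API does when the key is absent.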
Thanks for confirming.
@jjdelc, self.resource.project.self fails when forking a dataset into the personal project, i.e., when preserve_owner = True for a dataset in the personal project.
Hence, I've kept it unchanged for now, i.e., body['owner'] = self.resource.body.owner.
Additionally, I've added the project parameter as suggested and have also added the test in test_dataset.py:
Also, I've added a few validations,
scrunch/datasets.py
Outdated
try:
    _fork = self.resource.forks.create(payload).refresh()
except TaskProgressTimeoutError as exc:
    _fork = exc.entity.wait_progress(exc.response).refresh()
If you get a timeout, you should continue to raise; this call on the exception is invalid. If it timed out, it quite likely failed, and the code flow shouldn't continue because you won't have a fork to carry on with.
question: @jjdelc Do you refer here to catching the timeout inside the fork() method directly? Because it is common practice for users to catch the timeouts on forks with a progress tracking against a larger timeout. We simply do not want to have this inside the method here?
What I mean is that if pycrunch raised a timeout, there's no recovery. Calling .wait_progress with the response of the timeout is not going to work, and it won't achieve anything (maybe some Python error), but this line looks like it is attempting some fallback way to recover from the timeout and still obtain a _fork instance from it.
Yep - got it now, @jjdelc! Makes sense. @shaikh-ma I think if we want to control the timeout AND also return the fork when it has completed, we would need to do this:
... < catch & wait >
entity = exc.entity.refresh()
fork_ds = scrunch.get_mutable_dataset(entity.body.id)
I think this should work (at least I think that's what we do in other scenarios)?
@jjdelc Looks like handling the timeout seems to work in our example, as demonstrated by @shaikh-ma - what do you think? I guess this is the last open point, then we should be good? The test has also been added.
Appreciate your time (when you can give it some 👀)! Thanks!
So this looks like it is continuing to wait after the client has timed out. Say pycrunch's timeout is 2m and the operation is taking longer than that.
On the first exception the client will raise a timeout, but the server may still carry on updating the progress. So that's what the 2nd attempt is doing: it's giving it another chance to continue. That 2nd one on its own may also raise a timeout in case the operation takes longer than, now, 4 minutes. And you'll be back in the same situation.
This is equivalent to saying timeout = 2*default_timeout. You've effectively waited 2 times.
It's a way to get around longer tasks, but code-wise it reads very strangely, as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better, if you want to increase the timeout, configure a longer timeout altogether.
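The arithmetic can be checked with a toy simulation. Nothing here touches pycrunch: TaskTimeout, wait_for and wait_with_retry are hypothetical names, and time is simulated with plain numbers (seconds) instead of real waiting:

```python
class TaskTimeout(Exception):
    """Hypothetical client-side timeout; args[0] is the time waited so far."""


def wait_for(task_duration, already_waited, timeout):
    """Simulate waiting up to `timeout` seconds for a task that needs
    `task_duration` seconds in total. Return the total time waited."""
    remaining = task_duration - already_waited
    if remaining > timeout:
        raise TaskTimeout(already_waited + timeout)
    return already_waited + remaining


def wait_with_retry(task_duration, timeout):
    """Catch the first timeout and wait once more, as in the diff."""
    try:
        return wait_for(task_duration, 0, timeout)
    except TaskTimeout as exc:
        # The second wait picks up where the first one stopped.
        return wait_for(task_duration, exc.args[0], timeout)


# A 150s task with a 120s timeout only completes thanks to the second wait.
retried = wait_with_retry(150, 120)
# The same task completes in one go with a single 240s timeout.
single = wait_for(150, 0, 240)
```

In this model the retry behaves exactly like one wait with twice the timeout, and a task longer than that still fails.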
Co-authored-by: Alessio Izzo <[email protected]>
scrunch/tests/test_datasets.py
Outdated
sess = MagicMock()
body = JSONObject({
    'name': 'ds name',
    'description': 'ds description',
    'owner': user_id,
    'owner': project_id,
    # 'owner': user_id,
This comment needs to go?
I've removed the comment
scrunch/tests/test_datasets.py
Outdated
@@ -1795,7 +1846,8 @@ def test_fork_preserve_owner(self):
    'body': {
        'name': 'FORK #1 of ds name',
        'description': 'ds description',
        'owner': user_id,  # Owner preserved
        # 'owner': user_id,  # Owner preserved
        'owner': project_id,
I think it's best to start updating the tests to use project, to make sure anybody else following the implementation starts learning this is the preferred way.
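A payload test along those lines could look roughly like the sketch below. It is simplified and self-contained: the fork() function here is a hypothetical stand-in for the method under test, the payload carries only a 'body' key, and the project URL is made up. Only MagicMock from the standard library is real.

```python
from unittest.mock import MagicMock


def fork(resource, project_url):
    """Hypothetical stand-in for the method under test: build the payload
    and hand it to resource.forks.create()."""
    payload = {
        "body": {
            "name": "FORK #1 of ds name",
            "description": "ds description",
            "owner": project_url,
        }
    }
    return resource.forks.create(payload)


resource = MagicMock()
project_url = "https://example.invalid/api/projects/abc/"
fork(resource, project_url)

# Inspect the payload passed to the mocked API call.
sent = resource.forks.create.call_args[0][0]
```

Asserting on the captured payload keeps the test focused on what is sent to the server rather than on how the destination was specified.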
Updated
Hi @jjdelc ,
I've updated the try-catch block to set timeout=None to avoid getting timed out again, and have also added a comment stating the purpose.
I continue to worry that this re-attempt strategy is a bad approach. What this is doing is capturing the 1st timeout and then trying again.
Recall that this is a client-side timeout, say 60s, so in case the operation in the backend ends after 61s, the client will time out and raise an error; then the client will try again a 2nd time and proceed with another creation.
I think that if the goal is to reduce the chance of timeouts, this should simply set a long timeout on the very first request, instead of making 2 independent creation requests to the server.
Hi @jjdelc,
As we've already completed the main goal of this PR, which involves allowing Project paths & URL,
I think it would be a good idea to explore timeout handling separately.
It might need determining what timeout value might be the best fit & how this will affect the codebase/performance etc, and to add a test for testing this scenario.
Therefore, I've removed the timeout handling part from the fork() method, as this step is just an extra precaution to handle timeouts and is not the primary focus of this PR.
Very good call @shaikh-ma, I think this is best. If forks start to time out, we can address that later.
Merged on master, I'll make another minor release to include this
The owner parameter of the fork() method allows creating a dataset fork in a defined project location.
But to make it work, the owner parameter always needs to be the Project URL, e.g., https://<company>.crunch.io/api/projects/<project_id>/.
This often requires an additional step from the user: manually finding the project URL to pass as the owner parameter.
It is comparatively easier to work with project folder paths, e.g., "Client datasets | Projects A | Project A - 1 ", than with the project URL, and they are also more human friendly.
Hence, this Pull Request adds a new feature to the fork() method to allow the user to pass the project path to the owner parameter as well, as requested in the issue.
The previous functionality of the fork() method is still the same; only the behavior of the owner parameter has been updated.
On passing a project path (instead of a project URL), the fork() method now automatically maps the folder path to its respective Project URL, sparing the user the additional step of finding the Project URL manually.
Examples:
Additionally, this PR handles the timeout error (TaskProgressTimeoutError) which often occurs when forking large datasets.