Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Merged

Conversation

shaikh-ma
Copy link
Contributor

@shaikh-ma shaikh-ma commented Jul 25, 2024

The owner parameter of the fork() method allows creating a dataset fork in a defined project location.
But to make it work the owner parameter always needs to be the Project URL e.g., https://<company>.crunch.io/api/projects/<project_id>/.

This often requires the user, an additional step, to manually find out the project URL to pass as the owner parameter.

It is comparatively easier to work with the project folder paths e.g., "Client datasets | Projects A | Project A - 1 " than the project URL, which also is more human friendly.
Hence, this Pull Request adds a new feature to fork() method to allow the user to pass the project path to the owner parameter as well, as requested in the issue.

The previous functionality of the fork() method is still the same, only the behavior of the owner parameter has been updated.

On passing project path (instead of a project URL), the fork() method now automatically maps the folder path to its respective Project URL. Thus, preventing the user an additional step for finding out the Project URL manually.

Examples:

import scrunch

ds = scrunch.get_dataset("Dataset X", project="Project A")

# Passing project path to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="Project B | Project 1")


# Passing project URL to `owner`
ds_fork = ds.fork(preserve_owner=False, owner="https://<company>.crunch.io/api/projects/<project_id>/")

Additionally, this PR handles the timeout error (TaskProgressTimeoutError) which often occurs when forking large datasets.

@shaikh-ma shaikh-ma changed the title Task/update fork method to allow path string Task/update fork method to allow Project Paths Jul 25, 2024
@shaikh-ma shaikh-ma changed the title Task/update fork method to allow Project Paths Allowing Project Paths to be passed to fork() methods Jul 26, 2024
@shaikh-ma shaikh-ma marked this pull request as draft July 26, 2024 12:00
@shaikh-ma shaikh-ma marked this pull request as ready for review August 1, 2024 08:38
@shaikh-ma shaikh-ma changed the title Allowing Project Paths to be passed to fork() methods Allowing Project Paths to be passed to scrunch.dataset.fork() method Aug 2, 2024
@shaikh-ma
Copy link
Contributor Author

Hello @jjdelc, @aless10, 👋

Please could you take a look at this PR, when you get a chance.

Thanks

@jamesrkg jamesrkg requested review from jjdelc and aless10 August 14, 2024 07:35
Copy link
Contributor

@jjdelc jjdelc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of this ticket is correct, it is a good direction to move away from dataset "owners" and think of them as living inside a project.

But I don't think this implementation reflects that.

What I suggest is using a new argument called project and have it receive a Project instance (or maybe a project ID).

That way you can call dataset.fork(project=destination_project)

And I think that the default behavior should be:

def fork(self, ..., project=None, ...):
  project = project or self.project

This way, new forks without any location information would be created on the same project as the current dataset (the user being an editor on the dataset to issue a fork, should also have edit permissions on the project to place the new fork)

Bonus points: remove the owner stuff completely, it is not an official use case anymore, is anybody using it?

@shaikh-ma
Copy link
Contributor Author

shaikh-ma commented Aug 15, 2024

This sounds a great idea @jjdelc! 🙂

Initially @alextanski & I, had a similar thought to add a new parameter "project".

But we had few concerns about it, then:

  1. It might become confusing both the developer & the users when/how to use the owner & project parameter. And their behaviour with preserve_owner.
  2. Handling of undesired scenarios e.g., user might be passing both these parameters or none. Et cetra.

Hence, the current changes were aimed to not change the current behaviour of the fork() method while adding the needed features additionally. So as to maintain simplicity & backward compatibility.

As you've now suggested to remove the "owner" completely, just a few questions:

  1. How should we handle the scenario when the user needs the fork dataset to be created in their personal project?
  2. Should we rename or remove the preserve_owner parameter?

Thanks

@jjdelc
Copy link
Contributor

jjdelc commented Aug 19, 2024

You make a good point about backwards compatibility, and I think it should be easy to handle in inside the method.

For now, you may need to keep the owner behavior as it is. And add the new project field. Handle inside the call if both are sent raise an error and suggest to only send project, if none are sent, follow the current behavior.

My other point is not to require the user to send URLs, but allow for a Project() instance to be received, I would like to support arbitrary paths too, but you'll risk it that the path may not exist and have to then take care of creating and such, best to have that responsibility outside of this .fork() call and once they solved that they would have a Project() instance.

Also, add a test checking your payload.

if preserve_owner:
body['owner'] = self.resource.body.owner
elif owner:
body["owner"] = (
owner if owner.startswith("http") else get_project(owner).url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be encouraged to be Crunch objects rather than strings. For projects don't allow strings, urls or paths. Because you'll have problems validating here.

Have this be a Project instance.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc Double-checking on this - Isn't get_project() validating the string for us? It will return a Project instance if the path can be resolved, and raise an InvalidPathError it cannot:

InvalidPathError                          Traceback (most recent call last)
Cell In[8], line 7
      4 ygdc_crunch.Crunch()
      6 name_pr = "TechOps|Alex 3"
----> 7 pr = scrunch.get_project(name_pr)

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:206, in get_project(project, connection)
    203 _project = Project(ret)
    205 if sub_project:
--> 206     _project = _project.get(sub_project)
    208 return _project

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:552, in Project.get(self, path)
    550 for p_name in Path(path).get_parts():
    551     try:
--> 552         node = node.get_child(p_name)
    553     except KeyError:
    554         raise InvalidPathError('Project not found %s' % p_name)

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:571, in Project.get_child(self, name)
    568         return Project(tup.entity)
    569     return self.root.dataset[name]
--> 571 raise InvalidPathError('Invalid path: %s' % name)

InvalidPathError: Invalid path: Alex 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I missed that part of the if then yes this is correct

Copy link

@alexbuchhammer alexbuchhammer Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jjdelc - so a green light on allowing path strings?

@@ -2412,14 +2409,22 @@ def fork(self, description=None, name=None, is_published=False,
**kwargs
)

owner = kwargs.get("owner")
if preserve_owner:
body['owner'] = self.resource.body.owner
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they choose to preserve owner, what they mean is that it should be in the same place as the current project. The API already returns the project URL for the owner attribute. So this may also be self.resource.project.

And we can slowly stop the conversaion about owners, because that concept does not exist in the API anymore and will bring confusion to users that individuals have ownership of datasets.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc I am a bit confused here - We want to keep the preserve_owner kwarg and add a new one, project, explicitly to the signature by your suggestion; then basically fall back to self.resource.project (simple code change to abandon the old owner property) if project is not given? Just making sure we get this right!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, really at this point owner and project are equivalent. There is no concept of an individual owner in Crunch.

It looks like preserve_owner is the default behavior when no owner is passed in, if preserve_owner is False, and no owner is passed, then there's no explicit code path for this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming.

Copy link
Contributor Author

@shaikh-ma shaikh-ma Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjdelc , the self.resource.project.self fails when forking a dataset into the personal project., i.e., when the preserve_owner = True for a dataset in the personal project.
Hence, I've kept it unchanged for now i.e, body['owner'] = self.resource.body.owner.

Additionally, I've added the project parameter as suggested and also have added the test in test_dataset.py:
Passing project
Passing owner
Not passing project and owner


Also, I've added a few validations,

Validation-preserve_owner with Project
Validation - passing both Project and Owner

try:
_fork = self.resource.forks.create(payload).refresh()
except TaskProgressTimeoutError as exc:
_fork = exc.entity.wait_progress(exc.response).refresh()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you get a timeout, you should continue to raise, this is an invalid call with the exception. If it timed out it quite likely failed, and the code flow shouldn't continue bc you won't have a fork to carry on with.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc Do you refer here to catching the timeout inside the fork() method directly? Because it is common practice for users to catch the timeouts on forks with a progress tracking against a larger timeout. We simply do not want to have this inside the method here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that if pycrunch raised a timeout, there's no recovery. Calling .wait_progress with the response of the timeout is not going to work, and it won't achieve anything (maybe some python error) but this line looks like it is attempting some fallback way to recover from the timeout and still obtain a _fork instance from it.

Copy link

@alexbuchhammer alexbuchhammer Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - got it now, @jjdelc! Make sense. @shaikh-ma I think if we would want control the timeout AND also return the fork when it completed, we would need to do this:

... < catch & wait >
entity = exc.entity.refresh()
fork_ds = scrunch.get_mutable_dataset(entity.body.id)

I think this should work (at least I think that's what we do in other scenarios)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line _fork = exc.entity.wait_progress(exc.response).refresh() worked as expected (without giving any error) when I tried out with a dataset.

Handling TimeOut error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjdelc Looks like handling the timeout seems to work in our example, as demonstrated by @shaikh-ma - what do you think? I guess this is the last open point, then we should be good? The test has also been added.

Appreciate your time (when you can give it some 👀)! Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that.

On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation.

This is equivalent of saying timeout = 2*default_timeout. You've effectively waited 2 times.

It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together

try:
_fork = self.resource.forks.create(payload).refresh()
except TaskProgressTimeoutError as exc:
_fork = exc.entity.wait_progress(exc.response).refresh()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that.

On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation.

This is equivalent of saying timeout = 2*default_timeout. You've effectively waited 2 times.

It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together

sess = MagicMock()
body = JSONObject({
'name': 'ds name',
'description': 'ds description',
'owner': user_id,
'owner': project_id,
# 'owner': user_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment needs to go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the comment

@@ -1795,7 +1846,8 @@ def test_fork_preserve_owner(self):
'body': {
'name': 'FORK #1 of ds name',
'description': 'ds description',
'owner': user_id, # Owner preserved
# 'owner': user_id, # Owner preserved
'owner': project_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best to start updating the tests to use project to make sure anybody else following the implementation starts learning this is the preferred way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jjdelc ,

I've updated the try-catch block to set timeout=None to avoid getting timed out again & have also added a comment stating the purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I continue to worry that this re-attempt strategy is a bad approach. What this is doing is capturing the 1st timeout and then trying again.

Recall that this is a client-side timeout, say 60s so in case the opreation in the backend ends after 61s, the client will time out, and raise an error, then the client will try again a 2nd time and proceed for a next creation.

I think that if the issue is to reduce timeout changes, this should simply set a long timeout on the very fist request. Instead of making 2 independent creation requests to the server.

Copy link
Contributor Author

@shaikh-ma shaikh-ma Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jjdelc,

As we've already completed the main goal of this PR, which involves allowing Project paths & URL,
I think it would be a good idea to explore timeout handling separately.
It might need determining what timeout value might be the best fit & how this will affect the codebase/performance etc, and to add a test for testing this scenario.

Therefore, I've removed the timeout handling part from the fork() method as this step is just an extra precaution to handle timeouts, but it's not the primary focus of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good call @shaikh-ma , I think this is best, if the fork starts to timeout, we can address that later.

@shaikh-ma
Copy link
Contributor Author

shaikh-ma commented Sep 16, 2024

Hi @jjdelc, @aless10 👋 ,

Please let me know if anything else is needed from our end or it's good to be merged. 😄

@jjdelc jjdelc merged commit c94645f into Crunch-io:master Sep 18, 2024
5 checks passed
@jjdelc
Copy link
Contributor

jjdelc commented Sep 18, 2024

Merged on master, I'll make another minor release to include this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants