Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing Project Paths to be passed to scrunch.dataset.fork() method #463

Merged
merged 10 commits into from
Sep 18, 2024
45 changes: 25 additions & 20 deletions scrunch/datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,6 @@
pd = None


if sys.version_info.major == 3:
from collections.abc import Mapping
else:
from collections import Mapping

import six

import pycrunch
Expand Down Expand Up @@ -51,8 +46,10 @@

if six.PY2: # pragma: no cover
from urlparse import urljoin
from collections import Mapping
else:
from urllib.parse import urljoin
from collections.abc import Mapping


_MR_TYPE = 'multiple_response'
Expand Down Expand Up @@ -2387,23 +2384,23 @@ def fork(self, description=None, name=None, is_published=False,
:param preserve_owner: bool, default=True
If True, the owner of the fork will be the same as the parent
dataset otherwise the owner will be the current user in the
session and the Dataset will be set under `Persona Project`
session and the Dataset will be set under `Personal Project`.
:kwargs owner: str (optional)
When preserve_owner=False, the owner of fork will be the set
according to the passed Project URL or Folder path.

:returns _fork: scrunch.datasets.BaseDataset
"""
from scrunch.mutable_dataset import MutableDataset
nforks = len(self.resource.forks.index)

if name is None:
nforks = len(self.resource.forks.index)
dsname = self.resource.body.name
if six.PY2:
name = "FORK #{} of {}".format(
nforks + 1,
self.resource.body.name.encode("ascii", "ignore"))
else:
name = "FORK #{} of {}".format(
nforks + 1,
self.resource.body.name)
if description is None:
description = self.resource.body.description
dsname = dsname.encode("ascii", "ignore")
name = "FORK #{} of {}".format(nforks + 1, dsname)

description = description or self.resource.body.description

body = dict(
name=name,
Expand All @@ -2412,14 +2409,22 @@ def fork(self, description=None, name=None, is_published=False,
**kwargs
)

owner = kwargs.get("owner")
if preserve_owner:
body['owner'] = self.resource.body.owner
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they choose to preserve owner, what they mean is that it should be in the same place as the current project. The API already returns the project URL for the owner attribute. So this may also be self.resource.project.

And we can slowly stop the conversaion about owners, because that concept does not exist in the API anymore and will bring confusion to users that individuals have ownership of datasets.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc I am a bit confused here - We want to keep the preserve_owner kwarg and add a new one, project, explicitly to the signature by your suggestion; then basically fall back to self.resource.project (simple code change to abandon the old owner property) if project is not given? Just making sure we get this right!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, really at this point owner and project are equivalent. There is no concept of an individual owner in Crunch.

It looks like preserve_owner is the default behavior when no owner is passed in, if preserve_owner is False, and no owner is passed, then there's no explicit code path for this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming.

Copy link
Contributor Author

@shaikh-ma shaikh-ma Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjdelc , the self.resource.project.self fails when forking a dataset into the personal project., i.e., when the preserve_owner = True for a dataset in the personal project.
Hence, I've kept it unchanged for now i.e, body['owner'] = self.resource.body.owner.

Additionally, I've added the project parameter as suggested and also have added the test in test_dataset.py:
Passing project
Passing owner
Not passing project and owner


Also, I've added a few validations,

Validation-preserve_owner with Project
Validation - passing both Project and Owner

elif owner:
body["owner"] = (
owner if owner.startswith("http") else get_project(owner).url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be encouraged to be Crunch objects rather than strings. For projects don't allow strings, urls or paths. Because you'll have problems validating here.

Have this be a Project instance.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc Double-checking on this - Isn't get_project() validating the string for us? It will return a Project instance if the path can be resolved, and raise an InvalidPathError it cannot:

InvalidPathError                          Traceback (most recent call last)
Cell In[8], line 7
      4 ygdc_crunch.Crunch()
      6 name_pr = "TechOps|Alex 3"
----> 7 pr = scrunch.get_project(name_pr)

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:206, in get_project(project, connection)
    203 _project = Project(ret)
    205 if sub_project:
--> 206     _project = _project.get(sub_project)
    208 return _project

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:552, in Project.get(self, path)
    550 for p_name in Path(path).get_parts():
    551     try:
--> 552         node = node.get_child(p_name)
    553     except KeyError:
    554         raise InvalidPathError('Project not found %s' % p_name)

File ~\AppData\Local\Continuum\anaconda2\envs\py311test\Lib\site-packages\scrunch\datasets.py:571, in Project.get_child(self, name)
    568         return Project(tup.entity)
    569     return self.root.dataset[name]
--> 571 raise InvalidPathError('Invalid path: %s' % name)

InvalidPathError: Invalid path: Alex 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. I missed that part of the if then yes this is correct

Copy link

@alexbuchhammer alexbuchhammer Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jjdelc - so a green light on allowing path strings?

)

# not returning a dataset
payload = shoji_entity_wrapper(body)
_fork = self.resource.forks.create(payload).refresh()
# return a MutableDataset always
fork_ds = MutableDataset(_fork) # Fork has same editor as current user
return fork_ds
try:
_fork = self.resource.forks.create(payload).refresh()
except TaskProgressTimeoutError as exc:
_fork = exc.entity.wait_progress(exc.response).refresh()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you get a timeout, you should continue to raise, this is an invalid call with the exception. If it timed out it quite likely failed, and the code flow shouldn't continue bc you won't have a fork to carry on with.

Copy link

@alexbuchhammer alexbuchhammer Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @jjdelc Do you refer here to catching the timeout inside the fork() method directly? Because it is common practice for users to catch the timeouts on forks with a progress tracking against a larger timeout. We simply do not want to have this inside the method here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that if pycrunch raised a timeout, there's no recovery. Calling .wait_progress with the response of the timeout is not going to work, and it won't achieve anything (maybe some python error) but this line looks like it is attempting some fallback way to recover from the timeout and still obtain a _fork instance from it.

Copy link

@alexbuchhammer alexbuchhammer Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - got it now, @jjdelc! Make sense. @shaikh-ma I think if we would want control the timeout AND also return the fork when it completed, we would need to do this:

... < catch & wait >
entity = exc.entity.refresh()
fork_ds = scrunch.get_mutable_dataset(entity.body.id)

I think this should work (at least I think that's what we do in other scenarios)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line _fork = exc.entity.wait_progress(exc.response).refresh() worked as expected (without giving any error) when I tried out with a dataset.

Handling TimeOut error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjdelc Looks like handling the timeout seems to work in our example, as demonstrated by @shaikh-ma - what do you think? I guess this is the last open point, then we should be good? The test has also been added.

Appreciate your time (when you can give it some 👀)! Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this looks like it is continuing to wait after the client has timed out, say pycrunch's timeout is 2m and the operation is taking longer than that.

On the first exception the client will raise TimeOut, but the server may still carry on updating the progress. So that's what that 2nd attempt is doing, it's giving it another chance to continue. That 2nd one on its own may also raise Timeout in case it takes longer than, now, 4 minutes. And you'll be back at the same situation.

This is equivalent of saying timeout = 2*default_timeout. You've effectively waited 2 times.

It's a way to get around longer tasks, but code wise it reads very strange as it is not clear what the intention of this 2nd attempt is. Maybe add some comments, or even better if you want to increase the timeout then best to configure a longer timeout all together


return MutableDataset(_fork)

def replace_values(self, variables, filter=None, literal_subvar=False, timeout=60):
"""
Expand Down
Loading