Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise FileExistsError in Resolver.workspace_from_url if clobber_mets is False #1268

Merged
merged 2 commits into from
Aug 7, 2024

Conversation

kba
Copy link
Member

@kba kba commented Aug 6, 2024

As laid out in #563, the clobber_mets behavior was different from what the docstring said. Instead of raising and Exception if the METS file already existed and clobber_mets being False, the METS file was silently skipped.

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

@bertsky
Copy link
Collaborator

bertsky commented Aug 6, 2024

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

I tend to agree, but the discussion in #425 initially outlined a different scenario:

I do not understand the use-cases for clobber_mets=False, and I know of no example of it
The original idea was that you pass a METS URL which is only downloaded the first time it is to be used. Subsequent processors should not overwrite the METS every time they process a workspace.

However, you then followed up with the following explanation, which already speaks to the current line of thought:

The METS might not be downloaded yet. That's why we introduced the idea of a workspace which was initially just a crutch to make it easier for processors and we were careful not to use it as a first-class concept e.g. in our specs. Instead of handling the intricacies of downloading full/partial representations of the work, they now just had to act on a directory and files. Since then however, workspaces have become the standard way to interact with METS/Images/PAGE and most if not all the facilities to handle non-local data outside ocrd workspace has become obsolete.

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore.

Copy link
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@kba
Copy link
Member Author

kba commented Aug 7, 2024

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore

Yeah, it is a bit messy due to poor initial planning and yes, the "clone once then work on the local workspace" approach is the default now, the "on demand" approach is largely obsolete. But I'd keep clobber_mets=False to avoid introducing a new source of mysterious behavior. I cannot really imagine a scenario where overwriting the METS during clone would be preferable to removing the workspace manually or cloning into a different directory but the flag/kwarg is there if people do need that.

@kba kba merged commit 1c8b9db into master Aug 7, 2024
22 checks passed
@kba kba deleted the fix-563 branch August 7, 2024 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants