Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ocrd_mets: add get_physical_pages(for_pageIds=...) #1063

Merged
merged 25 commits into from
Feb 12, 2024

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Jun 26, 2023

This could help with our per-page limit * num-pages heuristic.

If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.

Could Now also be exposed via CLI.

EDIT: also covers access to all mets:div labels now.

@bertsky bertsky requested review from MehmedGIT and kba June 26, 2023 15:52
@bertsky
Copy link
Collaborator Author

bertsky commented Jun 28, 2023

Now also fixes #821 (but I'm afraid this is one of these issues where autolinking does not work somehow).

Copy link
Contributor

@MehmedGIT MehmedGIT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.

The hard part about integrating the added code to find_files is that find_files is cluttered with many different cases. Almost all parameters can be either a string or a pattern. IMO, we should split that method into smaller methods for the separate search cases.

Consider this method invocation: find_files(fileGrp='DEFAULT'). Although we just want to get all files that belong to that file group without other search parameters, we must still iterate over all file elements of the file group to check the other search parameters even when they are not presented. So the invocation above still has a linear execution time instead of a constant time where we could just return the cached list.

@bertsky
Copy link
Collaborator Author

bertsky commented Jun 29, 2023

If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.

The hard part about integrating the added code to find_files is that find_files is cluttered with many different cases.

Yes, but here we are just concerned with the pageId selection, which is in a separate section before the differentiation into other selectors. So it should be doable (but the result of the loops in find_files is a list of file IDs, whereas the result of the loops in my get_physical_pages(for_pageIds=...) is a mere list of page IDs – perhaps a common denominator would be returning the page div and then forking .get('ID') vs. .xpath('mets:fptr/@FILEID') from there).

Consider this method invocation: find_files(fileGrp='DEFAULT'). Although we just want to get all files that belong to that file group without other search parameters, we must still iterate over all file elements of the file group to check the other search parameters even when they are not presented. So the invocation above still has a linear execution time instead of a constant time where we could just return the cached list.

Ok, so you want to factor the selector conditionals out of the loop's body? But since we already have 5 individual criteria, due to combinatorial explosion you would need lots of conditionals at the top level.

@MehmedGIT
Copy link
Contributor

MehmedGIT commented Jun 29, 2023

... perhaps a common denominator would be returning the page div and then forking .get('ID') vs. .xpath('mets:fptr/@FILEID') from there.

I understand what you mean, probably we should add an additional boolean parameter to decide whether to return a list of divs or a list of strings from get_physical_pages() to not lose backward compatibility. Then the pageId case inside find_files will become much simpler:

pageId_list = []
if pageId:
    physical_pages = self.get_physical_pages(for_pageIds=pageId, return_divs=True)  # returns divs instead of strings of ids
    for div in physical_pages:
        if self._cache_flag:
            pageId_list += self._fptr_cache[div.get('ID')]
        else:
            pageId_list += [fptr.get('FILEID') for fptr in div.findall('mets:fptr', NS)]

Btw, I just realized some loops with caching are inefficiently implemented. The outer loop should be looping over the pageId_patterns instead of self._page_cache.keys(). The former's size is always smaller or equal to the size of the latter's size. I should probably soon revisit ocrd_mets.py again and reconsider things regarding the caching.

@MehmedGIT
Copy link
Contributor

@bertsky, the available tests run just fine. Let me know if you face errors somewhere.

@bertsky
Copy link
Collaborator Author

bertsky commented Jul 4, 2023

the available tests run just fine. Let me know if you face errors somewhere.

@MehmedGIT there are some strange new indentations in e181758 but otherwise great, that's exactly what I was looking for – thanks!

@MehmedGIT
Copy link
Contributor

the available tests run just fine. Let me know if you face errors somewhere.

@MehmedGIT there are some strange new indentations in e181758 but otherwise great, that's exactly what I was looking for – thanks!

The new indentations are intended. They are rather fixes to weird previous indentations and spaces on empty lines left from cache implementation. I should have created a separate commit to make this more obvious.

@kba
Copy link
Member

kba commented Jan 15, 2024

I have now merged master into the PR and changed the logic for ocrd workspace list-page accordingly. The default behavior, including with regards to chunking (-C/-D) should be unchanged. The -f json output has changed because we can now have multiple fields per entry, not just a pageId but the rest should behave the same if only -k ID is provided (which is the default). Labels (@ID, @ORDER, @ORDERLABEL now, later also @CONTENTIDS) within an entry are separated by tab.

For example:

 ❯ ocrd workspace list-page -f comma-separated -k ID -k ORDER -D 3 
PHYS_0001       1,PHYS_0002     2,PHYS_0003     3,PHYS_0004     4,PHYS_0005     5,PHYS_0006     6,PHYS_0008     8,PHYS_0009     9,PHYS_0010     10
PHYS_0011       11,PHYS_0012    12,PHYS_0013    13,PHYS_0014    14,PHYS_0015    15,PHYS_0016    16,PHYS_0017    17,PHYS_0018    18,PHYS_0019    19
PHYS_0020       20,PHYS_0022    22,PHYS_0023    23,PHYS_0024    24,PHYS_0025    25,PHYS_0026    26,PHYS_0027    27,PHYS_0028    28,PHYS_0029    29

It's not as efficient as it was before the merge but I wanted to keep it simple so we can discuss in the call.

Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.

@bertsky
Copy link
Collaborator Author

bertsky commented Jan 15, 2024

Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.

Ideally also label support in the page range selector.

@kba
Copy link
Member

kba commented Jan 16, 2024

Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.

Ideally also label support in the page range selector.

@CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).

Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.

Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).

(a) is more predictable and slightly more efficient but more complicated to implement.
(b) is easier for the user and does not require extending the range parser but can lead to strange behavior if too automagical (e.g. when matching the start to ORDER and the end to ORDERLABEL or ID which might be inconsistent)

@bertsky
Copy link
Collaborator Author

bertsky commented Jan 16, 2024

@CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).

fantastic!

Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.

oh, right! I forgot about the caching.

Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).

I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?

I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

Regarding the potential confusion between facets (even when matching both start and end consistently):

  • ORDER with ID: does not introduce ambiguity, since the former is int and the later must not start with int
  • ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
  • ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

@kba
Copy link
Member

kba commented Jan 16, 2024

@CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).

fantastic!

Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.

oh, right! I forgot about the caching.

Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).

I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?

I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

Regarding the potential confusion between facets (even when matching both start and end consistently):

  • ORDER with ID: does not introduce ambiguity, since the former is int and the later must not start with int
  • ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
  • ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

OK, I have implemented the range behavior over any of the page mets:div attributes, i.e. you can now do -g 1..10 and it will use, in order of definition of METS_PAGE_DIV_ATTRIBUTE, ID, ORDER, ORDERLABEL, LABEL, CONTENTIDS.

OcrdMets._page_cache is now a dict[METS_PAGE_DIV_ATTRIBUTE, dict[str, str]], ie. there is one for every attribute.

It was a fairly convoluted process to make this generic solution, so additional eyes on the changes much appreciated @bertsky @MehmedGIT I have to pause now because my brain hurts with all the any and list comprehensions etc.

@kba
Copy link
Member

kba commented Jan 17, 2024

Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).

I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?

b) has been implemented, you can do 1..10 or PHYS_0001..PHYS_0010 or page 1..page 10.

The attribute to use is determined once and then used consistently for matching the attributes.

Moreover, I've added a check to generate_range that raises a ValueError if the non-numeric parts of a range do not match, to avoid ranges like PHYS_0001..page 10 on a syntactical level.

I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

This is feasible when caching is enabled but expensive without caching because we'd need to check every attribute against all the other attributes.

I could add a warning when this is the case during the cache fill, if that helps.

Regarding the potential confusion between facets (even when matching both start and end consistently):

  • ORDER with ID: does not introduce ambiguity, since the former is int and the later must not start with int
  • ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
  • ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

As I said above, mixing attribute sources for matching against the patterns should be prevented by checking the attribute to check before iterating over the pages (caching) or on the first match (no caching) on a programmatic level and with the generate_range requiring matching non-numeric parts on a syntactical level.

I think this is ready for merge unless you think I missed something.

@bertsky
Copy link
Collaborator Author

bertsky commented Jan 17, 2024

I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

This is feasible when caching is enabled but expensive without caching because we'd need to check every attribute against all the other attributes.

I could add a warning when this is the case during the cache fill, if that helps.

Oh, of course. No, during cache fill would only confuse users I'm afraid. (Esp. with METS server, the warning may be distant from the conflicting request.)

Regarding the potential confusion between facets (even when matching both start and end consistently):

  • ORDER with ID: does not introduce ambiguity, since the former is int and the later must not start with int
  • ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
  • ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

As I said above, mixing attribute sources for matching against the patterns should be prevented by checking the attribute to check before iterating over the pages (caching) or on the first match (no caching) on a programmatic level and with the generate_range requiring matching non-numeric parts on a syntactical level.

That paragraph was not about mixing facets, but about confusion between match and user intent. (For example, the user meant ORDERLABEL, but ID matches, perhaps from another page.)

ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
ocrd_models/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
Copy link
Collaborator Author

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better but not quite ready IMHO

src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
src/ocrd_models/ocrd_mets.py Outdated Show resolved Hide resolved
@kba kba merged commit 89a5446 into OCR-D:master Feb 12, 2024
22 checks passed
@bertsky bertsky deleted the ocrd-mets-get-pages-for-pageids branch June 6, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants