ocrd_mets: add get_physical_pages(for_pageIds=...) #1063

bertsky · 2023-06-26T15:52:35Z

This could help with our per-page limit * num-pages heuristic.

If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.

~~Could~~ Now also exposed via the CLI.

EDIT: also covers access to all mets:div labels now.

bertsky · 2023-06-28T15:34:37Z

Now also fixes #821 (but I'm afraid this is one of these issues where autolinking does not work somehow).

MehmedGIT

Looks good to me.

> If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.

The hard part about integrating the added code to find_files is that find_files is cluttered with many different cases. Almost all parameters can be either a string or a pattern. IMO, we should split that method into smaller methods for the separate search cases.

Consider this method invocation: find_files(fileGrp='DEFAULT'). Although we just want to get all files that belong to that file group, without any other search parameters, we must still iterate over all file elements of the file group to check the other search parameters, even when they are not provided. So the invocation above still runs in linear time instead of constant time, where we could just return the cached list.
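The fast path described here could look roughly like the sketch below. `FileIndex`, its `_files_by_group` cache and the reduced `find_files` signature are hypothetical stand-ins, not the actual `OcrdMets` code; the only point is that a lookup by exact `fileGrp` with no further criteria can hand back the cached list without inspecting each file.

```python
from typing import Dict, List, Optional


class FileIndex:
    """Hypothetical stand-in for a file cache keyed by file group."""

    def __init__(self) -> None:
        self._files_by_group: Dict[str, List[dict]] = {}

    def add_file(self, file_grp: str, file_id: str, mimetype: str) -> None:
        entry = {"fileGrp": file_grp, "ID": file_id, "mimetype": mimetype}
        self._files_by_group.setdefault(file_grp, []).append(entry)

    def find_files(self, file_grp: str, mimetype: Optional[str] = None) -> List[dict]:
        files = self._files_by_group.get(file_grp, [])
        if mimetype is None:
            # Fast path: no further criteria, so return the cached list itself
            # (constant time; callers must not mutate it).
            return files
        # Slow path: only now iterate and test the remaining criterion.
        return [f for f in files if f["mimetype"] == mimetype]
```

With more criteria the same idea applies: test up front which parameters were actually given, and only fall back to the per-file loop when at least one of them needs it.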

bertsky · 2023-06-29T12:08:05Z

> > If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.
>
> The hard part about integrating the added code to find_files is that find_files is cluttered with many different cases.

Yes, but here we are just concerned with the pageId selection, which is in a separate section before the differentiation into other selectors. So it should be doable (but the result of the loops in find_files is a list of file IDs, whereas the result of the loops in my get_physical_pages(for_pageIds=...) is a mere list of page IDs – perhaps a common denominator would be returning the page div and then forking .get('ID') vs. .xpath('mets:fptr/@FILEID') from there).
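The "common denominator" idea can be sketched as follows: one shared step selects the page divs, and two thin helpers derive page IDs or file IDs from them. This is only an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml (so `findall` replaces `.xpath`), and `select_page_divs`, `page_ids_of`, `file_ids_of` and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up miniature physical structMap for demonstration purposes.
SAMPLE = """
<mets:structMap xmlns:mets="http://www.loc.gov/METS/" TYPE="PHYSICAL">
  <mets:div TYPE="physSequence">
    <mets:div TYPE="page" ID="PHYS_0001">
      <mets:fptr FILEID="IMG_0001"/>
      <mets:fptr FILEID="OCR_0001"/>
    </mets:div>
    <mets:div TYPE="page" ID="PHYS_0002">
      <mets:fptr FILEID="IMG_0002"/>
    </mets:div>
  </mets:div>
</mets:structMap>
"""


def select_page_divs(root, page_ids):
    """Shared step: return the page divs whose ID is in page_ids."""
    return [div for div in root.iterfind(".//mets:div[@TYPE='page']", NS)
            if div.get("ID") in page_ids]


def page_ids_of(divs):
    """Fork 1: what a page listing needs."""
    return [div.get("ID") for div in divs]


def file_ids_of(divs):
    """Fork 2: what a file search by page needs."""
    return [fptr.get("FILEID")
            for div in divs
            for fptr in div.findall("mets:fptr", NS)]
```

Both callers then share the (more complicated) pattern matching and differ only in the last, trivial projection step.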

> Consider this method invocation: find_files(fileGrp='DEFAULT'). Although we just want to get all files that belong to that file group, without any other search parameters, we must still iterate over all file elements of the file group to check the other search parameters, even when they are not provided. So the invocation above still runs in linear time instead of constant time, where we could just return the cached list.

Ok, so you want to factor the selector conditionals out of the loop's body? But since we already have 5 individual criteria, due to combinatorial explosion you would need lots of conditionals at the top level.

MehmedGIT · 2023-06-29T13:33:17Z

> ... perhaps a common denominator would be returning the page div and then forking .get('ID') vs. .xpath('mets:fptr/@FILEID') from there.

I understand what you mean. We should probably add a boolean parameter to get_physical_pages() that decides whether it returns a list of divs or a list of strings, so that we do not lose backward compatibility. Then the pageId case inside find_files will become much simpler:

pageId_list = []
if pageId:
    physical_pages = self.get_physical_pages(for_pageIds=pageId, return_divs=True)  # returns divs instead of strings of ids
    for div in physical_pages:
        if self._cache_flag:
            pageId_list += self._fptr_cache[div.get('ID')]
        else:
            pageId_list += [fptr.get('FILEID') for fptr in div.findall('mets:fptr', NS)]

Btw, I just realized that some loops with caching are implemented inefficiently. The outer loop should iterate over pageId_patterns instead of self._page_cache.keys(), since the former is never larger than the latter. I should probably revisit ocrd_mets.py soon and reconsider the caching.
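A minimal sketch of that loop order, assuming the patterns are either exact ID strings or compiled regular expressions. `match_page_ids` and the plain dict standing in for the page cache are hypothetical, not the code in ocrd_mets.py.

```python
import re
from typing import Dict, List, Union

Pattern = Union[str, re.Pattern]


def match_page_ids(page_cache: Dict[str, object], patterns: List[Pattern]) -> List[str]:
    matched: List[str] = []
    for pattern in patterns:            # outer loop: the (usually short) pattern list
        if isinstance(pattern, str):
            if pattern in page_cache:   # exact ID: a single dictionary lookup
                matched.append(pattern)
        else:
            # Regular expression: scanning the cached keys cannot be avoided here.
            matched.extend(key for key in page_cache if pattern.fullmatch(key))
    return matched
```

For exact IDs this does one lookup per pattern rather than one comparison per cached page and pattern.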

MehmedGIT · 2023-07-04T11:10:58Z

@bertsky, the available tests run just fine. Let me know if you face errors somewhere.

bertsky · 2023-07-04T11:29:50Z

> the available tests run just fine. Let me know if you face errors somewhere.

@MehmedGIT there are some strange new indentations in e181758 but otherwise great, that's exactly what I was looking for – thanks!

MehmedGIT · 2023-07-04T11:55:09Z

> > the available tests run just fine. Let me know if you face errors somewhere.
>
> @MehmedGIT there are some strange new indentations in e181758 but otherwise great, that's exactly what I was looking for – thanks!

The new indentations are intended. They fix odd earlier indentations and whitespace on empty lines left over from the cache implementation. I should have created a separate commit to make this more obvious.

kba · 2024-01-15T13:56:52Z

I have now merged master into the PR and changed the logic for ocrd workspace list-page accordingly. The default behavior, including with regard to chunking (-C/-D), should be unchanged. The -f json output has changed because we can now have multiple fields per entry, not just a pageId, but the rest should behave the same if only -k ID is provided (which is the default). Labels (@ID, @ORDER, @ORDERLABEL now, later also @CONTENTIDS) within an entry are separated by a tab.

For example:

 ❯ ocrd workspace list-page -f comma-separated -k ID -k ORDER -D 3 
PHYS_0001       1,PHYS_0002     2,PHYS_0003     3,PHYS_0004     4,PHYS_0005     5,PHYS_0006     6,PHYS_0008     8,PHYS_0009     9,PHYS_0010     10
PHYS_0011       11,PHYS_0012    12,PHYS_0013    13,PHYS_0014    14,PHYS_0015    15,PHYS_0016    16,PHYS_0017    17,PHYS_0018    18,PHYS_0019    19
PHYS_0020       20,PHYS_0022    22,PHYS_0023    23,PHYS_0024    24,PHYS_0025    25,PHYS_0026    26,PHYS_0027    27,PHYS_0028    28,PHYS_0029    29

It's not as efficient as it was before the merge but I wanted to keep it simple so we can discuss in the call.

Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.
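To make the output layout in the example above concrete: fields of one page are joined by a tab, pages by a comma, and each chunk goes on its own line. The sketch below reproduces only that layout. `split_into_chunks` and `format_comma_separated` are hypothetical helpers, and the way pages are distributed over chunks here (consecutive, near-equal sizes) is an assumption that may differ from what the CLI actually does.

```python
from typing import List, Sequence


def split_into_chunks(items: Sequence, n_chunks: int) -> List[list]:
    """Split items into n_chunks consecutive chunks of near-equal size."""
    size, rest = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rest else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def format_comma_separated(pages: Sequence[Sequence[str]], n_chunks: int = 1) -> str:
    """One line per chunk; pages joined by ',', fields of a page joined by a tab."""
    lines = []
    for chunk in split_into_chunks(pages, n_chunks):
        lines.append(",".join("\t".join(fields) for fields in chunk))
    return "\n".join(lines)
```

For instance, four pages with the fields ID and ORDER and `n_chunks=2` yield two lines with two tab-separated pairs each.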

bertsky · 2024-01-15T14:17:21Z

> Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.

Ideally also label support in the page range selector.

kba · 2024-01-16T10:55:52Z

> > Still WIP, need to adapt for @CONTENTIDS and the update-page functionality.
>
> Ideally also label support in the page range selector.

@CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).

Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.

Should we make this (a) explicit (-g order:9..order:25), or (b) should this work automagically (-g 9..25)?

(a) is more predictable and slightly more efficient but more complicated to implement.
(b) is easier for the user and does not require extending the range parser but can lead to strange behavior if too automagical (e.g. when matching the start to ORDER and the end to ORDERLABEL or ID which might be inconsistent)

bertsky · 2024-01-16T12:32:45Z

> @CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).

fantastic!

> Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.

oh, right! I forgot about the caching.

> Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).

I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?

I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

Regarding the potential confusion between facets (even when matching both start and end consistently):

- ORDER with ID: does not introduce ambiguity, since the former is an int and the latter must not start with a digit
- ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
- ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

kba · 2024-01-16T16:20:47Z

> > @CONTENTIDS is supported, the update-page is now generic for all the attributes (ocrd workspace --set ORDER 2 --set CONTENTIDS urn:foo:... PHYS_0001).
>
> fantastic!
>
> > Now working on implementing page range over labels. It's a bit more complicated because the self._page_cache currently only supports mapping from ID.
>
> oh, right! I forgot about the caching.
>
> > Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).
>
> I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?
>
> I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.
>
> Regarding the potential confusion between facets (even when matching both start and end consistently):
>
> - ORDER with ID: does not introduce ambiguity, since the former is an int and the latter must not start with a digit
> - ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
> - ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

OK, I have implemented the range behavior over any of the page mets:div attributes, i.e. you can now do -g 1..10 and it will try the attributes in the order in which METS_PAGE_DIV_ATTRIBUTE defines them: ID, ORDER, ORDERLABEL, LABEL, CONTENTIDS.

OcrdMets._page_cache is now a dict[METS_PAGE_DIV_ATTRIBUTE, dict[str, str]], i.e. there is one mapping for every attribute.

It was a fairly convoluted process to arrive at this generic solution, so additional eyes on the changes would be much appreciated, @bertsky @MehmedGIT. I have to pause now because my brain hurts from all the any and list comprehensions etc.
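As a rough illustration of the per-attribute cache and the "first attribute in definition order wins" lookup, here is a sketch with plain dicts. `PAGE_ATTRIBUTES`, `build_page_cache` and `attribute_for` are hypothetical names, pages are represented as simple dicts instead of mets:div elements, and mapping each attribute value to the page ID is an assumption about what the inner dict holds.

```python
from typing import Dict, List, Optional

# Assumed order, taken from the comment above; the real enum may differ.
PAGE_ATTRIBUTES = ["ID", "ORDER", "ORDERLABEL", "LABEL", "CONTENTIDS"]


def build_page_cache(pages: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Build one 'attribute value -> page ID' mapping per attribute."""
    cache: Dict[str, Dict[str, str]] = {attr: {} for attr in PAGE_ATTRIBUTES}
    for page in pages:
        for attr in PAGE_ATTRIBUTES:
            if attr in page:
                cache[attr][page[attr]] = page["ID"]
    return cache


def attribute_for(cache: Dict[str, Dict[str, str]], value: str) -> Optional[str]:
    """Return the first attribute (in definition order) that has this value."""
    for attr in PAGE_ATTRIBUTES:
        if value in cache[attr]:
            return attr
    return None
```

A range such as 1..10 would then resolve its start value to one attribute via `attribute_for` and look up every further value in that same mapping only.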

kba · 2024-01-17T12:36:55Z

> > Should we make this (a) explicit (-g order:9..order:25) or (b) should this work automagically (-g 9..25).
>
> I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute?

(b) has been implemented: you can use 1..10, PHYS_0001..PHYS_0010 or page 1..page 10.

The attribute to use is determined once and then used consistently for all further matching.

Moreover, I've added a check to generate_range that raises a ValueError if the non-numeric parts of a range do not match, to avoid ranges like PHYS_0001..page 10 on a syntactical level.
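The syntactic check can be illustrated with a small stand-alone function. This is a sketch of the idea, not the actual generate_range from ocrd_utils: `expand_range` is a hypothetical name, and it assumes the numeric part is the trailing run of digits.

```python
import re
from typing import List


def expand_range(start: str, end: str) -> List[str]:
    """Expand e.g. 'PHYS_0001'..'PHYS_0003' into all values in between (inclusive)."""
    start_match = re.search(r"\d+$", start)
    end_match = re.search(r"\d+$", end)
    if not start_match or not end_match:
        raise ValueError(f"Range '{start}..{end}': no trailing numeric part")
    prefix = start[:start_match.start()]
    if prefix != end[:end_match.start()]:
        # e.g. 'PHYS_0001..page 10' mixes two different kinds of label
        raise ValueError(f"Range '{start}..{end}': non-numeric parts differ")
    width = len(start_match.group())  # keep the zero padding of the start value
    first, last = int(start_match.group()), int(end_match.group())
    return [f"{prefix}{n:0{width}d}" for n in range(first, last + 1)]
```

With this, `expand_range("page 1", "page 3")` gives the three labels in between, while `expand_range("PHYS_0001", "page 10")` raises a ValueError before any page is looked up.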

> I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.

This is feasible when caching is enabled but expensive without caching because we'd need to check every attribute against all the other attributes.

I could add a warning when this is the case during the cache fill, if that helps.
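For a single attribute, spotting non-unique values is one pass over the pages, sketched below. `duplicate_values` is a hypothetical helper operating on plain dicts; it only covers uniqueness within one attribute, not clashes between different attributes, and it says nothing about where such a check should live.

```python
from collections import Counter
from typing import Dict, List


def duplicate_values(pages: List[Dict[str, str]], attribute: str) -> List[str]:
    """Return the values of `attribute` that occur on more than one page."""
    counts = Counter(page[attribute] for page in pages if attribute in page)
    return sorted(value for value, n in counts.items() if n > 1)
```

A caller could raise or log when the returned list is non-empty for the attribute that a page range was matched against.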

> Regarding the potential confusion between facets (even when matching both start and end consistently):
>
> - ORDER with ID: does not introduce ambiguity, since the former is an int and the latter must not start with a digit
> - ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
> - ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure

As I said above, mixing attribute sources when matching against the patterns should be prevented on two levels. Programmatically, the attribute to match against is fixed before iterating over the pages (with caching) or on the first match (without caching). Syntactically, generate_range requires the non-numeric parts of start and end to match.

I think this is ready for merge unless you think I missed something.

bertsky · 2024-01-17T12:59:17Z

> > I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception.
>
> This is feasible when caching is enabled but expensive without caching because we'd need to check every attribute against all the other attributes.
>
> I could add a warning when this is the case during the cache fill, if that helps.

Oh, of course. No, during cache fill would only confuse users I'm afraid. (Esp. with METS server, the warning may be distant from the conflicting request.)

> > Regarding the potential confusion between facets (even when matching both start and end consistently):
> >
> > - ORDER with ID: does not introduce ambiguity, since the former is an int and the latter must not start with a digit
> > - ORDERLABEL with ID: very unlikely – but perhaps a warning could be emitted on the logger?
> > - ORDER with ORDERLABEL: same here; also, users can always switch to ID to be sure
>
> As I said above, mixing attribute sources when matching against the patterns should be prevented on two levels. Programmatically, the attribute to match against is fixed before iterating over the pages (with caching) or on the first match (without caching). Syntactically, generate_range requires the non-numeric parts of start and end to match.

That paragraph was not about mixing facets, but about confusion between match and user intent. (For example, the user meant ORDERLABEL, but ID matches, perhaps from another page.)

bertsky

better but not quite ready IMHO

ocrd_mets: add get_physical_pages(for_pageIds=...)

1e3e702

bertsky requested review from MehmedGIT and kba June 26, 2023 15:52

bertsky added 3 commits June 26, 2023 18:53

ocrd workspace list-page: --page-id option

07a9fe0

ocrd_mets: expose property physical_pages_labels

25854c5

ocrd workspace list-page: add --output-field, delegating to page labels

ccb51ce

MehmedGIT approved these changes Jun 29, 2023

get phys pages returns strs or divs

e181758

bertsky mentioned this pull request Jul 12, 2023

ocrd network: proper timeouts for processing #1074

Open

bertsky mentioned this pull request Dec 7, 2023

workspace list-page: show label #821

Closed

bertsky mentioned this pull request Dec 14, 2023

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

Open

merge master and adapt to page-range output changes

26b64c9

update list-page-workspace with @order

073d9b0

kba added 3 commits January 15, 2024 19:41

add typing info for caches in OcrdMets

e91cf50

more complete test workspace for page labelling/partitioning

c642d04

replace update-page with a cleaner solution based on get_physical_pages

9dea95f

kba added 2 commits January 16, 2024 14:53

OcrdMets: extend the _page_cache to include all METS_PAGE_DIV_ATTRIBUTEs

cfd1c91

implement generic page attribute ranges

ee8fb69

utils.generate_range: raise a ValueError if non-numeric parts differ

1427c07

fix tests

c36360d

revert accidental commit to ocrd_utils/pyproject.toml

3a60c1f

bertsky commented Jan 17, 2024

ocrd_models/ocrd_models/ocrd_mets.py: 4 review threads (outdated, resolved)

kba added 3 commits January 30, 2024 19:31

Merge branch 'master' into ocrd-mets-get-pages-for-pageids

643d1ef

# Conflicts: # src/ocrd_models/constants.py

get_physical_pages: return early if no patterns

517814b

OcrdMets.find_all_files: fix page attr loop

1225912

bertsky commented Feb 6, 2024

kba and others added 7 commits February 8, 2024 11:08

OcrdMets.get_physical_pages should return IDs if not return_divs

4a25d1e

Co-authored-by: Robert Sachunsky <[email protected]>

OcrdMets.get_physical_pages: Cache the attribute in the non-cached regex case

466c61d

OcrdMets.get_physical_pages: raise ValueError if a pattern matches no pages

9f84067

OcrdMets.get_physical_pages: iterate over pages, then patterns in non-cached branch

2647831

adapt tests to stricter page pattern matching

28a1f18

OcrdMets.get_physical_pages: raise ValueError if range start not matched

c6cfe03

Merge branch 'master' into ocrd-mets-get-pages-for-pageids

8e06532

# Conflicts: # src/ocrd_models/ocrd_mets.py # tests/test_mets_server.py

kba merged commit 89a5446 into OCR-D:master Feb 12, 2024
22 checks passed

kba mentioned this pull request Feb 19, 2024

Refactor network module - prepare it for easier sub-module testing #1191

Merged

bertsky mentioned this pull request Mar 1, 2024

Invalid structMap produced #1195

Closed

bertsky deleted the ocrd-mets-get-pages-for-pageids branch June 6, 2024 14:11
