ocrd_mets: add get_physical_pages(for_pageIds=...) #1063
Now also fixes #821 (but I'm afraid this is one of those issues where autolinking does not work somehow).
Looks good to me.
If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.
The hard part about integrating the added code into find_files is that find_files is cluttered with many different cases. Almost all parameters can be either a string or a pattern. IMO, we should split that method into smaller methods for the separate search cases.
Consider this method invocation: find_files(fileGrp='DEFAULT'). Although we just want to get all files that belong to that file group without other search parameters, we must still iterate over all file elements of the file group to check the other search parameters, even when they are not present. So the invocation above still has a linear execution time instead of a constant time where we could just return the cached list.
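To illustrate the point about a constant-time path, here is a minimal sketch of a dedicated fast path for the "file group only" case. It is not the OcrdMets implementation: the class FileIndexSketch, its methods, and the dict-based file records are hypothetical names chosen for this example.

```python
from collections import defaultdict


class FileIndexSketch:
    """Hypothetical stand-in for a cached file index (not the OcrdMets API)."""

    def __init__(self, files):
        # files: iterable of dicts with 'ID', 'fileGrp' and 'mimetype' keys
        self._by_grp = defaultdict(list)
        for f in files:
            self._by_grp[f['fileGrp']].append(f)

    def files_in_group(self, file_grp):
        # Fast path: a single dict lookup, no per-file checks.
        # The cached list itself is returned, so callers must not mutate it.
        return self._by_grp.get(file_grp, [])

    def find_files(self, file_grp=None, mimetype=None):
        # Dispatch to the fast path when no other criterion is given.
        if file_grp is not None and mimetype is None:
            return self.files_in_group(file_grp)
        # General path: filter candidates one by one (linear time).
        if file_grp is not None:
            candidates = self._by_grp.get(file_grp, [])
        else:
            candidates = [f for grp in self._by_grp.values() for f in grp]
        return [f for f in candidates
                if mimetype is None or f['mimetype'] == mimetype]


index = FileIndexSketch([
    {'ID': 'a', 'fileGrp': 'DEFAULT', 'mimetype': 'image/png'},
    {'ID': 'b', 'fileGrp': 'DEFAULT', 'mimetype': 'image/tiff'},
    {'ID': 'c', 'fileGrp': 'OCR', 'mimetype': 'text/xml'},
])
default_files = index.find_files(file_grp='DEFAULT')
```

With more criteria, each additional fast path needs its own dispatch condition, which is the combinatorial concern raised in the reply.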
Yes, but here we are just concerned with the pageId selection, which is in a separate section before the differentiation into other selectors. So it should be doable (but the result of the loops in
Ok, so you want to factor the selector conditionals out of the loop's body? But since we already have 5 individual criteria, due to combinatorial explosion you would need lots of conditionals at the top level.
I understand what you mean, probably we should add an additional boolean parameter to decide whether to return a list of divs or a list of strings from

    pageId_list = []
    if pageId:
        # return_divs=True returns the divs instead of the ID strings
        physical_pages = self.get_physical_pages(for_pageIds=pageId, return_divs=True)
        for div in physical_pages:
            if self._cache_flag:
                pageId_list += self._fptr_cache[div.get('ID')]
            else:
                pageId_list += [fptr.get('FILEID') for fptr in div.findall('mets:fptr', NS)]

Btw, I just realized some loops with caching are inefficiently implemented. The outer loop should be looping over the
@bertsky, the available tests run just fine. Let me know if you face errors somewhere.
@MehmedGIT there are some strange new indentations in e181758 but otherwise great, that's exactly what I was looking for – thanks!
The new indentations are intended. They are rather fixes to weird previous indentations and spaces on empty lines left from cache implementation. I should have created a separate commit to make this more obvious.
I have now merged master into the PR and changed the logic for For example:
It's not as efficient as it was before the merge but I wanted to keep it simple so we can discuss in the call. Still WIP, need to adapt for
Ideally also label support in the page range selector.
Now working on implementing page range over labels. It's a bit more complicated because the Should we make this (a) explicit ( (a) is more predictable and slightly more efficient but more complicated to implement. |
fantastic!
oh, right! I forgot about the caching.
I am in favour of b. But indeed, deviating facets for begin and end are a cause for confusion. But perhaps the implementation could be made so that begin and end must always match against the same attribute? I would also catch the case where ORDER (or ORDERLABEL) is not unique within the document, perhaps raising an exception. Regarding the potential confusion between facets (even when matching both start and end consistently):
OK, I have implemented the range behavior over any of the page mets:div attributes, i.e. you can now do
It was a fairly convoluted process to make this generic solution, so additional eyes on the changes would be much appreciated, @bertsky @MehmedGIT. I have to pause now because my brain hurts with all the
b) has been implemented, you can do The attribute to use is determined once and then used consistently for matching the attributes. Moreover, I've added a check to
This is feasible when caching is enabled but expensive without caching because we'd need to check every attribute against all the other attributes. I could add a warning when this is the case during the cache fill, if that helps.
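A minimal sketch of why the uniqueness check is cheap once an index is being built anyway: a single pass with one dict per attribute detects duplicates as they are inserted. The function build_page_index and the dict-based page records are hypothetical and not taken from ocrd_mets; raising ValueError is only one of the options discussed in this thread.

```python
def build_page_index(pages, attributes=('ID', 'ORDER', 'ORDERLABEL')):
    """Map each attribute name to {value: page ID}; raise on duplicate values.

    pages: iterable of dicts standing in for the attributes of page divs.
    """
    index = {attr: {} for attr in attributes}
    for page in pages:
        for attr in attributes:
            value = page.get(attr)
            if value is None:
                continue  # attribute not set on this page
            if value in index[attr]:
                raise ValueError(
                    f"{attr}={value!r} is not unique: used by pages "
                    f"{index[attr][value]!r} and {page['ID']!r}")
            index[attr][value] = page['ID']
    return index


pages = [
    {'ID': 'PHYS_0001', 'ORDER': '1', 'ORDERLABEL': 'i'},
    {'ID': 'PHYS_0002', 'ORDER': '2', 'ORDERLABEL': 'ii'},
]
page_index = build_page_index(pages)
```

Without such an index, the same check means comparing every page's attribute against every other page's, which is the cost mentioned above.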
As I said above, mixing attribute sources for matching against the patterns should be prevented on two levels: programmatically, by determining the attribute to check before iterating over the pages (with caching) or on the first match (without caching); and syntactically, by generate_range requiring matching non-numeric parts. I think this is ready for merge unless you think I missed something.
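The syntactic requirement can be sketched as follows. This is an illustration only, not the generate_range implementation: expand_range_sketch is a hypothetical function, and it assumes that both range ends consist of an arbitrary prefix followed by a trailing number.

```python
import re


def expand_range_sketch(start, end):
    """Expand 'PREFIX<n>'..'PREFIX<m>' into all values in between (inclusive)."""
    m_start = re.fullmatch(r'(.*?)(\d+)', start)
    m_end = re.fullmatch(r'(.*?)(\d+)', end)
    if not m_start or not m_end:
        raise ValueError(f"Range '{start}..{end}': no trailing numeric part")
    prefix, start_num = m_start.groups()
    end_prefix, end_num = m_end.groups()
    # Both ends must share the same non-numeric part, otherwise they
    # presumably refer to different kinds of values (e.g. ID vs. label).
    if prefix != end_prefix:
        raise ValueError(f"Range '{start}..{end}': non-numeric parts differ")
    first, last = int(start_num), int(end_num)
    if first > last:
        raise ValueError(f"Range '{start}..{end}': start is after end")
    width = len(start_num)  # keep the zero padding of the start value
    return [f"{prefix}{i:0{width}d}" for i in range(first, last + 1)]


expanded = expand_range_sketch('PHYS_0008', 'PHYS_0011')
```

A range such as 'PHYS_0001'..'page_0003' is rejected before any page is looked at, so mismatched ends fail early rather than silently matching different attributes.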
Oh, of course. No, during cache fill would only confuse users I'm afraid. (Esp. with METS server, the warning may be distant from the conflicting request.)
That paragraph was not about mixing facets, but about confusion between match and user intent. (For example, the user meant ORDERLABEL, but ID matches, perhaps from another page.)
# Conflicts: # src/ocrd_models/constants.py
better but not quite ready IMHO
Co-authored-by: Robert Sachunsky <[email protected]>
# Conflicts: # src/ocrd_models/ocrd_mets.py # tests/test_mets_server.py
This could help with our per-page limit * num-pages heuristic.
If you have any idea how to directly reuse the added code in the existing find_files(pageId=...) selector (so there will be no duplication), let me know.
Now also exposed via CLI.
EDIT: also covers access to all mets:div labels now.