port processor to core v3 #130
# Conflicts: # qurator/eynollah/processor.py
# Conflicts: # qurator/eynollah/processor.py
# Conflicts: # setup.py
# Conflicts: # qurator/eynollah/processor.py
Thanks – LGTM!
Have not tested yet, though.
Current main also looks very promising – will give it a try myself.
qurator/eynollah/processor.py
Outdated
image_filename=page.imageFilename,
image_pil=page_image
Note: that filename might not be where that image came from in workspace.image_from_page. It could well be a derived image generated by some previous processor (just not a cropped, deskewed or binarized image, because that would have changed its coordinate system).
It's still a bit hazy for me when image_filename is actually used. Ideally, image_pil should take precedence and image_filename is only for the plotter/writer, at least in the "single image mode" we're using.
This is one of the aspects I hope I'll be able to improve a bit with https://github.com/qurator-spk/eynollah/tree/refactoring-2024-08/
Perhaps we can also re-use session across Eynollah invocations, in addition to models?
In theory, yes, but with standalone eynollah being focused on batch processing now, I am honestly not sure how/where sessions are defined for the non-dir_in option - @vahidrezanezhad can you tell us?
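For illustration only, a minimal sketch of what "re-use across invocations" could look like. The `ModelCache` class and the `fake_loader` stand-in are hypothetical names, not part of Eynollah, and whether sessions can be shared the same way is exactly the open question above.

```python
from typing import Any, Callable, Dict


class ModelCache:
    """Hypothetical helper: load each model once, then hand out the same object."""

    def __init__(self, load_model: Callable[[str], Any]) -> None:
        self._load_model = load_model      # assumed loader wrapping the framework's load call
        self._models: Dict[str, Any] = {}  # model path -> loaded model

    def get(self, model_path: str) -> Any:
        # Load lazily on first request, re-use on every later request.
        if model_path not in self._models:
            self._models[model_path] = self._load_model(model_path)
        return self._models[model_path]


calls = []


def fake_loader(path: str) -> dict:
    # Stand-in for an expensive model load; records every real load.
    calls.append(path)
    return {"path": path}


cache = ModelCache(fake_loader)
first = cache.get("models/region")   # triggers the load
second = cache.get("models/region")  # served from the cache
```

A processor instance could hold one such cache for its lifetime, so that per-page calls only pay the load cost once.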
qurator/eynollah/processor.py
Outdated
# if not('://' in page.imageFilename):
#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
#     # could be a URL with file:// or truly remote
#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
# if not('://' in page.imageFilename):
#     image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
#     # could be a URL with file:// or truly remote
#     image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
This whole effort was to ensure we can pass a working local filename, as (was) needed by Eynollah. The approach by OCR-D is Workspace.image_from_page / Workspace.image_from_segment, which will search for the right original or derived image, download it if necessary and load it into memory.
I don't recall what the new behaviour of Eynollah is. If both an image filename and an image object are passed, who wins?
Assuming it's the memory object: this can be removed. (But then I wonder why we still pass the image filename at all...)
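As a rough sketch of the OCR-D approach described here: the helper name `page_image_for_segmentation` and the `FakeWorkspace` stand-in are made up for this example, and the feature filter is an assumption taken from the note above about cropped, deskewed or binarized images. With OCR-D core, `workspace` would be a real `Workspace`.

```python
def page_image_for_segmentation(workspace, page, page_id):
    """Fetch the page image via the workspace instead of resolving a filename by hand."""
    # image_from_page returns (image, coordinates, image info); only the image is used here.
    page_image, _coords, _info = workspace.image_from_page(
        page,
        page_id,
        # Assumption: skip derived images that no longer match the original coordinate system.
        feature_filter="cropped,deskewed,binarized",
    )
    return page_image


class FakeWorkspace:
    """Stand-in so the sketch runs without OCR-D installed; not the real Workspace."""

    def image_from_page(self, page, page_id, feature_filter=""):
        self.seen_filter = feature_filter
        return ("image-object", {"features": ""}, None)


workspace = FakeWorkspace()
image = page_image_for_segmentation(workspace, page=object(), page_id="PHYS_0001")
```

With the image loaded this way, the commented-out filename lookup has nothing left to do.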
Well, currently we have
    if image_pil:
        self._imgs = self._cache_images(image_pil=image_pil)
    else:
        self._imgs = self._cache_images(image_filename=image_filename)
    [...]
    def _cache_images(self, image_filename=None, image_pil=None):
        ret = {}
        if image_filename:
            ret['img'] = cv2.imread(image_filename)
            self.dpi = check_dpi(image_filename)
        else:
            ret['img'] = pil2cv(image_pil)
            self.dpi = check_dpi(image_pil)
image_filename is (should) then only used passively, to generate filenames of plotted debug images as well as for PAGE serialization.
So I think image_pil should win but for now we need both. But as I said above, one of those things I would love to untangle in the refactoring.
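To make the intended precedence explicit, here is a small sketch. `pick_image_source` is a hypothetical helper, not Eynollah code. In the snippet quoted above, `_cache_images` itself checks `image_filename` first, so "image_pil wins" currently holds only because the caller passes one argument at a time.

```python
def pick_image_source(image_filename=None, image_pil=None):
    """Return ('pil', image) or ('file', path); the in-memory image wins if both are given."""
    if image_pil is not None:
        return ("pil", image_pil)
    if image_filename:
        return ("file", image_filename)
    raise ValueError("either image_pil or image_filename is required")


# Both given: the in-memory object is used, the filename stays available for naming outputs.
both = pick_image_source(image_filename="page.tif", image_pil="PIL-image-placeholder")
# Only a filename given: fall back to reading from disk.
only_file = pick_image_source(image_filename="page.tif")
```

Putting the check in one place would let the loader accept both arguments without the result depending on the order of the branches.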
Co-authored-by: Robert Sachunsky <[email protected]>
Co-authored-by: Robert Sachunsky <[email protected]>
OCR-D v3 API: fixes
BTW, I just tested under (METS Server and) I'm not sure if this warrants adding |
# Conflicts: # pyproject.toml # src/eynollah/cli.py
With this PR, eynollah supports OCR-D/core#1240. It simplifies it a lot too.
I'll update the ocrd-tool.json with the changed/added flags here as well.
Draft, please don't merge until v3 stable is released.