Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port processor to core v3 #130

Draft
wants to merge 25 commits into
base: main
Choose a base branch
from
Draft

port processor to core v3 #130

wants to merge 25 commits into from

Conversation

kba
Copy link
Contributor

@kba kba commented Aug 23, 2024

With this PR, eynollah supports OCR-D/core#1240. It simplifies it a lot too.

I'll update the ocrd-tool.json with the changed/added flags here as well.

Draft, please don't merge until v3 stable is released

Copy link
Contributor

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks – LGTM!

Have not tested yet, though.

Current main also looks very promising – will give it a try myself

qurator/eynollah/processor.py Outdated Show resolved Hide resolved
Comment on lines 53 to 54
image_filename=page.imageFilename,
image_pil=page_image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: that filename might not be where that image came from in workspace.image_from_page. It could well be a derived image generated by some previous processor (just not a cropped, deskewed or binarized image, because that would have changed its coordinate system).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still a bit hazy for me when image_filename is actually used. Ideally, image_pil should take preference and image_filename is only for the plotter/writer, at least in the "single image mode" we're using.

One of the aspects I hope I'll be able to improve a bit with https://github.com/qurator-spk/eynollah/tree/refactoring-2024-08/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can also re-use session across Eynollah invokations in addition to models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, yes, but with standalone eynollah being focused on batch processing now, I am honestly not sure how/where sessions are defined for the non-dir_in option - @vahidrezanezhad can you tell us?

qurator/eynollah/processor.py Outdated Show resolved Hide resolved
Comment on lines 26 to 30
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# if not('://' in page.imageFilename):
# image_filename = next(self.workspace.mets.find_files(local_filename=page.imageFilename)).local_filename
# else:
# # could be a URL with file:// or truly remote
# image_filename = self.workspace.download_file(next(self.workspace.mets.find_files(url=page.imageFilename))).local_filename

This whole effort was to ensure we can pass a working local filename, as (was) needed by Eynollah. The approach by OCR-D is Workspace.image_from_page / Workspace.image_from_segment which will search for the right original or derived image, download it if necessary and load it into memory.

I don't recall what the new behaviour of Eynollah is. If both an image filename and an image object are passed, who wins?

Assuming it's the memory object: this can be removed. (But then I wonder why we still pass the image filename at all...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, currently we have

            if image_pil:
                self._imgs = self._cache_images(image_pil=image_pil)
            else:
                self._imgs = self._cache_images(image_filename=image_filename)
[...]
     def _cache_images(self, image_filename=None, image_pil=None):
         ret = {}
         if image_filename:
             ret['img'] = cv2.imread(image_filename)
             self.dpi = check_dpi(image_filename)
         else:
             ret['img'] = pil2cv(image_pil)
             self.dpi = check_dpi(image_pil)

image_filename is (should) then only used passively, to generate filenames of plotted debug images as well as for PAGE serialization.

So I think image_pil should win but for now we need both. But as I said above, one of those things I would love to untangle in the refactoring.

@bertsky
Copy link
Contributor

bertsky commented Sep 2, 2024

BTW, I just tested under (METS Server and) OCRD_MAX_PARALLEL_PAGES=2 – it works, but you need lots of GPU memory, otherwise GPU OOM happens. (It does work with CUDA_VISIBLE_DEVICES=, but of course the CPU utilization grows, so that might stall the system.)

I'm not sure if this warrants adding max_workers = 1 to EynollahProcessor ...

# Conflicts:
#	pyproject.toml
#	src/eynollah/cli.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants