Memory usage explosion with very narrow images (e.g. book spine) #67
Comments
While eynollah should handle this gracefully, we should also consider how to handle irrelevant images that are already marked as such in the METS
(Full document: PPN894261851.zip) |
Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D. For the concrete set of supported page types, we should stick to DFG Strukturdatenset, which is strangely missing This set is also partially supported by {'annotation': 0, 'binding': 1, 'chapter': 2, 'colour_checker': 3, 'contained_work': 4, 'contents': 5, 'cover': 6, 'edge': 7, 'endsheet': 8, 'epicedia': 9, 'illustration': 10, 'index': 11, 'musical_notation': 12, 'page': 13, 'paste_down': 14, 'preface': 15, 'provenance': 16, 'section': 17, 'sermon': 18, 'table': 19, 'title_page': 20} For the general mechanism, I suggest something along the lines of our |
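A minimal sketch of the skipping idea discussed above, assuming each page's structural type has already been read from the logical structmap into a plain mapping. The function name `select_pages`, the default skip set, and the `PHYS_…` page IDs are hypothetical and only for illustration; none of this is an existing OCR-D or eynollah API. The label set is copied from the list quoted in the comment.

```python
# Page-type labels as quoted in the comment above (label -> class index).
PAGE_TYPES = {
    'annotation': 0, 'binding': 1, 'chapter': 2, 'colour_checker': 3,
    'contained_work': 4, 'contents': 5, 'cover': 6, 'edge': 7,
    'endsheet': 8, 'epicedia': 9, 'illustration': 10, 'index': 11,
    'musical_notation': 12, 'page': 13, 'paste_down': 14, 'preface': 15,
    'provenance': 16, 'section': 17, 'sermon': 18, 'table': 19,
    'title_page': 20,
}

# Hypothetical default: types a workflow might treat as irrelevant for layout analysis.
DEFAULT_SKIP = frozenset({'binding', 'colour_checker', 'cover', 'edge'})


def select_pages(page_types, skip=DEFAULT_SKIP):
    """Split page IDs into (kept, skipped) according to their structural type.

    page_types maps a page ID to its type label, or to None if the page
    carries no type; untyped pages are always kept.
    """
    unknown = set(skip) - set(PAGE_TYPES)
    if unknown:
        raise ValueError(f"unsupported page types: {sorted(unknown)}")
    kept, skipped = [], []
    for page_id, label in page_types.items():
        (skipped if label in skip else kept).append(page_id)
    return kept, skipped


if __name__ == "__main__":
    # Assumed input: page ID -> type, e.g. as extracted from the METS beforehand.
    pages = {"PHYS_0001": "cover", "PHYS_0002": "page", "PHYS_0003": "edge"}
    kept, skipped = select_pages(pages)
    print("process:", kept)
    print("skip:", skipped)
```

Returning the skipped IDs alongside the kept ones leaves the open question from the thread (what happens with skipped pages in the output) to the caller.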
100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?) |
Yes, we should elevate this to OCR-D/spec.
There is already some discussion on skip strategies for API changes in spec... |
With the current version including #67 I was able to
Is there anything relevant from here that is still needed for OCR-D/spec#172 (comment) or can we close this? |
I wouldn't know, the current version is not working for OCR-D and so I can't reproduce until it's fixed. (Yes, there is an elaborate workaround but I am not willing to invest the time to reproduce with a lengthy changeset (#86) missing.) |
With this document (PPN894261851.zip) we experienced an OOM error. Further investigation revealed this memory usage (measured using procpath):
The culprit seems to be this "page" from the document - an image of a book spine:
Relevant parts from the log output:
This log output is not from the OOM, but from another run I did on a different machine to investigate the problem. If I interpret the cont_page part correctly, the image is blown up to [4404, 27685], which would certainly explain the OOM error on the other machine.
Reproduce with:
ocrd-eynollah-segment -I MAX -O TEST-SEGMENT -P models /path/to/models
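To put the [4404, 27685] size in perspective, here is a rough back-of-the-envelope sketch. It only computes the raw size of a single array of that shape and shows a simple pixel-budget check that could reject such an image before processing. The helper names and the 50-megapixel budget are assumptions for illustration, not eynollah code; the real footprint depends on the dtype used and on how many intermediate copies are held.

```python
def array_bytes(height, width, channels=3, bytes_per_value=1):
    """Raw size in bytes of one dense image array of the given shape."""
    return height * width * channels * bytes_per_value


def exceeds_pixel_budget(height, width, max_pixels=50_000_000):
    """True if the image has more pixels than the (hypothetical) budget."""
    return height * width > max_pixels


if __name__ == "__main__":
    h, w = 4404, 27685  # dimensions reported in the issue text
    print("pixels:", h * w)
    # 8-bit RGB versus 32-bit float RGB, one copy each.
    for label, size in (("uint8", 1), ("float32", 4)):
        mib = array_bytes(h, w, bytes_per_value=size) / 2**20
        print(f"{label}: {mib:.0f} MiB per copy")
    print("over budget:", exceeds_pixel_budget(h, w))
```

A handful of float copies of an array this size would already reach several GiB, which is in line with the interpretation above.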