checksum and/or file size of models in .PAGE.xml #1183
Comments
You mean as in
<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
<mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.1-25-gcf23)</mets:name>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-BIN</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-BIN-OCR-TESS-frak2021</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"model": "frak2021", "dpi": 0, "padding": 0,
"segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": false, "overwrite_text": true,
"shrink_polygons": false, "block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false,
"raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "",
"tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model": false, "oem": "DEFAULT"}</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:cksum="1509050540 3421140"/>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
</mets:agent>
@jbarth-ubhd, or did you mean the PAGE XML? In METS, we could also use some information on processing dates, e.g. |
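To make the proposed note concrete, here is a minimal sketch of how a `mets:note` carrying an `ocrd:cksum` attribute could be attached to an agent element with Python's standard library. The `ocrd:cksum` attribute is only the proposal from this thread, not an existing feature; the helper name `add_cksum_note` and the "checksum, space, size" value layout are assumptions modeled on the example above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
OCRD_NS = "https://ocr-d.de"  # namespace URI as used in the notes above

# Keep readable prefixes when serializing.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("ocrd", OCRD_NS)


def add_cksum_note(agent, checksum, size):
    """Append a mets:note with an ocrd:cksum attribute (hypothetical helper)."""
    note = ET.SubElement(agent, f"{{{METS_NS}}}note")
    note.set(f"{{{OCRD_NS}}}cksum", f"{checksum} {size}")
    return note


# Build a stripped-down agent element for illustration only.
agent = ET.Element(f"{{{METS_NS}}}agent", {"TYPE": "OTHER", "OTHERTYPE": "SOFTWARE"})
name = ET.SubElement(agent, f"{{{METS_NS}}}name")
name.text = "ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.1-25-gcf23)"
add_cksum_note(agent, "1509050540", 3421140)

xml_text = ET.tostring(agent, encoding="unicode")
print(xml_text)
```

A real implementation would presumably live in OCR-D core next to the code that already writes the `ocrd:option` notes; this sketch only shows the shape of the element.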
That's a great idea! Incidentally, we're in the process of dealing with the reality of mass OCR, i.e. what to throw away to keep the amount of data manageable while still retaining as much reproducibility information as possible. This would help.
The tricky part is how and what to hash. A simple solution would be to assume the checksum is related to the raw data that
A helpful side effect would be that we notice when models are updated at the same URL (e.g. the messy situation with eynollah currently). |
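As a rough illustration of hashing the raw bytes of a model file, the sketch below computes a digest and the file size in one pass. The choice of SHA-256 and the function name `model_fingerprint` are assumptions; the thread does not settle on an algorithm or on what exactly should be hashed.

```python
import hashlib
from pathlib import Path


def model_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) of the file at `path`.

    Hypothetical helper: reads the file in chunks so that large models
    do not have to fit into memory at once.
    """
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size
```

Comparing a stored fingerprint against a freshly computed one would be enough to notice that a model behind the same URL has changed.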
Indeed, I will need to improve the page-to-alto conversion soon-ish and find a better solution for dates (I haven't forgotten about the feedback on OCR-D/page-to-alto#37, btw) and other metadata. If we had more granular and easier-to-interpret date info, that would help a lot. |
@bertsky: I don't have cksum in mets.xml (nor in OCR-D-OCR_00001.xml); I installed the ocrd/all Docker image a few weeks ago:
<mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="OTHER" OTHERROLE="layout/segmentation/region">
<mets:name>ocrd-tesserocr-recognize v0.17.0 (tesseract 5.3.3)</mets:name>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="input-file-grp">OCR-D-005</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="output-file-grp">OCR-D-OCR</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="parameter">{"textequiv_level":
"word", "segmentation_level": "region", "overwrite_segments": true, "model": "frak2021",
"dpi": 0, "padding": 0, "overwrite_text": true, "shrink_polygons": false,
"block_polygons": false, "find_tables": true, "find_staves": false, "sparse_text": false,
"raw_lines": false, "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "",
"tesseract_parameters": {}, "xpath_parameters": {}, "xpath_model": {}, "auto_model":
false, "oem": "DEFAULT"}</mets:note>
<mets:note xmlns:ocrd="https://ocr-d.de" ocrd:option="page-id"/>
</mets:agent>
|
I don't understand – wouldn't that be the repository side (ocrd-tool.json), rather than the user side (resources.yml)? We could certainly have resmgr store that information, but what about manual ( I would rather like the processor to look at the file exactly when it is used, i.e. during |
Isn't that a separate issue though? In The METS side is independent, though. |
Yeah, that's the more robust and elegant solution 👍
Yeah, sry, it's late. We had a call on that subject (getting OCR and metadata into digital library) today, so it came to mind.
@jbarth-ubhd This was just a proposal by @bertsky for how it could eventually look, not the current situation. We'll still need to implement |
For reproducibility, it would be nice to have the checksum and/or file size of the models used recorded in the XML.