empty OCR #412

Closed
jbarth-ubhd opened this issue Mar 1, 2024 · 13 comments

Comments

@jbarth-ubhd

with this workflow

singocrd ocrd workspace init
singocrd ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff OCR-D-IMG/00001.tif
singocrd ocrd-sbb-binarize -P model default-2021-03-09 -I OCR-D-IMG -O OCR-D-001
singocrd ocrd-anybaseocr-crop -I OCR-D-001 -O OCR-D-002
singocrd ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-002 -O OCR-D-003
singocrd ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-003 -O OCR-D-004
singocrd ocrd-tesserocr-segment -P find_tables true -P shrink_polygons true -I OCR-D-004 -O OCR-D-005
singocrd ocrd-calamari-recognize -P checkpoint_dir $HOME/ocrd_models/ocrd-calamari-recognize/qurator-gt4histocr-1.0 -I OCR-D-005 -O OCR-D-OCR

there is no text in OCR-D-OCR*.xml

All files (see run.sh for workflow and ocrd.log for log):

https://digi.ub.uni-heidelberg.de/diglitData/v/christliche_kunstblaetter1862--08--empty-ocr.zip

@bertsky
Collaborator

bertsky commented Mar 1, 2024

@jbarth-ubhd
Author

Wrong permissions after scp?! ... Please try again.

@jbarth-ubhd
Author

jbarth-ubhd commented Mar 1, 2024

OCR-D-OCR...xml is missing from the zip archive, therefore I am posting it here:

<?xml version="1.0" encoding="UTF-8"?>
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd" pcGtsId="OCR-D-OCR_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN">
    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.63.0</pc:Creator>
        <pc:Created>2024-03-01T12:48:46.945457</pc:Created>
        <pc:LastChange>2024-03-01T12:48:46.945457</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/home/hd/hd_hd/hd_wu120/ocrd_models/ocrd-calamari-recognize/qurator-gt4histocr-1.0" type="checkpoint_dir"/>
                <pc:Label value="confidence_voter_default_ctc" type="voter"/>
                <pc:Label value="line" type="textequiv_level"/>
                <pc:Label value="0.001" type="glyph_conf_cutoff"/>
            </pc:Labels>
            <pc:Labels externalModel="ocrd-tool" externalId="version">
                <pc:Label value="1.0.6 (calamari 1.0.6, tensorflow 2.13.1)" type="ocrd-calamari-recognize"/>
                <pc:Label value="2.63.0" type="ocrd/core"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
    <pc:Page imageFilename="OCR-D-005/OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN.png" imageWidth="2229" imageHeight="2942"/>
</pc:PcGts>
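
For anyone who wants to check quickly whether a PAGE file is empty like the one above, here is a minimal sketch using only the Python standard library. The function name `count_page_content` is made up for illustration and is not part of OCR-D; the namespace is the 2019-07-15 one from the file above.

```python
import xml.etree.ElementTree as ET

# PAGE namespace as declared in the file posted above
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def count_page_content(page_xml: str) -> dict:
    """Count text regions, text lines and non-empty text results in a PAGE-XML string.

    Hypothetical helper for illustration only, not an OCR-D API.
    """
    # Encode to bytes so an XML declaration with an encoding is handled safely.
    root = ET.fromstring(page_xml.encode("utf-8"))
    ns = {"pc": PAGE_NS}
    page = root.find("pc:Page", ns)
    if page is None:
        raise ValueError("no pc:Page element found")
    regions = page.findall(".//pc:TextRegion", ns)
    lines = page.findall(".//pc:TextLine", ns)
    texts = [
        u.text
        for u in page.findall(".//pc:TextEquiv/pc:Unicode", ns)
        if u.text and u.text.strip()
    ]
    return {"regions": len(regions), "lines": len(lines), "texts": len(texts)}


if __name__ == "__main__":
    # Same shape as the file above: a pc:Page element without any children.
    empty_page = (
        f'<pc:PcGts xmlns:pc="{PAGE_NS}">'
        '<pc:Page imageFilename="example.png" imageWidth="2229" imageHeight="2942"/>'
        "</pc:PcGts>"
    )
    print(count_page_content(empty_page))
```

A page with zero regions (as here) means the recognizer had nothing to work on, so the problem lies upstream of the OCR step.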

@bertsky
Collaborator

bertsky commented Mar 1, 2024

<pc:Page imageFilename="OCR-D-005/OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN.png" imageWidth="2229" imageHeight="2942"/>

That says it all. We are chasing the same bug (regression) that haunts us everywhere now, see OCR-D/ocrd_tesserocr#201. (Last I checked, I could not reproduce though.)

@mikegerber
Contributor

mikegerber commented Mar 1, 2024

This has the same invalid physical structMap we saw elsewhere:

  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-IMG_00001"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-001_00001.IMG-BIN"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-002_00001.IMG-BIN.IMG-CROP"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-003_00001.IMG-BIN.IMG-CROP-BIN_wolf"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-003_00001.IMG-BIN.IMG-CROP"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-004_00001.IMG-BIN.IMG-CROP.IMG-DESKEW"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN"/>
      </mets:div>
      <mets:div TYPE="page" ID="P_00001">
        <mets:fptr FILEID="OCR-D-OCR_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN"/>
      </mets:div>
    </mets:div>
  </mets:structMap>

@jbarth-ubhd Did you produce this with an ocrd/all Docker image?
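
The symptom in the structMap above is that the same page ID appears on several mets:div elements, whereas XML IDs must be unique and a page's files normally all hang off one mets:div. A rough way to detect this could look like the following sketch. The helper name `duplicate_page_ids` is hypothetical, and the code assumes the standard METS namespace `http://www.loc.gov/METS/`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard METS namespace (assumed to be the one bound to the mets: prefix)
METS_NS = "http://www.loc.gov/METS/"


def duplicate_page_ids(mets_xml: str) -> list:
    """Return the page IDs that occur on more than one mets:div
    inside a structMap with TYPE="PHYSICAL".

    Hypothetical helper for illustration only, not an OCR-D API.
    """
    root = ET.fromstring(mets_xml.encode("utf-8"))
    ids = [
        div.get("ID")
        for smap in root.iter(f"{{{METS_NS}}}structMap")
        if smap.get("TYPE") == "PHYSICAL"
        for div in smap.iter(f"{{{METS_NS}}}div")
        if div.get("TYPE") == "page"
    ]
    counts = Counter(ids)
    return sorted(page_id for page_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Shortened version of the broken structMap shown above.
    broken = f"""
    <mets:mets xmlns:mets="{METS_NS}">
      <mets:structMap TYPE="PHYSICAL">
        <mets:div TYPE="physSequence">
          <mets:div TYPE="page" ID="P_00001">
            <mets:fptr FILEID="OCR-D-IMG_00001"/>
          </mets:div>
          <mets:div TYPE="page" ID="P_00001">
            <mets:fptr FILEID="OCR-D-001_00001.IMG-BIN"/>
          </mets:div>
        </mets:div>
      </mets:structMap>
    </mets:mets>
    """
    print(duplicate_page_ids(broken))
```

An empty result means every page ID is used on exactly one mets:div; any ID that is returned points at the kind of breakage shown above.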

@jbarth-ubhd
Author

Yes, with this ocrd.sif built from the Docker image ocrd/all maximum: 8687316992 2024-02-21 15:30:33 +0100 ocrd.sif

@mikegerber
Contributor

I was using roughly the same version, I think. I have no experience with Singularity, but I was using the maximum image from a few days ago.

@mikegerber
Contributor

<pc:Page imageFilename="OCR-D-005/OCR-D-005_00001.IMG-BIN.IMG-CROP.IMG-DESKEW.IMG-BIN.png" imageWidth="2229" imageHeight="2942"/>

That says it all. We are chasing the same bug (regression) that haunts us everywhere now, see OCR-D/ocrd_tesserocr#201. (Last I checked, I could not reproduce though.)

@bertsky Just out of curiosity: What is wrong with that part of the XML?

@bertsky
Collaborator

bertsky commented Mar 1, 2024

@mikegerber

Just out of curiosity: What is wrong with that part of the XML?

The imageFilename, which should reference the original image, points to the derived image (from deskewing) instead. It's essentially what happens if the METS is broken in the way your snippet shows.

I can reproduce this now – even without workspace add.

@bertsky
Collaborator

bertsky commented Mar 1, 2024

I can now say that it's a caching issue. If I run with OCRD_METS_CACHING=0, then the problem disappears.

The default in the Docker builds is now OCRD_METS_CACHING=1:

ENV OCRD_METS_CACHING=1
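
As a workaround sketch for scripted runs: the variable can be overridden per process instead of changing the image. The helper name `env_without_mets_caching` below is made up for illustration. How the variable reaches the processor in a Singularity or Docker setup depends on the local wrapper, so treat that part as an assumption.

```python
import os
import subprocess


def env_without_mets_caching(base_env=None) -> dict:
    """Return a copy of the environment with OCRD_METS_CACHING forced to "0".

    Hypothetical helper for illustration only, not an OCR-D API.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCRD_METS_CACHING"] = "0"
    return env


def run_without_mets_caching(args):
    """Run one processor call with METS caching disabled.

    `args` is the command line as a list; the call fails loudly
    (CalledProcessError) if the processor exits with a non-zero status.
    """
    return subprocess.run(args, env=env_without_mets_caching(), check=True)


# Example call (assumes the processor is on PATH in the current environment):
# run_without_mets_caching(
#     ["ocrd-cis-ocropy-deskew", "-P", "level-of-operation", "page",
#      "-I", "OCR-D-003", "-O", "OCR-D-004"]
# )
```

Copying the environment instead of mutating `os.environ` keeps the override local to the processor calls.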

@jbarth-ubhd
Author

I added OCRD_METS_CACHING=0 to my Singularity ocrd.env, and it helps.

@mikegerber
Contributor

This is OCR-D/core#1195

@bertsky
Collaborator

bertsky commented Mar 8, 2024

It required a new core v2.63.3 to appear on PyPI, then a rebuild of ocrd/core and then of ocrd/all:* before this was actually fixed.
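
To see whether a given environment already has the fixed core, a version comparison along these lines may help. This is only a sketch: it assumes that v2.63.3 is the first release containing the fix (as stated above) and that the core distribution is installed under the name `ocrd`. The function names are made up for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption taken from the comment above: first core release with the fix
FIXED_IN = (2, 63, 3)


def parse_version(text: str) -> tuple:
    """Turn a string like "2.63.3" or "v2.63.3" into a tuple of three ints."""
    numbers = []
    for piece in text.strip().lstrip("v").split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)  # pad short versions such as "3.0"
    return tuple(numbers)


def has_caching_fix(core_version: str) -> bool:
    """True if the given core version is at least FIXED_IN."""
    return parse_version(core_version) >= FIXED_IN


def installed_core_version():
    """Return the installed version of the `ocrd` distribution, or None."""
    try:
        return version("ocrd")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_core_version()
    if found is None:
        print("ocrd is not installed in this environment")
    else:
        print(found, "fixed" if has_caching_fix(found) else "affected, consider OCRD_METS_CACHING=0")
```

Note that the PAGE file posted earlier in this thread reports OCR-D/core 2.63.0, which this check would flag as affected.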

@bertsky closed this as completed Mar 8, 2024