
Comparing changes

base repository: OCR-D/ocrd_calamari
base: v0.0.5
head repository: OCR-D/ocrd_calamari
compare: master

Commits on Dec 5, 2019

  1. d8db405
  2. f20eb3b
  3. 377466a

Commits on Feb 12, 2020

  1. 0c9e1f1
  2. 0334a35

Commits on Feb 13, 2020

  1. 62e5e0c
  2. 69df78b
  3. v0.0.6

    mikegerber committed Feb 13, 2020
    123ee61
  4. 📄 Update license (Fixes #35)

    Set copyright owner name. Also, going along the lines of "update the year when substantial revision of the work happened", set the copyright years. The latter may not be necessary, because "life of author + 70 years" or something.
    mikegerber authored Feb 13, 2020
    fb53884

Commits on May 31, 2020

  1. e03ff40

Commits on Jun 4, 2020

  1. Merge pull request #39 from OCR-D/dont-install-test

    setup.py: exclude "test", not "tests", from installation
    mikegerber authored Jun 4, 2020
    c6ced9b

Commits on Jul 21, 2020

  1. 7dff778
  2. 8ab57e4
  3. 🐛 Fix test file path

    mikegerber committed Jul 21, 2020
    027fcd7
  4. 9ea50e2
  5. 7584d01

Commits on Jul 22, 2020

  1. 93190fa

Commits on Jul 23, 2020

  1. 4eb4f97
  2. d9afb05
  3. 0a9dbd0

Commits on Aug 6, 2020

  1. 046e3e8
  2. Set pcGtsId

    Newest OCR-D validation checks PAGE-XML pcGtsId against METS file/@id.
    Set the pcGtsId here correctly.

    Fixes #40.
    mikegerber committed Aug 6, 2020
    7da45a0
  3. Merge pull request #41 from OCR-D/fix/set-pcgtsid

    Set pcGtsId
    mikegerber authored Aug 6, 2020
    4c85b83
  4. 📦 v0.0.7

    mikegerber committed Aug 6, 2020
    8641011
  5. f6dfedf
  6. f746b73
  7. Merge pull request #42 from OCR-D/file-ids-and-such

    use make_file_id and assert_file_grp_cardinality
    mikegerber authored Aug 6, 2020
    210c126

Commits on Sep 3, 2020

  1. c417a0a
  2. 7705374
  3. bb9b1ab

Commits on Sep 24, 2020

  1. getLogger per method

    kba committed Sep 24, 2020
    e4982af

Commits on Oct 1, 2020

  1. 3156121
  2. a5d46f0
  3. Merge pull request #45 from OCR-D/getlogger

    getLogger per method
    mikegerber authored Oct 1, 2020
    04e950a

Commits on Oct 2, 2020

  1. af211d2

Commits on Oct 8, 2020

  1. 795826f

Commits on Nov 25, 2020

  1. 0e59c23
  2. 8fcd331
  3. 15bcfde
  4. 📦 v1.0.0

    mikegerber committed Nov 25, 2020
    448a5b0
  5. 1c7fcda

Commits on Dec 17, 2020

  1. df53087
  2. fe973e5

Commits on Dec 22, 2020

  1. 83adfcf
  2. fix typos

    kba committed Dec 22, 2020
    d6804bd
  3. fdd30eb

Commits on Dec 28, 2020

  1. 00e43b1

Commits on Jan 15, 2021

  1. Merge pull request #50 from OCR-D/add-calamari-version

    add version of calamari in --version output
    mikegerber authored Jan 15, 2021
    962f115

Commits on Jan 19, 2021

  1. Merge pull request #52 from OCR-D/checkpoint_dir

    Checkpoint dir
    mikegerber authored Jan 19, 2021
    e7fb432

Commits on Jan 20, 2021

  1. Merge pull request #49 from OCR-D/fix-48

    check for empty line image, ht @andbue, fix #48
    mikegerber authored Jan 20, 2021
    a014bab
37 changes: 27 additions & 10 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -1,25 +1,42 @@
version: 2.1
orbs:
codecov: codecov/codecov@1.0.5
codecov: codecov/codecov@3.3.0

jobs:

build-python36:
test:
parameters:
python-image:
type: string
docker:
- image: ubuntu:18.04
- image: << parameters.python-image >>
environment:
- LC_ALL=C.UTF-8
- PYTHONIOENCODING: utf-8
steps:
- run: apt-get update ; apt-get install -y make git curl python3 python3-pip wget imagemagick locales
- run: locale-gen "en_US.UTF-8"; update-locale LC_ALL="en_US.UTF-8"
- checkout
- restore_cache:
keys:
- v01-pydeps-<< parameters.python-image >>-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
- v01-pydeps-<< parameters.python-image >>
paths:
- "~/.cache/pip"
- run: pip3 install --upgrade pip
- run: make install PIP_INSTALL="pip3 install"
- run: pip3 install -r requirements-test.txt
- run: make coverage LC_ALL=en_US.utf8
- run: make install deps-test-ubuntu PIP_INSTALL="pip3 install"
- run: make coverage
- codecov/upload
- save_cache:
key: v01-pydeps-<< parameters.python-image >>-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
paths:
- "~/.cache/pip"

workflows:
build:
jobs:
- build-python36
- test:
filters:
branches:
ignore:
- screenshots
matrix:
parameters:
python-image: ["python:3.8", "python:3.9", "python:3.10", "python:3.11"]
14 changes: 0 additions & 14 deletions .coveragerc

This file was deleted.

29 changes: 29 additions & 0 deletions .editorconfig
@@ -0,0 +1,29 @@
root = true

[*]
charset = utf-8
end_of_line = lf
indent_size = 4
indent_style = space
insert_final_newline = true
trim_trailing_whitespace = true
max_line_length = 88
tab_width = 4

[{*.cfg, *.ini, *.html, *.yaml, *.yml}]
indent_size = 2

[*.json]
indent_size = 2
insert_final_newline = true

# trailing spaces in markdown indicate word wrap
[*.md]
trim_trailing_whitespace = false

[*.py]
multi_line_output = 3
include_trailing_comma = True
force_grid_wrap = 0
use_parentheses = True
ensure_newline_before_comments = True
44 changes: 44 additions & 0 deletions .github/workflows/docker.yml
@@ -0,0 +1,44 @@
name: CD

on:
push:
branches: [ "master" ]
workflow_dispatch: # run manually

jobs:

build:
runs-on: ubuntu-latest
permissions:
packages: write
contents: read
steps:
- name: Checkout
uses: actions/checkout@v4
with:
# we need tags for docker version tagging
fetch-tags: true
fetch-depth: 0
- # Activate cache export feature to reduce build time of images
name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKERIO_USERNAME }}
password: ${{ secrets.DOCKERIO_PASSWORD }}
- name: Build the Docker image
# build both tags at the same time
run: make docker DOCKER_TAG="docker.io/ocrd/calamari -t ghcr.io/ocr-d/calamari"
- name: Test the Docker image
run: docker run --rm ocrd/calamari ocrd-calamari-recognize -h
- name: Push to Dockerhub
run: docker push docker.io/ocrd/calamari
- name: Push to Github Container Registry
run: docker push ghcr.io/ocr-d/calamari
1 change: 1 addition & 0 deletions .gitignore
@@ -110,3 +110,4 @@ venv.bak/
/actevedef_718448162*
/repo
/test/assets
gt4histocr-calamari*
2 changes: 1 addition & 1 deletion .idea/.gitignore

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .idea/inspectionProfiles/profiles_settings.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .idea/misc.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .idea/modules.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .idea/ocrd_calamari.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion .idea/vcs.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

32 changes: 32 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,32 @@
repos:
- hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-json
- id: check-toml
- id: check-yaml
- id: check-added-large-files
- id: check-ast
repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
- hooks:
- id: black
repo: https://github.com/psf/black
rev: 24.4.2
- hooks:
- args:
- --fix
- --exit-non-zero-on-fix
id: ruff
repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.5.3
- hooks:
- additional_dependencies:
- types-setuptools
id: mypy
repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.10.1
- hooks:
- id: pre-commit-update
repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
rev: v0.3.3post1
34 changes: 24 additions & 10 deletions Dockerfile
@@ -1,21 +1,35 @@
FROM ocrd/core
MAINTAINER OCR-D
ARG DOCKER_BASE_IMAGE
FROM $DOCKER_BASE_IMAGE
ARG VCS_REF
ARG BUILD_DATE
LABEL \
maintainer="https://ocr-d.de/kontakt" \
org.label-schema.vcs-ref=$VCS_REF \
org.label-schema.vcs-url="https://github.com/OCR-D/ocrd_calamari" \
org.label-schema.build-date=$BUILD_DATE \
org.opencontainers.image.vendor="DFG-Funded Initiative for Optical Character Recognition Development" \
org.opencontainers.image.title="ocrd_calamari" \
org.opencontainers.image.description="OCR-D compliant workspace processor for the functionality of Calamari OCR" \
org.opencontainers.image.source="https://github.com/OCR-D/ocrd_calamari" \
org.opencontainers.image.documentation="https://github.com/OCR-D/ocrd_calamari/blob/${VCS_REF}/README.md" \
org.opencontainers.image.revision=$VCS_REF \
org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.base.name=$DOCKER_BASE_IMAGE
ENV DEBIAN_FRONTEND noninteractive
ENV PYTHONIOENCODING utf8
ENV LC_ALL C.UTF-8
ENV LANG C.UTF-8

WORKDIR /build
WORKDIR /build/calamari
COPY Makefile .
COPY setup.py .
COPY pyproject.toml .
COPY ocrd-tool.json .
COPY requirements.txt .
COPY README.md .
COPY ocrd_calamari ocrd_calamari
COPY ocrd_calamari ./ocrd_calamari
RUN make install
RUN rm -rf /build/calamari

RUN pip3 install --upgrade pip && \
pip3 install . && \
pip3 check

ENTRYPOINT ["/usr/local/bin/ocrd-calamari-recognize"]

WORKDIR /data
VOLUME ["/data"]
2 changes: 1 addition & 1 deletion LICENSE
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2018-2020 Konstantin Baierer, Mike Gerber

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
58 changes: 39 additions & 19 deletions Makefile
@@ -3,28 +3,37 @@ PIP_INSTALL = pip3 install
GIT_CLONE = git clone
PYTHON = python3
PYTEST_ARGS = -W 'ignore::DeprecationWarning' -W 'ignore::FutureWarning'
MODEL = qurator-gt4histocr-1.0
EXAMPLE = actevedef_718448162.first-page+binarization+segmentation

# BEGIN-EVAL makefile-parser --make-help Makefile

DOCKER_BASE_IMAGE = docker.io/ocrd/core-cuda-tf2:v2.70.0
DOCKER_TAG = 'ocrd/calamari'

help:
@echo ""
@echo " Targets"
@echo ""
@echo " install Install ocrd_calamari"
@echo " gt4histocr-calamari Get GT4HistOCR Calamari model (from SBB)"
@echo " actevedef_718448162 Download example data"
@echo " $(MODEL) Get Calamari model (from SBB)"
@echo " example Download example data"
@echo " deps-test Install testing python deps via pip"
@echo " repo/assets Clone OCR-D/assets to ./repo/assets"
@echo " test/assets Setup test assets"
@echo " assets-clean Remove symlinks in test/assets"
@echo " test Run unit tests"
@echo " coverage Run unit tests and determine test coverage"
@echo " docker Build Docker image"
@echo ""
@echo " Variables"
@echo ""
@echo " PYTHON '$(PYTHON)'"
@echo " PIP_INSTALL '$(PIP_INSTALL)'"
@echo " GIT_CLONE '$(GIT_CLONE)'"
@echo " PYTHON '$(PYTHON)'"
@echo " PIP_INSTALL '$(PIP_INSTALL)'"
@echo " GIT_CLONE '$(GIT_CLONE)'"
@echo " MODEL '$(MODEL)'"
@echo " DOCKER_TAG '$(DOCKER_TAG)'"
@echo " DOCKER_BASE_IMAGE '$(DOCKER_BASE_IMAGE)'"

# END-EVAL

@@ -34,17 +43,18 @@ install:


# Get GT4HistOCR Calamari model (from SBB)
gt4histocr-calamari:
mkdir gt4histocr-calamari
cd gt4histocr-calamari && \
wget https://qurator-data.de/calamari-models/GT4HistOCR/model.tar.xz && \
tar xfv model.tar.xz && \
rm model.tar.xz

# Download example data
actevedef_718448162:
wget https://qurator-data.de/examples/actevedef_718448162.zip && \
unzip actevedef_718448162.zip
$(MODEL):
ocrd resmgr download ocrd-calamari-recognize $@

# Download example data (for the README)
example: $(EXAMPLE)

$(EXAMPLE):
wget -c https://qurator-data.de/examples/$(EXAMPLE).zip -O $(EXAMPLE).zip.tmp
mv $(EXAMPLE).zip.tmp $(EXAMPLE).zip
unzip $(EXAMPLE).zip
rm $(EXAMPLE).zip



@@ -54,7 +64,10 @@ actevedef_718448162:

# Install testing python deps via pip
deps-test:
$(PIP) install -r requirements_test.txt
$(PIP_INSTALL) -r requirements-dev.txt

deps-test-ubuntu: deps-test
apt-get install -y make git curl wget imagemagick


# Clone OCR-D/assets to ./repo/assets
@@ -73,15 +86,22 @@ assets-clean:
rm -rf test/assets

# Run unit tests
test: test/assets gt4histocr-calamari
test: test/assets $(MODEL)
# declare -p HTTP_PROXY
$(PYTHON) -m pytest --continue-on-collection-errors test $(PYTEST_ARGS)

# Run unit tests and determine test coverage
coverage: test/assets gt4histocr-calamari
coverage: test/assets $(MODEL)
coverage erase
make test PYTHON="coverage run"
coverage report
coverage html

.PHONY: assets-clean test
docker:
docker build \
--build-arg DOCKER_BASE_IMAGE=$(DOCKER_BASE_IMAGE) \
--build-arg VCS_REF=$$(git rev-parse --short HEAD) \
--build-arg BUILD_DATE=$$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
-t $(DOCKER_TAG) .

.PHONY: install assets-clean deps-test test coverage $(MODEL) example docker
24 changes: 17 additions & 7 deletions README-DEV.md
@@ -2,21 +2,31 @@ Testing
-------
In a Python 3 virtualenv:

~~~
```
pip install -e .
pip install -r requirements-test.txt
pip install -r requirements-dev.txt
make test
~~~
```

Releasing
---------
* Update `ocrd-tool.json` version
* Update `setup.py` version
* `git commit -m 'v<version>'`
* Update `ocrd-tool.json` version (the `setup.py` version is read from this)
* `git add` the `ocrd-tool.json` file and `git commit -m 'v<version>'`
* `git tag -m 'v<version>' 'v<version>'`
* `git push --tags`
* `git push; git push --tags`
* Wait and check if tests on CircleCI are OK
* Do a release on GitHub

### Uploading to PyPI
* `rm -rf dist/` or backup if `dist/` exists already
* In the virtualenv: `python setup.py sdist bdist_wheel`
* `twine upload dist/ocrd_calamari-<version>*`


How to use pre-commit
---------------------

This project optionally uses [pre-commit](https://pre-commit.com) to check commits. To use it:

- Install pre-commit, e.g. `pip install -r requirements-dev.txt`
- Install the repo-local git hooks: `pre-commit install`
39 changes: 21 additions & 18 deletions README.md
@@ -8,9 +8,9 @@

## Introduction

This offers a OCR-D compliant workspace processor for the functionality of Calamari OCR.
**ocrd_calamari** offers a [OCR-D](https://ocr-d.de) compliant workspace processor for the functionality of Calamari OCR. It uses OCR-D workspaces (METS) with [PAGE XML](https://github.com/PRImA-Research-Lab/PAGE-XML) documents as input and output.

This processor only operates on the text line level and so needs a line segmentation (and by extension a binarized
This processor only operates on the text line level and so needs a line segmentation (and by extension a binarized
image) as its input.

In addition to the line text it may also output word and glyph segmentation
@@ -22,15 +22,17 @@ segmentation and the glyph positions. The provided glyph and word segmentation
can be used for text extraction and highlighting, but is probably not useful for
further image-based processing.

![Example output as viewed in PAGE Viewer](https://github.com/OCR-D/ocrd_calamari/raw/screenshots/output-in-page-viewer.jpg)

## Installation

### From PyPI

```
```sh
pip install ocrd_calamari
```

### From Repo
### From the git repository

```sh
pip install .
@@ -41,28 +43,29 @@ pip install .
Download models trained on GT4HistOCR data:

```
make gt4histocr-calamari
ls gt4histocr-calamari
make qurator-gt4histocr-1.0
ls .local/share/ocrd-resources/ocrd-calamari-recognize/*
```

Manual download: [model.tar.xz](https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz)

## Example Usage
Before using `ocrd-calamari-recognize` get some example data and model, and
prepare the document for OCR:
Before using `ocrd-calamari-recognize` get some example data and model:

```
# Download model and example data
make gt4histocr-calamari
make actevedef_718448162
# Create binarized images and line segmentation using other OCR-D projects
cd actevedef_718448162
ocrd-olena-binarize -p '{ "impl": "sauvola-ms-split" }' -I OCR-D-IMG -O OCR-D-IMG-BINPAGE,OCR-D-IMG-BIN
ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE -O OCR-D-SEG-REGION
ocrd-tesserocr-segment-line -I OCR-D-SEG-REGION -O OCR-D-SEG-LINE
make qurator-gt4histocr-1.0
make example
```

Finally recognize the text using ocrd_calamari and the downloaded model:
The example already contains a binarized and line-segmented page, so we are ready to go. Recognize
the text using ocrd_calamari and the downloaded model:

```
ocrd-calamari-recognize -p '{ "checkpoint": "../gt4histocr-calamari/*.ckpt.json" }' -I OCR-D-SEG-LINE -O OCR-D-OCR-CALAMARI
cd actevedef_718448162.first-page+binarization+segmentation
ocrd-calamari-recognize \
-P checkpoint_dir qurator-gt4histocr-1.0 \
-I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALAMARI
```

You may want to have a look at the [ocrd-tool.json](ocrd_calamari/ocrd-tool.json) descriptions
4 changes: 1 addition & 3 deletions ocrd_calamari/__init__.py
@@ -1,5 +1,3 @@
__all__ = [
'CalamariRecognize'
]
__all__ = ["CalamariRecognize"]

from .recognize import CalamariRecognize
2 changes: 1 addition & 1 deletion ocrd_calamari/cli.py
@@ -1,6 +1,6 @@
import click

from ocrd.decorators import ocrd_cli_options, ocrd_cli_wrap_processor

from ocrd_calamari.recognize import CalamariRecognize


4 changes: 2 additions & 2 deletions ocrd_calamari/config.py
@@ -1,5 +1,5 @@
import json

from pkg_resources import resource_string

OCRD_TOOL = json.loads(resource_string(__name__, 'ocrd-tool.json').decode('utf8'))
TF_CPP_MIN_LOG_LEVEL = '3' # '3' == ERROR
OCRD_TOOL = json.loads(resource_string(__name__, "ocrd-tool.json").decode("utf8"))
41 changes: 41 additions & 0 deletions ocrd_calamari/fix_calamari1_model.py
@@ -0,0 +1,41 @@
import json
import re
from copy import deepcopy
from glob import glob

import click

from ocrd_calamari.util import working_directory


@click.command
@click.argument("checkpoint_dir")
def fix_calamari1_model(checkpoint_dir):
"""
Fix old Calamari 1 models.
This currently means fixing regexen in "replacements" to have their global flags
in front of the rest of the regex.
"""
with working_directory(checkpoint_dir):
for fn in glob("*.json"):
with open(fn, "r") as fp:
j = json.load(fp)
old_j = deepcopy(j)

for v in j["model"].values():
if not isinstance(v, dict):
continue
for child in v.get("children", []):
for replacement in child.get("replacements", []):
# Move global flags in front
replacement["old"] = re.sub(
r"^(.*)\(\?u\)$", r"(?u)\1", replacement["old"]
)

if j == old_j:
print(f"{fn} unchanged.")
else:
with open(fn, "w") as fp:
json.dump(j, fp, indent=2)
print(f"{fn} fixed.")
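The regex rewrite at the core of `fix_calamari1_model` can be tried in isolation. The following is a minimal sketch, not part of the diff: the helper name `move_global_flag_to_front` and the sample patterns are made up for illustration. Python 3.11 and later reject global inline flags that are not at the start of an expression, which is presumably why old model files need this fix.

```python
import re


def move_global_flag_to_front(pattern: str) -> str:
    """Move a trailing "(?u)" global flag to the start of the pattern."""
    return re.sub(r"^(.*)\(\?u\)$", r"(?u)\1", pattern)


# Hypothetical sample patterns, not taken from a real model file
fixed = move_global_flag_to_front(r"\s+(?u)")
unchanged = move_global_flag_to_front(r"(?u)\s+")

print(fixed)
print(unchanged)

# The rewritten pattern compiles on current Python versions
re.compile(fixed)
```

A pattern that already starts with the flag does not match the substitution and is returned as-is, which is what lets the script report files as "unchanged".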
125 changes: 119 additions & 6 deletions ocrd_calamari/ocrd-tool.json
@@ -1,6 +1,6 @@
{
"git_url": "https://github.com/kba/ocrd_calamari",
"version": "0.0.5",
"git_url": "https://github.com/OCR-D/ocrd_calamari",
"version": "1.0.6",
"tools": {
"ocrd-calamari-recognize": {
"executable": "ocrd-calamari-recognize",
@@ -18,9 +18,13 @@
"OCR-D-OCR-CALAMARI"
],
"parameters": {
"checkpoint": {
"description": "The calamari model files (*.ckpt.json)",
"type": "string", "format": "file", "cacheable": true
"checkpoint_dir": {
"description": "The directory containing calamari model files (*.ckpt.json). Uses all checkpoints in that directory",
"type": "string",
"format": "uri",
"content-type": "text/directory",
"cacheable": true,
"default": "qurator-gt4histocr-1.0"
},
"voter": {
"description": "The voting algorithm to use",
@@ -38,7 +42,116 @@
"default": 0.001,
"description": "Only include glyph alternatives with confidences above this threshold"
}
}
},
"resources": [
{
"url": "https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/model.tar.xz",
"type": "archive",
"name": "qurator-gt4histocr-1.0",
"description": "Calamari model trained with GT4HistOCR",
"size": 90275264,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models_experimental/releases/download/v0.0.1-pre1/c1_fraktur19-1.tar.gz",
"type": "archive",
"name": "zpd-fraktur19",
"description": "Model trained on 19th century german fraktur",
"path_in_archive": "c1_fraktur19-1",
"size": 86009886,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models_experimental/releases/download/v0.0.1-pre1/c1_latin-script-hist-3.tar.gz",
"type": "archive",
"name": "zpd-latin-script-hist-3",
"path_in_archive": "c1_latin-script-hist-3",
"description": "Model trained on historical latin-script texts",
"size": 88416863,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/antiqua_historical.zip",
"type": "archive",
"name": "antiqua_historical",
"path_in_archive": "antiqua_historical",
"description": "Antiqua parts of GT4HistOCR from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale, NFC)",
"size": 89615540,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/antiqua_historical_ligs.zip",
"type": "archive",
"name": "antiqua_historical_ligs",
"path_in_archive": "antiqua_historical_ligs",
"description": "Antiqua parts of GT4HistOCR with enriched ligatures from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale, NFC)",
"size": 87540762,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/fraktur_19th_century.zip",
"type": "archive",
"name": "fraktur_19th_century",
"path_in_archive": "fraktur_19th_century",
"description": "Fraktur 19th century parts of GT4HistOCR mixed with Fraktur data from Archiscribe and jze from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale and nlbin, NFC)",
"size": 83895140,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/fraktur_historical.zip",
"type": "archive",
"name": "fraktur_historical",
"path_in_archive": "fraktur_historical",
"description": "Fraktur parts of GT4HistOCR from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale, NFC)",
"size": 87807639,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/fraktur_historical_ligs.zip",
"type": "archive",
"name": "fraktur_historical_ligs",
"path_in_archive": "fraktur_historical_ligs",
"description": "Fraktur parts of GT4HistOCR with enriched ligatures from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale, NFC)",
"size": 88039551,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/gt4histocr.zip",
"type": "archive",
"name": "gt4histocr",
"path_in_archive": "gt4histocr",
"description": "GT4HistOCR from Calamari-OCR/calamari_models (5-fold ensemble, normalized grayscale, NFC)",
"size": 90107851,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/historical_french.zip",
"type": "archive",
"name": "historical_french",
"path_in_archive": "historical_french",
"description": "17-19th century French prints from Calamari-OCR/calamari_models (5-fold ensemble, nlbin, NFC)",
"size": 87335250,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/idiotikon.zip",
"type": "archive",
"name": "idiotikon",
"path_in_archive": "idiotikon",
"description": "Antiqua UW3 finetuned on Antiqua Idiotikon dictionary with many diacritics from Calamari-OCR/calamari_models (5-fold ensemble, nlbin, NFD)",
"size": 100807764,
"version_range": ">= 1.0.0"
},
{
"url": "https://github.com/Calamari-OCR/calamari_models/releases/download/1.1/uw3-modern-english.zip",
"type": "archive",
"name": "uw3-modern-english",
"path_in_archive": "uw3-modern-english",
"description": "Antiqua UW3 corpus from Calamari-OCR/calamari_models (5-fold ensemble, nlbin, NFC)",
"size": 85413520,
"version_range": ">= 1.0.0"
}
]
}
}
}
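The `checkpoint_dir` description above states that all checkpoints in the directory are used. The `recognize.py` diff is not rendered here, so the following is only a sketch of how such a lookup could work, assuming checkpoints are identified by the `*.ckpt.json` suffix named in the description. The function `find_checkpoints` is hypothetical and not taken from the project.

```python
import tempfile
from pathlib import Path


def find_checkpoints(checkpoint_dir: str) -> list:
    """Return all *.ckpt.json files in checkpoint_dir, sorted by name."""
    return sorted(str(p) for p in Path(checkpoint_dir).glob("*.ckpt.json"))


# Demonstrate with a throwaway directory containing two fake checkpoints
# and one unrelated file that must be ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.ckpt.json", "1.ckpt.json", "notes.txt"):
        (Path(tmp) / name).write_text("{}")
    checkpoints = find_checkpoints(tmp)
    checkpoint_names = [Path(p).name for p in checkpoints]

print(checkpoint_names)
```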
585 changes: 376 additions & 209 deletions ocrd_calamari/recognize.py

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions ocrd_calamari/util.py
@@ -0,0 +1,15 @@
import os


class working_directory:
"""Context manager to temporarily change the working directory"""

def __init__(self, wd):
self.wd = wd

def __enter__(self):
self.old_wd = os.getcwd()
os.chdir(self.wd)

def __exit__(self, etype, value, traceback):
os.chdir(self.old_wd)
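For comparison, the same behaviour can be written with `contextlib.contextmanager`. This is an alternative formulation for illustration, not the code from the diff; the `try`/`finally` makes the restore step explicit, and the temporary directory is only there to have something to change into.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(wd):
    """Temporarily change the working directory, restoring it on exit."""
    old_wd = os.getcwd()
    os.chdir(wd)
    try:
        yield
    finally:
        # Runs even if the body of the with-block raises
        os.chdir(old_wd)


before = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.path.realpath(os.getcwd())
    tmp_real = os.path.realpath(tmp)
after = os.getcwd()
```

After the inner `with` block, `after` equals `before` again, so code such as `glob("*.json")` can rely on relative paths only inside the block.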
93 changes: 93 additions & 0 deletions pyproject.toml
@@ -0,0 +1,93 @@
[build-system]
requires = ["setuptools>=61.0.0", "wheel", "setuptools-ocrd"]

[project]
name = "ocrd_calamari"
authors = [
{name = "Mike Gerber", email = "mike.gerber@sbb.spk-berlin.de"},
{name = "Konstantin Baierer", email = "unixprog@gmail.com"},
]
description = "Recognize text using Calamari OCR and the OCR-D framework"
readme = "README.md"
license.file = "LICENSE"
requires-python = ">=3.8"
keywords = ["ocr", "ocr-d", "calamari-ocr"]

dynamic = ["version", "dependencies", "optional-dependencies"]

# https://pypi.org/classifiers/
classifiers = [
"Development Status :: 5 - Production/Stable",
"Environment :: Console",
"Intended Audience :: Science/Research",
"Intended Audience :: Other Audience",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Topic :: Text Processing",
]

[project.scripts]
ocrd-calamari-recognize = "ocrd_calamari.cli:ocrd_calamari_recognize"
fix-calamari1-model = "ocrd_calamari.fix_calamari1_model:fix_calamari1_model"

[project.urls]
Homepage = "https://github.com/OCR-D/ocrd_calamari"
Repository = "https://github.com/OCR-D/ocrd_calamari.git"


[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
optional-dependencies.dev = {file = ["requirements-dev.txt"]}

[tool.setuptools.package-data]
"*" = ["*.json"]

[tool.setuptools.packages.find]
where = ["."]
include = ["ocrd_calamari"]

[tool.pytest.ini_options]
minversion = 6.0
addopts = "--strict-markers"
markers = [
"integration: integration tests",
]


[tool.mypy]
plugins = ["numpy.typing.mypy_plugin"]

ignore_missing_imports = true


strict = true

disallow_subclassing_any = false
# ❗ error: Class cannot subclass "Processor" (has type "Any")
disallow_any_generics = false
disallow_untyped_defs = false
disallow_untyped_calls = false


[tool.ruff.lint]
select = ["E", "F", "I"]


[tool.coverage.run]
branch = true
source = [
"ocrd_calamari"
]

[tool.coverage.report]
exclude_also = [
"if self\\.debug",
"pragma: no cover",
"raise NotImplementedError",
"if __name__ == .__main__.:",
]
ignore_errors = true
omit = [
"ocrd_calamari/cli.py"
]
10 changes: 10 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,10 @@
pytest
coverage
pytest-cov
pytest-mypy
types-setuptools
black
pre-commit

ruff ; python_version >= "3.7"
pytest-ruff ; python_version >= "3.7"
2 changes: 0 additions & 2 deletions requirements-test.txt

This file was deleted.

6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,6 +1,6 @@
tensorflow >= 2.5.0, < 2.16
numpy
tensorflow-gpu == 1.15.*
calamari-ocr == 0.3.5
calamari-ocr == 1.0.*, >= 1.0.7
setuptools >= 41.0.0 # tensorboard depends on this, but why do we get an error at runtime?
click
ocrd >= 2.2.1
ocrd >= 2.54.0
26 changes: 0 additions & 26 deletions setup.py

This file was deleted.

10 changes: 4 additions & 6 deletions test/base.py
@@ -1,9 +1,7 @@
# pylint: disable=unused-import
from test.assets import assets

import os
import sys
from ocrd_utils import initLogging

from test.assets import assets
initLogging()

PWD = os.path.dirname(os.path.realpath(__file__))
sys.path.append(PWD + '/../ocrd')
__all__ = ["assets"]
186 changes: 124 additions & 62 deletions test/test_recognize.py
@@ -1,24 +1,49 @@
import logging
import os
import shutil
import subprocess
import urllib.request
from lxml import etree
from glob import glob
import tempfile

import pytest
from lxml import etree
from ocrd.resolver import Resolver

from ocrd_calamari import CalamariRecognize

from .base import assets

METS_KANT = assets.url_of(
"kant_aufklaerung_1784-page-region-line-word_glyph/data/mets.xml"
)
WORKSPACE_DIR = tempfile.mkdtemp(prefix="test-ocrd-calamari-")
CHECKPOINT_DIR = os.getenv("MODEL", "qurator-gt4histocr-1.0")
DEBUG = os.getenv("DEBUG", False)


def page_namespace(tree):
"""Return the PAGE content namespace used in the given ElementTree.
This relies on the assumption that, in any given PAGE content file, the root element
has the local name "PcGts". We do not check if the file uses a valid PAGE
namespace.
"""
root_name = etree.QName(tree.getroot().tag)
if root_name.localname == "PcGts":
return root_name.namespace
else:
raise ValueError("Not a PAGE tree")

METS_KANT = assets.url_of('kant_aufklaerung_1784-page-block-line-word_glyph/data/mets.xml')
WORKSPACE_DIR = '/tmp/test-ocrd-calamari'
CHECKPOINT = os.path.join(os.getcwd(), 'gt4histocr-calamari/*.ckpt.json')

# Because XML namespace versions are so much fun, we not only use one, we use TWO!
NSMAP = { "pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" }
NSMAP_GT = { "pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" }
def assertFileContains(fn, text):
"""Assert that the given file contains a given string."""
with open(fn, "r", encoding="utf-8") as f:
assert text in f.read()


def assertFileDoesNotContain(fn, text):
"""Assert that the given file does not contain given string."""
with open(fn, "r", encoding="utf-8") as f:
assert text not in f.read()


@pytest.fixture
@@ -28,109 +53,146 @@ def workspace():
os.makedirs(WORKSPACE_DIR)

resolver = Resolver()
workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)

# XXX Work around data bug(?):
# PAGE-XML links to OCR-D-IMG/INPUT_0017.tif, but this is nothing core can download
os.makedirs(os.path.join(WORKSPACE_DIR, 'OCR-D-IMG'))
for f in ['INPUT_0017.tif', 'INPUT_0020.tif']:
urllib.request.urlretrieve(
"https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/data/OCR-D-IMG/" + f,
os.path.join(WORKSPACE_DIR, 'OCR-D-IMG', f))
# due to core#809 this does not always work:
# workspace = resolver.workspace_from_url(METS_KANT, dst_dir=WORKSPACE_DIR)
# workaround:
shutil.rmtree(WORKSPACE_DIR)
shutil.copytree(os.path.dirname(METS_KANT), WORKSPACE_DIR)
workspace = resolver.workspace_from_url(os.path.join(WORKSPACE_DIR, "mets.xml"))

# The binarization options I have are:
#
# a. ocrd_kraken which tries to install clstm, whose installation is broken on my machine (protobuf)
# b. ocrd_olena which 1. I cannot fully install via pip and 2. whose dependency olena doesn't compile on my
# machine
# a. ocrd_kraken which tries to install clstm, whose installation is broken on my
# machine (protobuf)
# b. ocrd_olena which 1. I cannot fully install via pip and 2. whose dependency
# olena doesn't compile on my machine
# c. just fumble with the original files
#
# So I'm going for option c.
for f in ['INPUT_0017.tif', 'INPUT_0020.tif']:
ff = os.path.join(WORKSPACE_DIR, 'OCR-D-IMG', f)
subprocess.call(['convert', ff, '-threshold', '50%', ff])

# Remove GT Words and TextEquivs, to not accidentally check GT text instead of the OCR text
for of in workspace.mets.find_files(fileGrp="OCR-D-GT-SEG-LINE"):
for imgf in workspace.mets.find_files(fileGrp="OCR-D-IMG"):
imgf = workspace.download_file(imgf)
path = os.path.join(workspace.directory, imgf.local_filename)
subprocess.call(["mogrify", "-threshold", "50%", path])

# Remove GT Words and TextEquivs, to not accidentally check GT text instead of the
# OCR text
# XXX Review data again
for of in workspace.mets.find_files(fileGrp="OCR-D-GT-SEG-WORD-GLYPH"):
workspace.download_file(of)
for to_remove in ["//pc:Word", "//pc:TextEquiv"]:
for ff in glob(os.path.join(WORKSPACE_DIR, "OCR-D-GT-SEG-LINE", "*")):
tree = etree.parse(ff)
for e in tree.xpath(to_remove, namespaces=NSMAP_GT):
path = os.path.join(workspace.directory, of.local_filename)
tree = etree.parse(path)
nsmap_gt = {"pc": page_namespace(tree)}
for to_remove in ["//pc:Word", "//pc:TextEquiv"]:
for e in tree.xpath(to_remove, namespaces=nsmap_gt):
e.getparent().remove(e)
tree.write(ff, xml_declaration=True, encoding="utf-8")
tree.write(path, xml_declaration=True, encoding="utf-8")
assertFileDoesNotContain(path, "TextEquiv")

return workspace
yield workspace

if not DEBUG:
shutil.rmtree(WORKSPACE_DIR)


def test_recognize(workspace):
CalamariRecognize(
workspace,
input_file_grp="OCR-D-GT-SEG-LINE",
input_file_grp="OCR-D-GT-SEG-WORD-GLYPH",
output_file_grp="OCR-D-OCR-CALAMARI",
parameter={
"checkpoint": CHECKPOINT,
}
"checkpoint_dir": CHECKPOINT_DIR,
},
).process()
workspace.save_mets()

page1 = os.path.join(workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_0001.xml")
page1 = os.path.join(
workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_phys_0001.xml"
)
assert os.path.exists(page1)
with open(page1, "r", encoding="utf-8") as f:
assert "verſchuldeten" in f.read()
assertFileContains(page1, "verſchuldeten")


def test_recognize_should_warn_if_given_rgb_image_and_single_channel_model(
workspace, caplog
):
caplog.set_level(logging.WARNING)
CalamariRecognize(
workspace,
input_file_grp="OCR-D-GT-SEG-WORD-GLYPH",
output_file_grp="OCR-D-OCR-CALAMARI-BROKEN",
parameter={"checkpoint_dir": CHECKPOINT_DIR},
).process()

interesting_log_messages = [
t[2] for t in caplog.record_tuples if "Using raw image" in t[2]
]
assert len(interesting_log_messages) > 10 # For every line!


def test_word_segmentation(workspace):
CalamariRecognize(
workspace,
input_file_grp="OCR-D-GT-SEG-LINE",
input_file_grp="OCR-D-GT-SEG-WORD-GLYPH",
output_file_grp="OCR-D-OCR-CALAMARI",
parameter={
"checkpoint": CHECKPOINT,
"textequiv_level": "word", # Note that we're going down to word level here
}
"checkpoint_dir": CHECKPOINT_DIR,
"textequiv_level": "word", # Note that we're going down to word level here
},
).process()
workspace.save_mets()

page1 = os.path.join(workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_0001.xml")
page1 = os.path.join(
workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_phys_0001.xml"
)
assert os.path.exists(page1)
tree = etree.parse(page1)
nsmap = {"pc": page_namespace(tree)}

# The result should contain a TextLine that contains the text "December"
line = tree.xpath(".//pc:TextLine[pc:TextEquiv/pc:Unicode[contains(text(),'December')]]", namespaces=NSMAP)[0]
assert line

# The textline should a. contain multiple words and b. these should concatenate fine to produce the same line text
words = line.xpath(".//pc:Word", namespaces=NSMAP)
line = tree.xpath(
".//pc:TextLine[pc:TextEquiv/pc:Unicode[contains(text(),'December')]]",
namespaces=nsmap,
)[0]
assert line is not None

# The textline should
# a. contain multiple words and
# b. these should concatenate fine to produce the same line text
words = line.xpath(".//pc:Word", namespaces=nsmap)
assert len(words) >= 2
words_text = " ".join(word.xpath("pc:TextEquiv/pc:Unicode", namespaces=NSMAP)[0].text for word in words)
line_text = line.xpath("pc:TextEquiv/pc:Unicode", namespaces=NSMAP)[0].text
words_text = " ".join(
word.xpath("pc:TextEquiv/pc:Unicode", namespaces=nsmap)[0].text
for word in words
)
line_text = line.xpath("pc:TextEquiv/pc:Unicode", namespaces=nsmap)[0].text
assert words_text == line_text

# For extra measure, check that we're not seeing any glyphs, as we asked for textequiv_level == "word"
glyphs = tree.xpath("//pc:Glyph", namespaces=NSMAP)
# For extra measure, check that we're not seeing any glyphs, as we asked for
# textequiv_level == "word"
glyphs = tree.xpath("//pc:Glyph", namespaces=nsmap)
assert len(glyphs) == 0


def test_glyphs(workspace):
CalamariRecognize(
workspace,
input_file_grp="OCR-D-GT-SEG-LINE",
input_file_grp="OCR-D-GT-SEG-WORD-GLYPH",
output_file_grp="OCR-D-OCR-CALAMARI",
parameter={
"checkpoint": CHECKPOINT,
"textequiv_level": "glyph", # Note that we're going down to glyph level here
}
"checkpoint_dir": CHECKPOINT_DIR,
# Note that we're going down to glyph level here
"textequiv_level": "glyph",
},
).process()
workspace.save_mets()

page1 = os.path.join(workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_0001.xml")
page1 = os.path.join(
workspace.directory, "OCR-D-OCR-CALAMARI/OCR-D-OCR-CALAMARI_phys_0001.xml"
)
assert os.path.exists(page1)
tree = etree.parse(page1)
nsmap = {"pc": page_namespace(tree)}

# The result should contain a lot of glyphs
glyphs = tree.xpath("//pc:Glyph", namespaces=NSMAP)
glyphs = tree.xpath("//pc:Glyph", namespaces=nsmap)
assert len(glyphs) >= 100


# vim:tw=120:
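
The namespace detection that replaces the hard-coded NSMAP / NSMAP_GT constants in test/test_recognize.py can be tried out in isolation. The following is a minimal sketch, not the code from this diff: it assumes the standard library's xml.etree.ElementTree instead of lxml, takes the root element rather than a tree, and uses a made-up inline PAGE document (SAMPLE) in place of a real OCR-D output file.

```python
import xml.etree.ElementTree as ET

# PAGE namespace of the 2019 schema version, as it appears in the old NSMAP constant.
PAGE_2019 = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical, heavily reduced PAGE document used only for this illustration.
SAMPLE = (
    f'<PcGts xmlns="{PAGE_2019}">'
    "<Page><TextRegion><TextLine><Word/><Word/></TextLine></TextRegion></Page>"
    "</PcGts>"
)


def page_namespace(root):
    """Return the namespace of a PAGE root element.

    Like the helper in the diff, this only checks that the root element's
    local name is "PcGts"; it does not validate the namespace itself.
    """
    tag = root.tag
    if tag.startswith("{"):
        # ElementTree writes qualified names as "{namespace}localname".
        namespace, _, localname = tag[1:].partition("}")
    else:
        namespace, localname = None, tag
    if localname != "PcGts":
        raise ValueError("Not a PAGE tree")
    return namespace


root = ET.fromstring(SAMPLE)
# Build the prefix map from the document instead of hard-coding a schema version.
nsmap = {"pc": page_namespace(root)}
words = root.findall(".//pc:Word", namespaces=nsmap)
```

Deriving the prefix map from each file means the same XPath expressions work for both the ground truth and the OCR output, even when the two use different PAGE schema versions.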