diff --git a/404.html b/404.html index c9bd90a0..37312863 100644 --- a/404.html +++ b/404.html @@ -15,7 +15,7 @@
- + \ No newline at end of file diff --git a/about/activities.html b/about/activities.html index f5fe4ce7..7cb4d92b 100644 --- a/about/activities.html +++ b/about/activities.html @@ -12,13 +12,13 @@ - + -
Skip to content

Activities

Publications (assortment)

  • OCR-D & OCR4all: Two Complementary Approaches for Improved OCR of Historical Sources. Baierer, Konstantin; Büttner, Andreas; Engl, Elisabeth; Hinrichsen, Lena; Reul, Christian in 6th International Workshop on Computational History (2021). PDF
  • OCR4all - Eine semi-automatische Open-Source-Software für die OCR historischer Drucke. Wehner, Maximilian; Dahnke, Michael; Landes, Florian; Nasarek, Robert; Reul, Christian in DHd 2020 Spielräume: Digital Humanities zwischen Modellierung und Interpretation. Konferenzabstracts (2020). PDF
  • OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in ArXiv Preprints (submitted to MDPI - Applied Sciences) (2019). PDF
  • Texterkennungssoftware für historische Drucke. Wehner, Maximilian in KulturBetrieb 25 (2019). PDF
  • Further related publications can be found here.

Talks (assortment)

  • OCR4all - Insights and Prospects - Error Correcting HTR Workshop (Venice, November 2022).
  • Tagung Digitale Mediävistik. Perspektiven der Digital Humanities für die Altgermanistik (Bremen, February 2022).
  • Erschließung gedruckter und handschriftlicher Quellen mit OCR4all - dhistory (Berlin, November 2021).
  • OCR4all – Möglichkeiten und Grenzen einer Erfassung historischer Drucke und Handschriften, Faithful Transcriptions – Ein digitales Crowd-Sourcing-Projekt zu theologischen Handschriften des Mittelalters (Berlin, June 2021).
  • OCR und HTR, Mediävistisches Oberseminar (Würzburg, July 2020).
  • OCR4all – Eine (semi)automatische Open Source Software für die OCR historischer Drucke, Kolloquium Korpuslinguistik und Phonetik (Berlin, January 2020).
  • OCR4all – Eine Texterkennungssoftware für historische Drucke, Arbeitskreis Provenienzforschung e.V., Nordrhein-Westfälische Akademie der Wissenschaften (Düsseldorf, November 2019).
  • OCR4all – Erfahrungen und Evaluationsergebnisse der OCR historischer Drucke vom 15. bis 19. Jahrhundert in kritischer Betrachtung, Workshop der DHd AG OCR (Halle, November 2019).
  • OCR4all – Eine semi-automatische Open-Source-Software für die OCR historischer Drucke, Bessere Literaturversorgung der Philosophieforschung mit digitalisierten Quellen (Köln, November 2019).
  • OCR4all – Ein vollständiger OCR Workflow gekapselt in einem Open Source Tool, Praxis der Digital Humanities, Trier Center for Digital Humanities (Trier, January 2019).

Teaching (assortment)

  • A part module of the Zusatzzertifikat Digitale Kompetenz, which offers modern languages students the chance to acquire and prove competencies in the handling of digital data (University of Würzburg, since winter semester 2020).
  • Text digitization with OCR4all as part of the "Basismodul Digitalisierung" of the chair of Computer Philology and Modern German Literary Studies (University of Würzburg, since summer semester 2020).
  • Digitization of historical printed works with OCR4all as part of the “Aufbaumodul Analysepraxis der Sprachwissenschaft” of the chair of German Linguistics (University of Würzburg, summer semester 2020).
  • OCR internship for Master's students of the study program Mittelalter und Frühe Neuzeit (University of Würzburg, since winter semester 2018/19).
  • Seminar Historische Korpuslinguistik held at the Humboldt University Berlin: in cooperation with the ZPD several works on the subject of herbs from the 17th and 18th century were transcribed by Master’s students. The required calculations were processed on the servers in Würzburg, while the students were able to make the necessary corrections remotely and conveniently via a web interface after a short briefing.

Workshops (assortment)

  • Digital workshop for students and employees of the University of Würzburg and the Georg-Eckert-Institut – Leibniz-Institut für internationale Schulbuchforschung in Braunschweig (January 2021).
  • Workshop during the annual meeting of Digital Humanities in the German-speaking area (Paderborn, March 2020).
  • Workshop in the course of the teaching project “Digital Visual Studies” at the Art History Institute of the University of Zurich in the program “Stärkung der digitalen Kompetenzen im Bildungsbereich” (Zurich, October 2019).
  • Workshop for employees of the Schweizerdeutsches Idiotikon (Zurich, October 2019).
  • Two workshops in the course of the COST Action Distant Reading for European Literary History (Würzburg, April 2018 and Budapest, September 2019).
  • Train-the-Trainer workshop: training of and exchange with colleagues who themselves conduct OCR workshops using OCR4all; development of common concepts and standards (Würzburg, August 2019).
  • Workshop for students and employees of the Trier Center for Digital Humanities (Trier, January 2019).
  • Regular OCR4all workshops for students and employees of the University of Würzburg.

Application and diffusion (assortment)

  • Wide range of usage within the University of Würzburg, e.g.:
    • extensive use in the field of early modern prints conducted by the College of Middle Ages and Early Modern Period
    • bulk digital recording of Fraktur novels of the 19th century at the chair of Computer Philology (over 1,000 novels have already been processed)
    • Participation in several project proposals, including a project in cooperation with the HU Berlin for the indexing of the Digital Fairy Tale Reference Library of Jacob and Wilhelm Grimm, as well as the proposed academy project Sprachgitter digital: Historical-Critical Jean Paul Edition
  • Recently approved project Danish Neo-Latin Literature (Aarhus University, National Cultural Heritage Cluster, Danish Royal Library): High-quality recording (comprehensive GT creation and work-specific training) of 856 Danish prints from 1482 to 1600
  • MiMoText project at the Kompetenzzentrum of the University of Trier: Recording French novels of the 18th century
  • Monumenta Germaniae Historica: High-quality recording of encyclopedias from the incunabula period (cooperation with the ZPD for the preparation of a LIS project application)
  • Department of English, University of Bristol: The Literary Heritage of Anglo-Dutch Relations, 1050-1600
  • Universidad Nacional de Educación a Distancia (Madrid): Project for the recording of Latin texts of the 15th and 16th centuries
  • Kommission für bayerische Landesgeschichte an der Bayerischen Akademie der Wissenschaften: recording of various printed works and typewritten products
  • WiTTFind project at the CIS of the LMU München: Processing of different material, including strengthened typewriter pages
- +
Skip to content

Activities

Publications (assortment)

  • OCR-D & OCR4all: Two Complementary Approaches for Improved OCR of Historical Sources. Baierer, Konstantin; Büttner, Andreas; Engl, Elisabeth; Hinrichsen, Lena; Reul, Christian in 6th International Workshop on Computational History (2021). PDF
  • OCR4all - Eine semi-automatische Open-Source-Software für die OCR historischer Drucke. Wehner, Maximilian; Dahnke, Michael; Landes, Florian; Nasarek, Robert; Reul, Christian in DHd 2020 Spielräume: Digital Humanities zwischen Modellierung und Interpretation. Konferenzabstracts (2020). PDF
  • OCR4all - An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings. Reul, Christian; Christ, Dennis; Hartelt, Alexander; Balbach, Nico; Wehner, Maximilian; Springmann, Uwe; Wick, Christoph; Grundig, Christine; Büttner, Andreas; Puppe, Frank in ArXiv Preprints (submitted to MDPI - Applied Sciences) (2019). PDF
  • Texterkennungssoftware für historische Drucke. Wehner, Maximilian in KulturBetrieb 25 (2019). PDF
  • Further related publications can be found here.

Talks (assortment)

  • OCR4all - Insights and Prospects - Error Correcting HTR Workshop (Venice, November 2022).
  • Tagung Digitale Mediävistik. Perspektiven der Digital Humanities für die Altgermanistik (Bremen, February 2022).
  • Erschließung gedruckter und handschriftlicher Quellen mit OCR4all - dhistory (Berlin, November 2021).
  • OCR4all – Möglichkeiten und Grenzen einer Erfassung historischer Drucke und Handschriften, Faithful Transcriptions – Ein digitales Crowd-Sourcing-Projekt zu theologischen Handschriften des Mittelalters (Berlin, June 2021).
  • OCR und HTR, Mediävistisches Oberseminar (Würzburg, July 2020).
  • OCR4all – Eine (semi)automatische Open Source Software für die OCR historischer Drucke, Kolloquium Korpuslinguistik und Phonetik (Berlin, January 2020).
  • OCR4all – Eine Texterkennungssoftware für historische Drucke, Arbeitskreis Provenienzforschung e.V., Nordrhein-Westfälische Akademie der Wissenschaften (Düsseldorf, November 2019).
  • OCR4all – Erfahrungen und Evaluationsergebnisse der OCR historischer Drucke vom 15. bis 19. Jahrhundert in kritischer Betrachtung, Workshop der DHd AG OCR (Halle, November 2019).
  • OCR4all – Eine semi-automatische Open-Source-Software für die OCR historischer Drucke, Bessere Literaturversorgung der Philosophieforschung mit digitalisierten Quellen (Köln, November 2019).
  • OCR4all – Ein vollständiger OCR Workflow gekapselt in einem Open Source Tool, Praxis der Digital Humanities, Trier Center for Digital Humanities (Trier, January 2019).

Teaching (assortment)

  • A part module of the Zusatzzertifikat Digitale Kompetenz, which offers modern languages students the chance to acquire and prove competencies in the handling of digital data (University of Würzburg, since winter semester 2020).
  • Text digitization with OCR4all as part of the "Basismodul Digitalisierung" of the chair of Computer Philology and Modern German Literary Studies (University of Würzburg, since summer semester 2020).
  • Digitization of historical printed works with OCR4all as part of the “Aufbaumodul Analysepraxis der Sprachwissenschaft” of the chair of German Linguistics (University of Würzburg, summer semester 2020).
  • OCR internship for Master's students of the study program Mittelalter und Frühe Neuzeit (University of Würzburg, since winter semester 2018/19).
  • Seminar Historische Korpuslinguistik held at the Humboldt University Berlin: in cooperation with the ZPD several works on the subject of herbs from the 17th and 18th century were transcribed by Master’s students. The required calculations were processed on the servers in Würzburg, while the students were able to make the necessary corrections remotely and conveniently via a web interface after a short briefing.

Workshops (assortment)

  • Digital workshop for students and employees of the University of Würzburg and the Georg-Eckert-Institut – Leibniz-Institut für internationale Schulbuchforschung in Braunschweig (January 2021).
  • Workshop during the annual meeting of Digital Humanities in the German-speaking area (Paderborn, March 2020).
  • Workshop in the course of the teaching project “Digital Visual Studies” at the Art History Institute of the University of Zurich in the program “Stärkung der digitalen Kompetenzen im Bildungsbereich” (Zurich, October 2019).
  • Workshop for employees of the Schweizerdeutsches Idiotikon (Zurich, October 2019).
  • Two workshops in the course of the COST Action Distant Reading for European Literary History (Würzburg, April 2018 and Budapest, September 2019).
  • Train-the-Trainer workshop: training of and exchange with colleagues who themselves conduct OCR workshops using OCR4all; development of common concepts and standards (Würzburg, August 2019).
  • Workshop for students and employees of the Trier Center for Digital Humanities (Trier, January 2019).
  • Regular OCR4all workshops for students and employees of the University of Würzburg.

Application and diffusion (assortment)

  • Wide range of usage within the University of Würzburg, e.g.:
    • extensive use in the field of early modern prints conducted by the College of Middle Ages and Early Modern Period
    • bulk digital recording of Fraktur novels of the 19th century at the chair of Computer Philology (over 1,000 novels have already been processed)
    • Participation in several project proposals, including a project in cooperation with the HU Berlin for the indexing of the Digital Fairy Tale Reference Library of Jacob and Wilhelm Grimm, as well as the proposed academy project Sprachgitter digital: Historical-Critical Jean Paul Edition
  • Recently approved project Danish Neo-Latin Literature (Aarhus University, National Cultural Heritage Cluster, Danish Royal Library): High-quality recording (comprehensive GT creation and work-specific training) of 856 Danish prints from 1482 to 1600
  • MiMoText project at the Kompetenzzentrum of the University of Trier: Recording French novels of the 18th century
  • Monumenta Germaniae Historica: High-quality recording of encyclopedias from the incunabula period (cooperation with the ZPD for the preparation of a LIS project application)
  • Department of English, University of Bristol: The Literary Heritage of Anglo-Dutch Relations, 1050-1600
  • Universidad Nacional de Educación a Distancia (Madrid): Project for the recording of Latin texts of the 15th and 16th centuries
  • Kommission für bayerische Landesgeschichte an der Bayerischen Akademie der Wissenschaften: recording of various printed works and typewritten products
  • WiTTFind project at the CIS of the LMU München: Processing of different material, including strengthened typewriter pages
+ \ No newline at end of file diff --git a/about/ocr4all.html b/about/ocr4all.html index f0f6f2c2..dc6875e0 100644 --- a/about/ocr4all.html +++ b/about/ocr4all.html @@ -12,15 +12,15 @@ - + -
Skip to content

What is OCR4all?

OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material. At almost any stage of the workflow, the user can interact with the results in order to minimize consequential errors and optimize the end result.

Due to its comprehensible and intuitive handling, OCR4all explicitly addresses the needs of non-technical users.

With the conclusion of the second project stage of the BMBF-funded joint project Kallimachos, the software is now being established at the Center for Philology and Digitality of the University of Würzburg, which opens the program up to the widest possible user group.

Workflow

The workflow starts with the Preprocessing of the relevant image files. Layout segmentation (so-called Region Segmentation, carried out with LAREX) and Line Segmentation follow. Next is the Text Recognition, which is carried out with Calamari. The final stage is the correction of the recognized texts, the so-called Ground Truth Production. This Ground Truth is then the foundation for creating work-specific OCR models in a training module. OCR4all therefore provides a full-featured OCR workflow.

Workflow

Particularly due to the capacity to create and train work-specific text recognition models, OCR4all makes it possible to achieve high-quality results when digitizing texts in nearly all printed documents.

SegmentationCorrection

Cooperation with OCR-D

In the summer of 2020, a cooperation was arranged between OCR4all and the coordinated funding initiative for the further development of processes involving Optical Character Recognition (OCR-D).

The main goal of the DFG-funded OCR-D project was the conceptual as well as technical preparation of the mass digitization of printed texts published in German-speaking areas from the 16th to the 18th century (VD16, VD17, VD18).

For this purpose, the automatic full-text recognition, analogous to the OCR4all approach, is divided into individual process steps that can be reproduced in the open-source OCR-D software. This aims to create optimized workflows for the old prints to be processed and thus to generate scientifically usable full texts.

The aim of the cooperation is the continuous exchange of information, mainly about interfaces, scalable software implementations, and the creation and provision of GT, but also about upcoming developments in the OCR field. Furthermore, it strives to achieve a technical convergence of the two projects. For this purpose, OCR4all will implement the OCR-D specifications in its OCR solution and realize its interfaces for OCR-D tools. With OCR4all's internal use of OCR-D solutions, OCR4all users will benefit from the extended selection of tools and the associated possibilities, whereas OCR-D will have a broader scope and, through simplified access, will also reach new user groups inside and outside VD mass digitization.

Reporting (assortment)

Cite

If you are using OCR4all please cite the corresponding paper:

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A.,
+    
Skip to content

What is OCR4all?

OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material. At almost any stage of the workflow, the user can interact with the results in order to minimize consequential errors and optimize the end result.

Due to its comprehensible and intuitive handling, OCR4all explicitly addresses the needs of non-technical users.

With the conclusion of the second project stage of the BMBF-funded joint project Kallimachos, the software is now being established at the Center for Philology and Digitality of the University of Würzburg, which opens the program up to the widest possible user group.

Workflow

The workflow starts with the Preprocessing of the relevant image files. Layout segmentation (so-called Region Segmentation, carried out with LAREX) and Line Segmentation follow. Next is the Text Recognition, which is carried out with Calamari. The final stage is the correction of the recognized texts, the so-called Ground Truth Production. This Ground Truth is then the foundation for creating work-specific OCR models in a training module. OCR4all therefore provides a full-featured OCR workflow.

Workflow

Particularly due to the capacity to create and train work-specific text recognition models, OCR4all makes it possible to achieve high-quality results when digitizing texts in nearly all printed documents.

SegmentationCorrection

Cooperation with OCR-D

In the summer of 2020, a cooperation was arranged between OCR4all and the coordinated funding initiative for the further development of processes involving Optical Character Recognition (OCR-D).

The main goal of the DFG-funded OCR-D project was the conceptual as well as technical preparation of the mass digitization of printed texts published in German-speaking areas from the 16th to the 18th century (VD16, VD17, VD18).

For this purpose, the automatic full-text recognition, analogous to the OCR4all approach, is divided into individual process steps that can be reproduced in the open-source OCR-D software. This aims to create optimized workflows for the old prints to be processed and thus to generate scientifically usable full texts.

The aim of the cooperation is the continuous exchange of information, mainly about interfaces, scalable software implementations, and the creation and provision of GT, but also about upcoming developments in the OCR field. Furthermore, it strives to achieve a technical convergence of the two projects. For this purpose, OCR4all will implement the OCR-D specifications in its OCR solution and realize its interfaces for OCR-D tools. With OCR4all's internal use of OCR-D solutions, OCR4all users will benefit from the extended selection of tools and the associated possibilities, whereas OCR-D will have a broader scope and, through simplified access, will also reach new user groups inside and outside VD mass digitization.

Reporting (assortment)

Cite

If you are using OCR4all please cite the corresponding paper:

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A.,
 Puppe, F.: OCR4all — An open-source tool providing a (semi-) automatic OCR workflow for historical printings,
-Applied Sciences 9(22) (2019)

Funding

- +Applied Sciences 9(22) (2019)

Funding

+ \ No newline at end of file diff --git a/about/projects.html b/about/projects.html index f67fe4ce..9a2a1803 100644 --- a/about/projects.html +++ b/about/projects.html @@ -12,13 +12,13 @@ - + -
Skip to content

Projects

OCR4all-libraries

OCR4all-libraries – Full-Text Transformation of Historical Collections (DFG, 2021-23)

Camerarius digital

Project homepage (DFG, 2021-24)

Narragonien digital

Project homepage (BMBF, 2014-2019)

- +
Skip to content

Projects

OCR4all-libraries

OCR4all-libraries – Full-Text Transformation of Historical Collections (DFG, 2021-23)

Camerarius digital

Project homepage (DFG, 2021-24)

Narragonien digital

Project homepage (BMBF, 2014-2019)

+ \ No newline at end of file diff --git a/about/team.html b/about/team.html index be2eed84..65c90576 100644 --- a/about/team.html +++ b/about/team.html @@ -12,13 +12,13 @@ - + -
Skip to content

Team

Project lead

  • Dr. Christian Reul 📧

User support

  • Florian Langhanki 📧

Development

  • Dr. Herbert Baier Saip (OCR4all back end) 📧
  • Maximilian Nöth (OCR4all front end, LAREX and distribution) 📧
  • Kevin Chadbourne (LAREX)
  • Andreas Büttner (Calamari)

Miscellaneous

  • Prof. Dr. Frank Puppe (Funding, Ideas and Feedback)
  • Dr. Uwe Springmann (Ideas and Feedback)
  • Kristof Korwisi (Usability)
  • Raphaëlle Jung (Guides and Illustrations)

Former project staff

  • Nico Balbach (OCR4all and LAREX)
  • Dennis Christ (OCR4all)
  • Annika Müller (User support)
  • Björn Eyselein (Artifactory and distribution via Docker)
  • Christine Grundig (Ideas and Feedback)
  • Alexander Hartelt (OCR4all)
  • Yannik Herbst (OCR4all and distribution via VirtualBox)
  • Isabel Müller (Website)
  • Maximilian Wehner (User Support)
  • Dr. Christoph Wick (Calamari)
- +
Skip to content

Team

Project lead

  • Dr. Christian Reul 📧

User support

  • Florian Langhanki 📧

Development

  • Dr. Herbert Baier Saip (OCR4all back end) 📧
  • Maximilian Nöth (OCR4all front end, LAREX and distribution) 📧
  • Kevin Chadbourne (LAREX)
  • Andreas Büttner (Calamari)

Miscellaneous

  • Prof. Dr. Frank Puppe (Funding, Ideas and Feedback)
  • Dr. Uwe Springmann (Ideas and Feedback)
  • Kristof Korwisi (Usability)
  • Raphaëlle Jung (Guides and Illustrations)

Former project staff

  • Nico Balbach (OCR4all and LAREX)
  • Dennis Christ (OCR4all)
  • Annika Müller (User support)
  • Björn Eyselein (Artifactory and distribution via Docker)
  • Christine Grundig (Ideas and Feedback)
  • Alexander Hartelt (OCR4all)
  • Yannik Herbst (OCR4all and distribution via VirtualBox)
  • Isabel Müller (Website)
  • Maximilian Wehner (User Support)
  • Dr. Christoph Wick (Calamari)
+ \ No newline at end of file diff --git a/assets/about_activities.md.D4dF5vOZ.js b/assets/about_activities.md.D1i_SJGN.js similarity index 99% rename from assets/about_activities.md.D4dF5vOZ.js rename to assets/about_activities.md.D1i_SJGN.js index 9df70897..14e11586 100644 --- a/assets/about_activities.md.D4dF5vOZ.js +++ b/assets/about_activities.md.D1i_SJGN.js @@ -1 +1 @@ -import{_ as r,c as t,j as e,a,t as n,a4 as o,o as s}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Activities","description":"","frontmatter":{"title":"Activities"},"headers":[],"relativePath":"about/activities.md","filePath":"about/activities.md","lastUpdated":1724226141000}'),l={name:"about/activities.md"},h={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=o('

Publications (assortment)

Talks (assortment)

Teaching (assortment)

Workshops (assortment)

Application and diffusion (assortment)

',10);function d(i,f,g,p,m,b){return s(),t("div",null,[e("h1",h,[a(n(i.$frontmatter.title)+" ",1),u]),c])}const v=r(l,[["render",d]]);export{w as __pageData,v as default}; +import{_ as r,c as t,j as e,a,t as n,a4 as o,o as s}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Activities","description":"","frontmatter":{"title":"Activities"},"headers":[],"relativePath":"about/activities.md","filePath":"about/activities.md","lastUpdated":1724832369000}'),l={name:"about/activities.md"},h={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=o('

Publications (assortment)

Talks (assortment)

Teaching (assortment)

Workshops (assortment)

Application and diffusion (assortment)

',10);function d(i,f,g,p,m,b){return s(),t("div",null,[e("h1",h,[a(n(i.$frontmatter.title)+" ",1),u]),c])}const v=r(l,[["render",d]]);export{w as __pageData,v as default}; diff --git a/assets/about_activities.md.D4dF5vOZ.lean.js b/assets/about_activities.md.D1i_SJGN.lean.js similarity index 90% rename from assets/about_activities.md.D4dF5vOZ.lean.js rename to assets/about_activities.md.D1i_SJGN.lean.js index 355147a5..f304eef3 100644 --- a/assets/about_activities.md.D4dF5vOZ.lean.js +++ b/assets/about_activities.md.D1i_SJGN.lean.js @@ -1 +1 @@ -import{_ as r,c as t,j as e,a,t as n,a4 as o,o as s}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Activities","description":"","frontmatter":{"title":"Activities"},"headers":[],"relativePath":"about/activities.md","filePath":"about/activities.md","lastUpdated":1724226141000}'),l={name:"about/activities.md"},h={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=o("",10);function d(i,f,g,p,m,b){return s(),t("div",null,[e("h1",h,[a(n(i.$frontmatter.title)+" ",1),u]),c])}const v=r(l,[["render",d]]);export{w as __pageData,v as default}; +import{_ as r,c as t,j as e,a,t as n,a4 as o,o as s}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Activities","description":"","frontmatter":{"title":"Activities"},"headers":[],"relativePath":"about/activities.md","filePath":"about/activities.md","lastUpdated":1724832369000}'),l={name:"about/activities.md"},h={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=o("",10);function d(i,f,g,p,m,b){return s(),t("div",null,[e("h1",h,[a(n(i.$frontmatter.title)+" ",1),u]),c])}const v=r(l,[["render",d]]);export{w as __pageData,v as default}; diff --git a/assets/about_ocr4all.md.KnWC4XWQ.js b/assets/about_ocr4all.md.OICOy45v.js similarity index 99% 
rename from assets/about_ocr4all.md.KnWC4XWQ.js rename to assets/about_ocr4all.md.OICOy45v.js index c6cb8c53..4792b927 100644 --- a/assets/about_ocr4all.md.KnWC4XWQ.js +++ b/assets/about_ocr4all.md.OICOy45v.js @@ -1,3 +1,3 @@ -import{_ as r,c as a,j as e,a as i,t as o,a4 as n,o as l}from"./chunks/framework.CI6U-QuP.js";const s="/images/about/ocr4all/workflow.png",h="/images/about/ocr4all/ocr4all-complex.png",c="/images/about/ocr4all/larex-corr.png",R=JSON.parse('{"title":"What is OCR4all?","description":"","frontmatter":{"title":"What is OCR4all?"},"headers":[],"relativePath":"about/ocr4all.md","filePath":"about/ocr4all.md","lastUpdated":1724226141000}'),p={name:"about/ocr4all.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),f=n('

OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material. At almost any stage of the workflow, the user can interact with the results in order to minimize consequential errors and optimize the end result.

Due to its comprehensible and intuitive handling, OCR4all explicitly addresses the needs of non-technical users.

With the conclusion of the second project stage of the BMBF-funded joint project Kallimachos, the software is now being established at the Center for Philology and Digitality of the University of Würzburg, which opens the program up to the widest possible user group.

Workflow

The workflow starts with the Preprocessing of the relevant image files. Layout segmentation (so-called Region Segmentation, carried out with LAREX) and Line Segmentation follow. Next is the Text Recognition, which is carried out with Calamari. The final stage is the correction of the recognized texts, the so-called Ground Truth Production. This Ground Truth is then the foundation for creating work-specific OCR models in a training module. OCR4all therefore provides a full-featured OCR workflow.

Workflow

Particularly due to the capacity to create and train work-specific text recognition models, OCR4all makes it possible to achieve high-quality results when digitizing texts in nearly all printed documents.

SegmentationCorrection

Cooperation with OCR-D

In the summer of 2020, a cooperation was arranged between OCR4all and the coordinated funding initiative for the further development of processes involving Optical Character Recognition (OCR-D).

The main goal of the DFG-funded OCR-D project was the conceptual as well as technical preparation of the mass digitization of printed texts published in German-speaking areas from the 16th to the 18th century (VD16, VD17, VD18).

For this purpose, the automatic full-text recognition, analogous to the OCR4all approach, is divided into individual process steps that can be reproduced in the open-source OCR-D software. This aims to create optimized workflows for the old prints to be processed and thus to generate scientifically usable full texts.

The aim of the cooperation is the continuous exchange of information, mainly about interfaces, scalable software implementations, and the creation and provision of GT, but also about upcoming developments in the OCR field. Furthermore, it strives to achieve a technical convergence of the two projects. For this purpose, OCR4all will implement the OCR-D specifications in its OCR solution and realize its interfaces for OCR-D tools. With OCR4all's internal use of OCR-D solutions, OCR4all users will benefit from the extended selection of tools and the associated possibilities, whereas OCR-D will have a broader scope and, through simplified access, will also reach new user groups inside and outside VD mass digitization.

Reporting (assortment)

Cite

If you are using OCR4all please cite the corresponding paper:

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A.,
+import{_ as r,c as a,j as e,a as i,t as o,a4 as n,o as l}from"./chunks/framework.CI6U-QuP.js";const s="/images/about/ocr4all/workflow.png",h="/images/about/ocr4all/ocr4all-complex.png",c="/images/about/ocr4all/larex-corr.png",R=JSON.parse('{"title":"What is OCR4all?","description":"","frontmatter":{"title":"What is OCR4all?"},"headers":[],"relativePath":"about/ocr4all.md","filePath":"about/ocr4all.md","lastUpdated":1724832369000}'),p={name:"about/ocr4all.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),f=n('

OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material. At pretty much any stage of the workflow the user can interact with the results in order to minimize consequential errors and optimize the end result.

Due to its comprehensible and intuitive handling, OCR4all explicitly addresses the needs of non-technical users.

With the conclusion of the second stage of the BMBF-funded joint project Kallimachos, the software is now being established at the Center for Philology and Digitality of the University of Würzburg, which opens the program up to the widest possible user group.

Workflow

The workflow starts with the Preprocessing of the relevant image files. Layout segmentation follows, consisting of the so-called Region Segmentation (carried out with LAREX) and Line Segmentation. Next is the Text Recognition, which is carried out with Calamari. The final stage is the correction of the recognized texts, the so-called Ground Truth Production. This Ground Truth is then the foundation for creating work-specific OCR models in a training module. OCR4all thus covers a full-featured OCR workflow.

Workflow

Particularly because it can create and train work-specific text recognition models, OCR4all makes it possible to achieve high-quality results when digitizing nearly all printed documents.

SegmentationCorrection

Cooperation with OCR-D

In the summer of 2020, a co-operation between OCR4all and the coordinated funding initiative for further development of processes involving Optical Character Recognition (OCR-D) was arranged.

The main goal of the DFG-funded OCR-D project was the conceptual as well as technical preparation of the mass digitization of printed texts published in German-speaking areas from the 16th to the 18th century (VD16, VD17, VD18).

For this purpose, automatic full-text recognition is divided, analogous to the OCR4all approach, into individual process steps that can be reproduced in the open-source OCR-D software. The aim is to create optimized workflows for the old prints to be processed and thus to generate full texts suitable for scholarly use.

The co-operation aims at a continuous exchange of information, mainly about interfaces, scalable software implementations, and the creation and provision of ground truth (GT), as well as about upcoming developments in the OCR field. Furthermore, it strives to achieve a technical convergence of the two projects. For this purpose, OCR4all will implement the OCR-D specifications in its OCR solution and implement its interfaces for OCR-D tools. With OCR4all's internal use of OCR-D solutions, OCR4all users will benefit from the extended selection of tools and the associated possibilities, whereas OCR-D will gain a broader scope and, through simplified access, will also reach new user groups inside and outside VD mass digitization.

Reporting (assortment)

Cite

If you are using OCR4all, please cite the corresponding paper:

Reul, C., Christ, D., Hartelt, A., Balbach, N., Wehner, M., Springmann, U., Wick, C., Grundig, C., Büttner, A.,
 Puppe, F.: OCR4all — An open-source tool providing a (semi-) automatic OCR workflow for historical printings,
 Applied Sciences 9(22) (2019)

Funding

`,20);function g(t,m,w,b,k,C){return l(),a("div",null,[e("h1",d,[i(o(t.$frontmatter.title)+" ",1),u]),f])}const O=r(p,[["render",g]]);export{R as __pageData,O as default}; diff --git a/assets/about_ocr4all.md.KnWC4XWQ.lean.js b/assets/about_ocr4all.md.OICOy45v.lean.js similarity index 92% rename from assets/about_ocr4all.md.KnWC4XWQ.lean.js rename to assets/about_ocr4all.md.OICOy45v.lean.js index dd57ccd1..7cfdb613 100644 --- a/assets/about_ocr4all.md.KnWC4XWQ.lean.js +++ b/assets/about_ocr4all.md.OICOy45v.lean.js @@ -1 +1 @@ -import{_ as r,c as a,j as e,a as i,t as o,a4 as n,o as l}from"./chunks/framework.CI6U-QuP.js";const s="/images/about/ocr4all/workflow.png",h="/images/about/ocr4all/ocr4all-complex.png",c="/images/about/ocr4all/larex-corr.png",R=JSON.parse('{"title":"What is OCR4all?","description":"","frontmatter":{"title":"What is OCR4all?"},"headers":[],"relativePath":"about/ocr4all.md","filePath":"about/ocr4all.md","lastUpdated":1724226141000}'),p={name:"about/ocr4all.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),f=n("",20);function g(t,m,w,b,k,C){return l(),a("div",null,[e("h1",d,[i(o(t.$frontmatter.title)+" ",1),u]),f])}const O=r(p,[["render",g]]);export{R as __pageData,O as default}; +import{_ as r,c as a,j as e,a as i,t as o,a4 as n,o as l}from"./chunks/framework.CI6U-QuP.js";const s="/images/about/ocr4all/workflow.png",h="/images/about/ocr4all/ocr4all-complex.png",c="/images/about/ocr4all/larex-corr.png",R=JSON.parse('{"title":"What is OCR4all?","description":"","frontmatter":{"title":"What is OCR4all?"},"headers":[],"relativePath":"about/ocr4all.md","filePath":"about/ocr4all.md","lastUpdated":1724832369000}'),p={name:"about/ocr4all.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),f=n("",20);function 
g(t,m,w,b,k,C){return l(),a("div",null,[e("h1",d,[i(o(t.$frontmatter.title)+" ",1),u]),f])}const O=r(p,[["render",g]]);export{R as __pageData,O as default}; diff --git a/assets/about_projects.md.Cd7nA-A5.js b/assets/about_projects.md.BN7d1s9v.js similarity index 96% rename from assets/about_projects.md.Cd7nA-A5.js rename to assets/about_projects.md.BN7d1s9v.js index 2845c664..b288628f 100644 --- a/assets/about_projects.md.Cd7nA-A5.js +++ b/assets/about_projects.md.BN7d1s9v.js @@ -1 +1 @@ -import{_ as r,c as t,j as a,a as i,t as o,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const P=JSON.parse('{"title":"Projects","description":"","frontmatter":{"title":"Projects"},"headers":[],"relativePath":"about/projects.md","filePath":"about/projects.md","lastUpdated":1724226141000}'),s={name:"about/projects.md"},c={id:"frontmatter-title",tabindex:"-1"},d=a("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l('

OCR4all-libraries

OCR4all-libraries – Full-Text Transformation of Historical Collections (DFG, 2021-23)

Camerarius digital

Project homepage (DFG, 2021-24)

Narragonien digital

Project homepage (BMBF, 2014-2019)

',6);function p(e,_,g,m,f,u){return n(),t("div",null,[a("h1",c,[i(o(e.$frontmatter.title)+" ",1),d]),h])}const j=r(s,[["render",p]]);export{P as __pageData,j as default}; +import{_ as r,c as t,j as a,a as i,t as o,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const P=JSON.parse('{"title":"Projects","description":"","frontmatter":{"title":"Projects"},"headers":[],"relativePath":"about/projects.md","filePath":"about/projects.md","lastUpdated":1724832369000}'),s={name:"about/projects.md"},c={id:"frontmatter-title",tabindex:"-1"},d=a("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l('

OCR4all-libraries

OCR4all-libraries – Full-Text Transformation of Historical Collections (DFG, 2021-23)

Camerarius digital

Project homepage (DFG, 2021-24)

Narragonien digital

Project homepage (BMBF, 2014-2019)

',6);function p(e,_,g,m,f,u){return n(),t("div",null,[a("h1",c,[i(o(e.$frontmatter.title)+" ",1),d]),h])}const j=r(s,[["render",p]]);export{P as __pageData,j as default}; diff --git a/assets/about_projects.md.Cd7nA-A5.lean.js b/assets/about_projects.md.BN7d1s9v.lean.js similarity index 90% rename from assets/about_projects.md.Cd7nA-A5.lean.js rename to assets/about_projects.md.BN7d1s9v.lean.js index bc32d9ff..b8bb619f 100644 --- a/assets/about_projects.md.Cd7nA-A5.lean.js +++ b/assets/about_projects.md.BN7d1s9v.lean.js @@ -1 +1 @@ -import{_ as r,c as t,j as a,a as i,t as o,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const P=JSON.parse('{"title":"Projects","description":"","frontmatter":{"title":"Projects"},"headers":[],"relativePath":"about/projects.md","filePath":"about/projects.md","lastUpdated":1724226141000}'),s={name:"about/projects.md"},c={id:"frontmatter-title",tabindex:"-1"},d=a("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l("",6);function p(e,_,g,m,f,u){return n(),t("div",null,[a("h1",c,[i(o(e.$frontmatter.title)+" ",1),d]),h])}const j=r(s,[["render",p]]);export{P as __pageData,j as default}; +import{_ as r,c as t,j as a,a as i,t as o,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const P=JSON.parse('{"title":"Projects","description":"","frontmatter":{"title":"Projects"},"headers":[],"relativePath":"about/projects.md","filePath":"about/projects.md","lastUpdated":1724832369000}'),s={name:"about/projects.md"},c={id:"frontmatter-title",tabindex:"-1"},d=a("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l("",6);function p(e,_,g,m,f,u){return n(),t("div",null,[a("h1",c,[i(o(e.$frontmatter.title)+" ",1),d]),h])}const j=r(s,[["render",p]]);export{P as __pageData,j as default}; diff --git a/assets/about_team.md.DJARzRrq.js b/assets/about_team.md.S7gO9Vzt.js similarity index 97% rename from 
assets/about_team.md.DJARzRrq.js rename to assets/about_team.md.S7gO9Vzt.js index 4004c9b0..89bf0f0d 100644 --- a/assets/about_team.md.DJARzRrq.js +++ b/assets/about_team.md.S7gO9Vzt.js @@ -1 +1 @@ -import{_ as r,c as t,j as e,a as l,t as i,a4 as n,o}from"./chunks/framework.CI6U-QuP.js";const C=JSON.parse('{"title":"Team","description":"","frontmatter":{"title":"Team"},"headers":[],"relativePath":"about/team.md","filePath":"about/team.md","lastUpdated":1724226141000}'),s={name:"about/team.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=n('

Project lead

  • Dr. Christian Reul 📧

User support

  • Florian Langhanki 📧

Development

  • Dr. Herbert Baier Saip (OCR4all back end) 📧
  • Maximilian Nöth (OCR4all front end, LAREX and distribution) 📧
  • Kevin Chadbourne (LAREX)
  • Andreas Büttner (Calamari)

Miscellaneous

  • Prof. Dr. Frank Puppe (Funding, Ideas and Feedback)
  • Dr. Uwe Springmann (Ideas and Feedback)
  • Kristof Korwisi (Usability)
  • Raphaëlle Jung (Guides and Illustrations)

Former project staff

  • Nico Balbach (OCR4all and LAREX)
  • Dennis Christ (OCR4all)
  • Annika Müller (User support)
  • Björn Eyselein (Artifactory and distribution via Docker)
  • Christine Grundig (Ideas and Feedback)
  • Alexander Hartelt (OCR4all)
  • Yannik Herbst (OCR4all and distribution via VirtualBox)
  • Isabel Müller (Website)
  • Maximilian Wehner (User Support)
  • Dr. Christoph Wick (Calamari)
',10);function h(a,m,b,f,p,_){return o(),t("div",null,[e("h1",d,[l(i(a.$frontmatter.title)+" ",1),u]),c])}const g=r(s,[["render",h]]);export{C as __pageData,g as default}; +import{_ as r,c as t,j as e,a as l,t as i,a4 as n,o}from"./chunks/framework.CI6U-QuP.js";const C=JSON.parse('{"title":"Team","description":"","frontmatter":{"title":"Team"},"headers":[],"relativePath":"about/team.md","filePath":"about/team.md","lastUpdated":1724832369000}'),s={name:"about/team.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=n('

Project lead

  • Dr. Christian Reul 📧

User support

  • Florian Langhanki 📧

Development

  • Dr. Herbert Baier Saip (OCR4all back end) 📧
  • Maximilian Nöth (OCR4all front end, LAREX and distribution) 📧
  • Kevin Chadbourne (LAREX)
  • Andreas Büttner (Calamari)

Miscellaneous

  • Prof. Dr. Frank Puppe (Funding, Ideas and Feedback)
  • Dr. Uwe Springmann (Ideas and Feedback)
  • Kristof Korwisi (Usability)
  • Raphaëlle Jung (Guides and Illustrations)

Former project staff

  • Nico Balbach (OCR4all and LAREX)
  • Dennis Christ (OCR4all)
  • Annika Müller (User support)
  • Björn Eyselein (Artifactory and distribution via Docker)
  • Christine Grundig (Ideas and Feedback)
  • Alexander Hartelt (OCR4all)
  • Yannik Herbst (OCR4all and distribution via VirtualBox)
  • Isabel Müller (Website)
  • Maximilian Wehner (User Support)
  • Dr. Christoph Wick (Calamari)
',10);function h(a,m,b,f,p,_){return o(),t("div",null,[e("h1",d,[l(i(a.$frontmatter.title)+" ",1),u]),c])}const g=r(s,[["render",h]]);export{C as __pageData,g as default}; diff --git a/assets/about_team.md.DJARzRrq.lean.js b/assets/about_team.md.S7gO9Vzt.lean.js similarity index 90% rename from assets/about_team.md.DJARzRrq.lean.js rename to assets/about_team.md.S7gO9Vzt.lean.js index 240050fd..1f21b2a7 100644 --- a/assets/about_team.md.DJARzRrq.lean.js +++ b/assets/about_team.md.S7gO9Vzt.lean.js @@ -1 +1 @@ -import{_ as r,c as t,j as e,a as l,t as i,a4 as n,o}from"./chunks/framework.CI6U-QuP.js";const C=JSON.parse('{"title":"Team","description":"","frontmatter":{"title":"Team"},"headers":[],"relativePath":"about/team.md","filePath":"about/team.md","lastUpdated":1724226141000}'),s={name:"about/team.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=n("",10);function h(a,m,b,f,p,_){return o(),t("div",null,[e("h1",d,[l(i(a.$frontmatter.title)+" ",1),u]),c])}const g=r(s,[["render",h]]);export{C as __pageData,g as default}; +import{_ as r,c as t,j as e,a as l,t as i,a4 as n,o}from"./chunks/framework.CI6U-QuP.js";const C=JSON.parse('{"title":"Team","description":"","frontmatter":{"title":"Team"},"headers":[],"relativePath":"about/team.md","filePath":"about/team.md","lastUpdated":1724832369000}'),s={name:"about/team.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=n("",10);function h(a,m,b,f,p,_){return o(),t("div",null,[e("h1",d,[l(i(a.$frontmatter.title)+" ",1),u]),c])}const g=r(s,[["render",h]]);export{C as __pageData,g as default}; diff --git a/assets/beta_index.md.CMmb8dlA.js b/assets/beta_index.md.CMmb8dlA.js new file mode 100644 index 00000000..45ccb0e1 --- /dev/null +++ b/assets/beta_index.md.CMmb8dlA.js @@ -0,0 +1 @@ +import{_ 
as a,c as r,j as e,a as o,t as n,a4 as i,o as l}from"./chunks/framework.CI6U-QuP.js";const g=JSON.parse('{"title":"OCR4all 1.0","description":"","frontmatter":{"title":"OCR4all 1.0","next":{"text":"Introduction","link":"/beta/introduction"}},"headers":[],"relativePath":"beta/index.md","filePath":"beta/index.md","lastUpdated":1724832369000}'),s={name:"beta/index.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=i('

Core Features

  • 👥 User and group management. Share all your projects and files with other users and groups on the same instance.
  • ⚙️ Wide array of OCR processors, powered by OCR-D and others.
  • 🗂️ Fully fledged in-app data management. Upload and manage your images, models, workflows and datasets completely through the UI.
  • 📥 Export all uploaded and generated data, import them into other OCR4all instances or use them wherever you want.
  • 👑 Full data sovereignty. No data leaves your instance unless approved by you or the instance administrator.
  • 💪 Generate training data and use it to train or fine-tune models.
  • 🆓 OCR4all is and will always stay free and open-source.
  • and much more...

Next steps

',4);function h(t,p,m,_,f,b){return l(),r("div",null,[e("h1",d,[o(n(t.$frontmatter.title)+" ",1),u]),c])}const y=a(s,[["render",h]]);export{g as __pageData,y as default}; diff --git a/assets/beta_index.md.CMmb8dlA.lean.js b/assets/beta_index.md.CMmb8dlA.lean.js new file mode 100644 index 00000000..22468bd6 --- /dev/null +++ b/assets/beta_index.md.CMmb8dlA.lean.js @@ -0,0 +1 @@ +import{_ as a,c as r,j as e,a as o,t as n,a4 as i,o as l}from"./chunks/framework.CI6U-QuP.js";const g=JSON.parse('{"title":"OCR4all 1.0","description":"","frontmatter":{"title":"OCR4all 1.0","next":{"text":"Introduction","link":"/beta/introduction"}},"headers":[],"relativePath":"beta/index.md","filePath":"beta/index.md","lastUpdated":1724832369000}'),s={name:"beta/index.md"},d={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=i("",4);function h(t,p,m,_,f,b){return l(),r("div",null,[e("h1",d,[o(n(t.$frontmatter.title)+" ",1),u]),c])}const y=a(s,[["render",h]]);export{g as __pageData,y as default}; diff --git a/assets/beta_introduction.md.CVgK8JtF.js b/assets/beta_introduction.md.CVgK8JtF.js new file mode 100644 index 00000000..db45164a --- /dev/null +++ b/assets/beta_introduction.md.CVgK8JtF.js @@ -0,0 +1 @@ +import{_ as a,c as i,j as e,a as o,t as r,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const y=JSON.parse('{"title":"OCR4all 1.0 – Introduction","description":"","frontmatter":{"title":"OCR4all 1.0 – Introduction","next":{"text":"Setup Beta","link":"/beta/setup"}},"headers":[],"relativePath":"beta/introduction.md","filePath":"beta/introduction.md","lastUpdated":1724832369000}'),l={name:"beta/introduction.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=n('

Motivation and General Idea

  • Availability of Solutions: Numerous high-performance open-source solutions for Automatic Text Recognition (ATR) are already available, with new releases emerging continuously.
  • Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
  • Requirement: There is a need for user-friendly frameworks that facilitate the flexible, integrable, and sustainable combination and application of both existing and future ATR solutions.
  • Objective: Our goal is to empower users to perform ATR independently, achieving high-quality results.
  • Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.

OCR-D and OCR4all

  • OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
  • Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
  • Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
  • Open Source Commitment: All results from the OCR-D project are released as completely open-source.
  • OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
    • Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
    • Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.

System Architecture

  • Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
  • Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
    • Backend: Built with Java and Spring Boot.
    • Frontend: Developed using the Vue.js ecosystem.
  • Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
  • Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors.
  • Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
  • Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
  • Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.
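
The "transcript of records" idea can be illustrated with a minimal sketch. The record layout, the field names, and the processor identifiers below are assumptions made for illustration and are not taken from the OCR4all code base; the sketch only shows how storing the processor and its parameters for each step makes a run repeatable.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class StepRecord:
    """One processing step: which processor ran, with which parameters (hypothetical layout)."""
    processor: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Transcript:
    """Ordered list of the steps applied during a run."""
    steps: list = field(default_factory=list)

    def log(self, processor: str, **parameters) -> None:
        self.steps.append(StepRecord(processor, dict(parameters)))

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so identical runs yield identical text
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Transcript":
        data = json.loads(text)
        return cls([StepRecord(**step) for step in data["steps"]])


transcript = Transcript()
transcript.log("binarization", threshold=0.5)       # hypothetical processor name
transcript.log("line-segmentation", min_height=12)  # hypothetical processor name
restored = Transcript.from_json(transcript.to_json())
```

Reloading the serialized transcript yields the same sequence of steps and parameters, which is the property a reproducible workflow relies on.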

Modules

Data Management and Processing

  • Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
  • Data Sharing: Data can be shared with different users or user groups as needed.

Processors and NodeFlow

  • Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D and external options.
  • Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically.
  • NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.
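
How a well-defined processor interface can drive an automatically generated user interface is sketched below. Everything here is hypothetical: the `Parameter` and `Processor` classes, the registry, and the widget names are illustrative assumptions, not the actual OCR4all Service Provider Interface (which, per the architecture notes, lives in a Java backend).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Parameter:
    """Describes one processor parameter (hypothetical schema)."""
    name: str
    kind: type
    default: object


@dataclass(frozen=True)
class Processor:
    """A processor as a framework might see it: an id, its parameters, and a callable."""
    identifier: str
    parameters: tuple
    run: Callable[..., str]


REGISTRY: dict = {}


def register(processor: Processor) -> None:
    """Make a processor available under its identifier."""
    REGISTRY[processor.identifier] = processor


def form_fields(processor: Processor) -> list:
    """Derive UI form fields from the parameter descriptions alone."""
    widgets = {bool: "checkbox", int: "number", float: "number", str: "text"}
    return [
        {"label": p.name, "widget": widgets.get(p.kind, "text"), "value": p.default}
        for p in processor.parameters
    ]


register(Processor(
    identifier="example/upper-case",  # hypothetical processor
    parameters=(Parameter("strip", bool, True),),
    run=lambda text, strip=True: (text.strip() if strip else text).upper(),
))

fields = form_fields(REGISTRY["example/upper-case"])
result = REGISTRY["example/upper-case"].run("  abc ", strip=True)
```

Because the form is derived only from the declared parameters, adding a processor in this sketch requires no hand-written UI code.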

LAREX

  • Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
  • Visual Workflow Identification: LAREX helps users identify the most suitable workflows as a visual explanation component.

Datasets, Training, and Evaluation

  • Dataset Creation: Datasets can be created with the option to use tagging and import functionalities.
  • Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
  • Model Training: Train models on selected datasets or subsets thereof, with options for in-app usage or exporting both models and associated training data.
  • Model Evaluation: Evaluate both trained and imported models using curated datasets to ensure quality and accuracy.

Working with OCR4all 1.0

One Tool, Two Modes

Base ModePro Mode
Designed for novice users, with reduced complexity and a strongly guided, linear workflowTailored for experienced users who require more exploration and complexity
Pre-selected solutions for each processing stepUnrestricted access to all processors, parameters, and features
Pre-filtered parameters and limited access to advanced featuresSupport for identifying the best workflows and models for specific needs

INFO

Currently only pro mode is available in the beta release. The base mode will be added shortly.

Example Use Cases and Application Scenarios

Fully Automatic Mass Full-Text Digitization

  • Objective: Maximize throughput with minimal manual effort.
  • Users: Libraries and archives processing large volumes of scanned materials.
  • Approach: Use the pro mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.

Flawless Transcription of Source Material

  • Objective: Achieve maximum quality, accepting significant manual effort.
  • Users: Humanist researchers preparing text for a digital edition.
  • Approach: Utilize the base mode for iterative transcription with continually improving accuracy.

Building Corpora for Quantitative Applications

  • Objective: Maximize quality while minimizing manual effort.
  • Users: Researchers constructing corpora for training and evaluating quantitative methods.
  • Approach: Manage data and consistently retrain source-specific or mixed models using datasets and tagging functionalities.
',26);function h(t,g,p,f,m,b){return s(),i("div",null,[e("h1",d,[o(r(t.$frontmatter.title)+" ",1),c]),u])}const v=a(l,[["render",h]]);export{y as __pageData,v as default}; diff --git a/assets/beta_introduction.md.CVgK8JtF.lean.js b/assets/beta_introduction.md.CVgK8JtF.lean.js new file mode 100644 index 00000000..2d8c0f70 --- /dev/null +++ b/assets/beta_introduction.md.CVgK8JtF.lean.js @@ -0,0 +1 @@ +import{_ as a,c as i,j as e,a as o,t as r,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const y=JSON.parse('{"title":"OCR4all 1.0 – Introduction","description":"","frontmatter":{"title":"OCR4all 1.0 – Introduction","next":{"text":"Setup Beta","link":"/beta/setup"}},"headers":[],"relativePath":"beta/introduction.md","filePath":"beta/introduction.md","lastUpdated":1724832369000}'),l={name:"beta/introduction.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=n("",26);function h(t,g,p,f,m,b){return s(),i("div",null,[e("h1",d,[o(r(t.$frontmatter.title)+" ",1),c]),u])}const v=a(l,[["render",h]]);export{y as __pageData,v as default}; diff --git a/assets/beta_setup.md.CyQ7iTlu.js b/assets/beta_setup.md.CyQ7iTlu.js new file mode 100644 index 00000000..506ab02b --- /dev/null +++ b/assets/beta_setup.md.CyQ7iTlu.js @@ -0,0 +1,58 @@ +import{_ as n,c as p,j as s,a as e,t as l,a4 as t,o}from"./chunks/framework.CI6U-QuP.js";const h=JSON.parse('{"title":"OCR4all 1.0 – Setup","description":"","frontmatter":{"title":"OCR4all 1.0 – Setup"},"headers":[],"relativePath":"beta/setup.md","filePath":"beta/setup.md","lastUpdated":1724832369000}'),c={name:"beta/setup.md"},r={id:"frontmatter-title",tabindex:"-1"},i=s("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=t(`

If you want to try out the beta version of release 1.0 of OCR4all you can simply use the following Docker Compose file or download it here.

The prerequisite for this is having both Docker and Docker Compose installed.

A more in-depth installation guide will follow with the stable release of OCR4all 1.0.

WARNING

This will install a beta version of OCR4all 1.0, which may still contain some bugs and is still missing many planned features.

version: "3.9"
+
+services:
+  msa-calamari:
+    hostname: msa-calamari
+    build:
+      context: ocr4all-app-calamari-msa
+      dockerfile: Dockerfile
+      args:
+        - TAG=\${CALAMARI_TAG:-20240502}
+        - JAVA_VERSION=\${CALAMARI_JAVA_VERSION:-17}
+        - APP_VERSION=\${CALAMARI_APP_VERSION:-1.0-SNAPSHOT}
+    user: "\${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=\${CALAMARI_PROFILES:-logging-debug,msa-api,docker}
+    volumes:
+      - \${OCR4ALL_DATA:-~/ocr4all/docker/data}:/srv/ocr4all/data
+      - \${OCR4ALL_ASSEMBLE:-~/ocr4all/docker/assemble}:/srv/ocr4all/assemble
+      - \${OCR4ALL_WORKSPACE_PROJECT:-~/ocr4all/docker/workspace/projects}:/srv/ocr4all/projects
+    ports:
+      - "\${CALAMARI_API_PORT:-127.0.0.1:9092}:8080"
+  msa-ocrd:
+    hostname: msa-ocrd
+    build:
+      context: ocr4all-app-ocrd-msa
+      dockerfile: Dockerfile
+      args:
+        - TAG=\${OCRD_TAG:-2024-04-29}
+        - JAVA_VERSION=\${OCRD_JAVA_VERSION:-17}
+        - APP_VERSION=\${OCRD_APP_VERSION:-1.0-SNAPSHOT}
+    user: "\${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=\${OCRD_PROFILES:-logging-debug,msa-api,docker}
+    volumes:
+      - \${OCR4ALL_WORKSPACE_PROJECT:-~/ocr4all/docker/workspace/projects}:/srv/ocr4all/projects
+      - \${OCR4ALL_RESOURCES_ORCD:-~/ocr4all/docker/opt/ocr-d/resources}:/usr/local/share/ocrd-resources
+    ports:
+      - "\${OCRD_API_PORT:-127.0.0.1:9091}:8080"
+  server:
+    build:
+      context: ocr4all-app
+      dockerfile: Dockerfile
+      args:
+        - TAG=\${OCR4ALL_TAG:-17-jdk-slim}
+        - APP_VERSION=\${OCR4ALL_APP_VERSION:-1.0-SNAPSHOT}
+    user: "\${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=\${OCR4ALL_PROFILES:-logging-debug,server,api,documentation,docker}
+    volumes:
+      - \${OCR4ALL_HOME:-~/ocr4all/docker}:/srv/ocr4all
+    ports:
+      - "\${OCR4ALL_API_PORT:-9090}:8080"
+    depends_on:
+      - msa-calamari
+      - msa-ocrd
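
The Compose file relies on placeholders of the form `${NAME:-default}`, which Docker Compose replaces with the value of the environment variable `NAME`, or with the default if the variable is unset or empty. The following Python sketch mimics only that one substitution form to show which values result from the defaults; the `resolve` helper is written for this illustration and is not part of Docker or OCR4all.

```python
import re

# Matches ${NAME:-default}; other Compose interpolation forms are out of scope for this sketch.
_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}")


def resolve(value: str, env: dict) -> str:
    """Replace each ${NAME:-default} with env[NAME], or the default if NAME is unset or empty."""
    def substitute(match: re.Match) -> str:
        return env.get(match.group("name")) or match.group("default")
    return _PATTERN.sub(substitute, value)


# Port mappings taken from the Compose file above
port_default = resolve("${OCR4ALL_API_PORT:-9090}:8080", {})
port_custom = resolve("${OCR4ALL_API_PORT:-9090}:8080", {"OCR4ALL_API_PORT": "8000"})
calamari = resolve("${CALAMARI_API_PORT:-127.0.0.1:9092}:8080", {})
```

With the defaults, the server's port mapping resolves to `9090:8080`, while the Calamari service resolves to `127.0.0.1:9092:8080` and is therefore published on the local interface only. Setting `OCR4ALL_API_PORT` before starting the stack changes the host port without editing the file.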
`,5);function _(a,A,u,R,m,O){return o(),p("div",null,[s("h1",r,[e(l(a.$frontmatter.title)+" ",1),i]),d])}const P=n(c,[["render",_]]);export{h as __pageData,P as default}; diff --git a/assets/beta_setup.md.CyQ7iTlu.lean.js b/assets/beta_setup.md.CyQ7iTlu.lean.js new file mode 100644 index 00000000..12752741 --- /dev/null +++ b/assets/beta_setup.md.CyQ7iTlu.lean.js @@ -0,0 +1 @@ +import{_ as n,c as p,j as s,a as e,t as l,a4 as t,o}from"./chunks/framework.CI6U-QuP.js";const h=JSON.parse('{"title":"OCR4all 1.0 – Setup","description":"","frontmatter":{"title":"OCR4all 1.0 – Setup"},"headers":[],"relativePath":"beta/setup.md","filePath":"beta/setup.md","lastUpdated":1724832369000}'),c={name:"beta/setup.md"},r={id:"frontmatter-title",tabindex:"-1"},i=s("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=t("",5);function _(a,A,u,R,m,O){return o(),p("div",null,[s("h1",r,[e(l(a.$frontmatter.title)+" ",1),i]),d])}const P=n(c,[["render",_]]);export{h as __pageData,P as default}; diff --git a/assets/guide_setup-guide_linux.md.Dw_nI6Bl.js b/assets/guide_setup-guide_linux.md.BkF1FTXB.js similarity index 99% rename from assets/guide_setup-guide_linux.md.Dw_nI6Bl.js rename to assets/guide_setup-guide_linux.md.BkF1FTXB.js index e19207dd..5b5d5d5c 100644 --- a/assets/guide_setup-guide_linux.md.Dw_nI6Bl.js +++ b/assets/guide_setup-guide_linux.md.BkF1FTXB.js @@ -1,4 +1,4 @@ -import{_ as s,c as n,j as e,a as t,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Linux","description":"","frontmatter":{"lang":"en-US","title":"Linux"},"headers":[],"relativePath":"guide/setup-guide/linux.md","filePath":"guide/setup-guide/linux.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/linux.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title 
}}"'},"​",-1),d=o(`

Preparation

You have to prepare the following folder structure:

...
+import{_ as s,c as n,j as e,a as t,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Linux","description":"","frontmatter":{"lang":"en-US","title":"Linux"},"headers":[],"relativePath":"guide/setup-guide/linux.md","filePath":"guide/setup-guide/linux.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/linux.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),d=o(`

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
diff --git a/assets/guide_setup-guide_linux.md.Dw_nI6Bl.lean.js b/assets/guide_setup-guide_linux.md.BkF1FTXB.lean.js
similarity index 91%
rename from assets/guide_setup-guide_linux.md.Dw_nI6Bl.lean.js
rename to assets/guide_setup-guide_linux.md.BkF1FTXB.lean.js
index b9c18fbc..1997fa83 100644
--- a/assets/guide_setup-guide_linux.md.Dw_nI6Bl.lean.js
+++ b/assets/guide_setup-guide_linux.md.BkF1FTXB.lean.js
@@ -1 +1 @@
-import{_ as s,c as n,j as e,a as t,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Linux","description":"","frontmatter":{"lang":"en-US","title":"Linux"},"headers":[],"relativePath":"guide/setup-guide/linux.md","filePath":"guide/setup-guide/linux.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/linux.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),d=o("",18);function u(a,h,g,m,b,k){return i(),n("div",null,[e("h1",p,[t("Setup guide – "+l(a.$frontmatter.title)+" ",1),c]),d])}const y=s(r,[["render",u]]);export{v as __pageData,y as default};
+import{_ as s,c as n,j as e,a as t,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Linux","description":"","frontmatter":{"lang":"en-US","title":"Linux"},"headers":[],"relativePath":"guide/setup-guide/linux.md","filePath":"guide/setup-guide/linux.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/linux.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),d=o("",18);function u(a,h,g,m,b,k){return i(),n("div",null,[e("h1",p,[t("Setup guide – "+l(a.$frontmatter.title)+" ",1),c]),d])}const y=s(r,[["render",u]]);export{v as __pageData,y as default};
diff --git a/assets/guide_setup-guide_macos.md.3kOmZ-Mw.js b/assets/guide_setup-guide_macos.md.CZq3J_KB.js
similarity index 99%
rename from assets/guide_setup-guide_macos.md.3kOmZ-Mw.js
rename to assets/guide_setup-guide_macos.md.CZq3J_KB.js
index 26212447..fe8be1e6 100644
--- a/assets/guide_setup-guide_macos.md.3kOmZ-Mw.js
+++ b/assets/guide_setup-guide_macos.md.CZq3J_KB.js
@@ -1,4 +1,4 @@
-import{_ as s,c as t,j as a,a as n,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"macOS","description":"","frontmatter":{"lang":"en-US","title":"macOS"},"headers":[],"relativePath":"guide/setup-guide/macos.md","filePath":"guide/setup-guide/macos.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/macos.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=a("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup Guide – {{ $frontmatter.title }}"'},"​",-1),d=o(`

Preparation

You have to prepare the following folder structure:

...
+import{_ as s,c as t,j as a,a as n,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"macOS","description":"","frontmatter":{"lang":"en-US","title":"macOS"},"headers":[],"relativePath":"guide/setup-guide/macos.md","filePath":"guide/setup-guide/macos.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/macos.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=a("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup Guide – {{ $frontmatter.title }}"'},"​",-1),d=o(`

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
diff --git a/assets/guide_setup-guide_macos.md.3kOmZ-Mw.lean.js b/assets/guide_setup-guide_macos.md.CZq3J_KB.lean.js
similarity index 91%
rename from assets/guide_setup-guide_macos.md.3kOmZ-Mw.lean.js
rename to assets/guide_setup-guide_macos.md.CZq3J_KB.lean.js
index 326059c8..f35f45dd 100644
--- a/assets/guide_setup-guide_macos.md.3kOmZ-Mw.lean.js
+++ b/assets/guide_setup-guide_macos.md.CZq3J_KB.lean.js
@@ -1 +1 @@
-import{_ as s,c as t,j as a,a as n,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"macOS","description":"","frontmatter":{"lang":"en-US","title":"macOS"},"headers":[],"relativePath":"guide/setup-guide/macos.md","filePath":"guide/setup-guide/macos.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/macos.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=a("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup Guide – {{ $frontmatter.title }}"'},"​",-1),d=o("",18);function u(e,h,m,g,b,k){return i(),t("div",null,[a("h1",p,[n("Setup Guide – "+l(e.$frontmatter.title)+" ",1),c]),d])}const y=s(r,[["render",u]]);export{v as __pageData,y as default};
+import{_ as s,c as t,j as a,a as n,t as l,a4 as o,o as i}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"macOS","description":"","frontmatter":{"lang":"en-US","title":"macOS"},"headers":[],"relativePath":"guide/setup-guide/macos.md","filePath":"guide/setup-guide/macos.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/macos.md"},p={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=a("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup Guide – {{ $frontmatter.title }}"'},"​",-1),d=o("",18);function u(e,h,m,g,b,k){return i(),t("div",null,[a("h1",p,[n("Setup Guide – "+l(e.$frontmatter.title)+" ",1),c]),d])}const y=s(r,[["render",u]]);export{v as __pageData,y as default};
diff --git a/assets/guide_setup-guide_quickstart.md.DYx-lcYS.js b/assets/guide_setup-guide_quickstart.md.Bl7ugQhE.js
similarity index 96%
rename from assets/guide_setup-guide_quickstart.md.DYx-lcYS.js
rename to assets/guide_setup-guide_quickstart.md.Bl7ugQhE.js
index efd16630..7acf2ed5 100644
--- a/assets/guide_setup-guide_quickstart.md.DYx-lcYS.js
+++ b/assets/guide_setup-guide_quickstart.md.Bl7ugQhE.js
@@ -1,4 +1,4 @@
-import{_ as t,c as s,j as e,a as n,t as i,a4 as l,o}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Quickstart","description":"","frontmatter":{"lang":"en-US","title":"Quickstart"},"headers":[],"relativePath":"guide/setup-guide/quickstart.md","filePath":"guide/setup-guide/quickstart.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/quickstart.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),p=l(`
sudo docker run -p 1476:8080 \\
+import{_ as t,c as s,j as e,a as n,t as i,a4 as l,o}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Quickstart","description":"","frontmatter":{"lang":"en-US","title":"Quickstart"},"headers":[],"relativePath":"guide/setup-guide/quickstart.md","filePath":"guide/setup-guide/quickstart.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/quickstart.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),p=l(`
sudo docker run -p 1476:8080 \\
     -u \`id -u root\`:\`id -g $USER\` \\
     --name ocr4all \\
     -v $PWD/data:/var/ocr4all/data \\
diff --git a/assets/guide_setup-guide_quickstart.md.DYx-lcYS.lean.js b/assets/guide_setup-guide_quickstart.md.Bl7ugQhE.lean.js
similarity index 91%
rename from assets/guide_setup-guide_quickstart.md.DYx-lcYS.lean.js
rename to assets/guide_setup-guide_quickstart.md.Bl7ugQhE.lean.js
index 1866a608..206ae6fc 100644
--- a/assets/guide_setup-guide_quickstart.md.DYx-lcYS.lean.js
+++ b/assets/guide_setup-guide_quickstart.md.Bl7ugQhE.lean.js
@@ -1 +1 @@
-import{_ as t,c as s,j as e,a as n,t as i,a4 as l,o}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Quickstart","description":"","frontmatter":{"lang":"en-US","title":"Quickstart"},"headers":[],"relativePath":"guide/setup-guide/quickstart.md","filePath":"guide/setup-guide/quickstart.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/quickstart.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),p=l("",6);function u(a,h,_,m,g,f){return o(),s("div",null,[e("h1",d,[n(i(a.$frontmatter.title)+" ",1),c]),p])}const v=t(r,[["render",u]]);export{b as __pageData,v as default};
+import{_ as t,c as s,j as e,a as n,t as i,a4 as l,o}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Quickstart","description":"","frontmatter":{"lang":"en-US","title":"Quickstart"},"headers":[],"relativePath":"guide/setup-guide/quickstart.md","filePath":"guide/setup-guide/quickstart.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/quickstart.md"},d={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),p=l("",6);function u(a,h,_,m,g,f){return o(),s("div",null,[e("h1",d,[n(i(a.$frontmatter.title)+" ",1),c]),p])}const v=t(r,[["render",u]]);export{b as __pageData,v as default};
diff --git a/assets/guide_setup-guide_windows.md.DWUfcSce.js b/assets/guide_setup-guide_windows.md.UyjaupDu.js
similarity index 98%
rename from assets/guide_setup-guide_windows.md.DWUfcSce.js
rename to assets/guide_setup-guide_windows.md.UyjaupDu.js
index 607193d9..8154e1c7 100644
--- a/assets/guide_setup-guide_windows.md.DWUfcSce.js
+++ b/assets/guide_setup-guide_windows.md.UyjaupDu.js
@@ -1,4 +1,4 @@
-import{_ as o,c as t,j as e,a as s,t as l,a4 as i,o as n}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Windows","description":"","frontmatter":{"lang":"en-US","title":"Windows"},"headers":[],"relativePath":"guide/setup-guide/windows.md","filePath":"guide/setup-guide/windows.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/windows.md"},d={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),p=i(`

Preparation

You have to prepare the following folder structure:

...
+import{_ as o,c as t,j as e,a as s,t as l,a4 as i,o as n}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Windows","description":"","frontmatter":{"lang":"en-US","title":"Windows"},"headers":[],"relativePath":"guide/setup-guide/windows.md","filePath":"guide/setup-guide/windows.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/windows.md"},d={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),p=i(`

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
diff --git a/assets/guide_setup-guide_windows.md.DWUfcSce.lean.js b/assets/guide_setup-guide_windows.md.UyjaupDu.lean.js
similarity index 83%
rename from assets/guide_setup-guide_windows.md.DWUfcSce.lean.js
rename to assets/guide_setup-guide_windows.md.UyjaupDu.lean.js
index 4af907f0..36d27513 100644
--- a/assets/guide_setup-guide_windows.md.DWUfcSce.lean.js
+++ b/assets/guide_setup-guide_windows.md.UyjaupDu.lean.js
@@ -1 +1 @@
-import{_ as o,c as t,j as e,a as s,t as l,a4 as i,o as n}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Windows","description":"","frontmatter":{"lang":"en-US","title":"Windows"},"headers":[],"relativePath":"guide/setup-guide/windows.md","filePath":"guide/setup-guide/windows.md","lastUpdated":1724226141000}'),r={name:"guide/setup-guide/windows.md"},d={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),p=i("",21);function h(a,u,g,m,k,f){return n(),t("div",null,[e("h1",d,[s("Setup guide – "+l(a.$frontmatter.title)+" ",1),c]),p])}const v=o(r,[["render",h]]);export{w as __pageData,v as default};
+import{_ as o,c as t,j as e,a as s,t as l,a4 as i,o as n}from"./chunks/framework.CI6U-QuP.js";const w=JSON.parse('{"title":"Windows","description":"","frontmatter":{"lang":"en-US","title":"Windows"},"headers":[],"relativePath":"guide/setup-guide/windows.md","filePath":"guide/setup-guide/windows.md","lastUpdated":1724832369000}'),r={name:"guide/setup-guide/windows.md"},d={id:"setup-guide-–-frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#setup-guide-–-frontmatter-title","aria-label":'Permalink to "Setup guide – {{ $frontmatter.title }}"'},"​",-1),p=i("",21);function h(a,u,g,m,k,f){return n(),t("div",null,[e("h1",d,[s("Setup guide – "+l(a.$frontmatter.title)+" ",1),c]),p])}const v=o(r,[["render",h]]);export{w as __pageData,v as default};
diff --git a/assets/guide_user-guide_common-errors.md.CnzWhStX.js b/assets/guide_user-guide_common-errors.md.BlhiEKoZ.js
similarity index 97%
rename from assets/guide_user-guide_common-errors.md.CnzWhStX.js
rename to assets/guide_user-guide_common-errors.md.BlhiEKoZ.js
index 36eacf89..52eef517 100644
--- a/assets/guide_user-guide_common-errors.md.CnzWhStX.js
+++ b/assets/guide_user-guide_common-errors.md.BlhiEKoZ.js
@@ -1 +1 @@
-import{_ as o,c as t,j as e,a,t as i,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const _=JSON.parse('{"title":"Common errors","description":"","frontmatter":{"title":"Common errors"},"headers":[],"relativePath":"guide/user-guide/common-errors.md","filePath":"guide/user-guide/common-errors.md","lastUpdated":1724226141000}'),l={name:"guide/user-guide/common-errors.md"},c={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=n('

Warning

This page is still under construction. If you come across any problems please contact us.

Errors, frequent problems and how to avoid them

Problems with the installation and start of Docker:

  • Did you encounter problems while installing or starting Docker? You will find a detailed guide here.
  • Do you have difficulties starting the Docker container for OCR4all? Does the server fail to start? First, restart Docker (if necessary, reload the OCR4all image and reset the corresponding container, following the steps described in the OCR4all setup guide here).
  • Are you using an Apple device with an M1 / M2 chip? We currently don't offer specific images for these systems but are working on it.

Problems selecting works in 'Project Overview':

  • If available works are not displayed in 'project overview', review your folder structure and check if it is correct, following the guidelines outlined in chapter 1.2. If there is no problem with your folder structure, delete the OCR4all Docker container and re-execute the docker run... command, following the setup guide here.
  • Are you unable to select a work? Please ensure that your work/document title contains no blanks or umlauts.

Problems with Calamari recognition or training:

  • Are you seeing errors that mention AVX? If you are using an older CPU without AVX support, or a virtual machine where AVX passthrough is not enabled, you may run into errors during process execution, because official TensorFlow builds do not support these systems.

We welcome all questions and encourage you to contact us if you have any problems. Please send an email (consultation, guides, and non-technical user support) or contact us on GitHub.

',9);function h(r,m,p,f,g,b){return s(),t("div",null,[e("h1",c,[a(i(r.$frontmatter.title)+" ",1),u]),d])}const k=o(l,[["render",h]]);export{_ as __pageData,k as default}; +import{_ as o,c as t,j as e,a,t as i,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const _=JSON.parse('{"title":"Common errors","description":"","frontmatter":{"title":"Common errors"},"headers":[],"relativePath":"guide/user-guide/common-errors.md","filePath":"guide/user-guide/common-errors.md","lastUpdated":1724832369000}'),l={name:"guide/user-guide/common-errors.md"},c={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=n('

Warning

This page is still under construction. If you come across any problems please contact us.

Errors, frequent problems and how to avoid them

Problems with the installation and start of Docker:

  • Did you encounter problems while installing or starting Docker? You will find a detailed guide here.
  • Do you have difficulties starting the Docker container for OCR4all? Does the server fail to start? First, restart Docker (if necessary, reload the OCR4all image and reset the corresponding container, following the steps described in the OCR4all setup guide here).
  • Are you using an Apple device with an M1 / M2 chip? We currently don't offer specific images for these systems but are working on it.

Problems selecting works in 'Project Overview':

  • If available works are not displayed in 'project overview', review your folder structure and check if it is correct, following the guidelines outlined in chapter 1.2. If there is no problem with your folder structure, delete the OCR4all Docker container and re-execute the docker run... command, following the setup guide here.
  • Are you unable to select a work? Please ensure that your work/document title contains no blanks or umlauts.

Problems with Calamari recognition or training:

  • Are you seeing errors that mention AVX? If you are using an older CPU without AVX support, or a virtual machine where AVX passthrough is not enabled, you may run into errors during process execution, because official TensorFlow builds do not support these systems.

We welcome all questions and encourage you to contact us if you have any problems. Please send an email (consultation, guides, and non-technical user support) or contact us on GitHub.

',9);function h(r,m,p,f,g,b){return s(),t("div",null,[e("h1",c,[a(i(r.$frontmatter.title)+" ",1),u]),d])}const k=o(l,[["render",h]]);export{_ as __pageData,k as default}; diff --git a/assets/guide_user-guide_common-errors.md.CnzWhStX.lean.js b/assets/guide_user-guide_common-errors.md.BlhiEKoZ.lean.js similarity index 91% rename from assets/guide_user-guide_common-errors.md.CnzWhStX.lean.js rename to assets/guide_user-guide_common-errors.md.BlhiEKoZ.lean.js index 56abee7a..55d56500 100644 --- a/assets/guide_user-guide_common-errors.md.CnzWhStX.lean.js +++ b/assets/guide_user-guide_common-errors.md.BlhiEKoZ.lean.js @@ -1 +1 @@ -import{_ as o,c as t,j as e,a,t as i,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const _=JSON.parse('{"title":"Common errors","description":"","frontmatter":{"title":"Common errors"},"headers":[],"relativePath":"guide/user-guide/common-errors.md","filePath":"guide/user-guide/common-errors.md","lastUpdated":1724226141000}'),l={name:"guide/user-guide/common-errors.md"},c={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=n("",9);function h(r,m,p,f,g,b){return s(),t("div",null,[e("h1",c,[a(i(r.$frontmatter.title)+" ",1),u]),d])}const k=o(l,[["render",h]]);export{_ as __pageData,k as default}; +import{_ as o,c as t,j as e,a,t as i,a4 as n,o as s}from"./chunks/framework.CI6U-QuP.js";const _=JSON.parse('{"title":"Common errors","description":"","frontmatter":{"title":"Common errors"},"headers":[],"relativePath":"guide/user-guide/common-errors.md","filePath":"guide/user-guide/common-errors.md","lastUpdated":1724832369000}'),l={name:"guide/user-guide/common-errors.md"},c={id:"frontmatter-title",tabindex:"-1"},u=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),d=n("",9);function h(r,m,p,f,g,b){return s(),t("div",null,[e("h1",c,[a(i(r.$frontmatter.title)+" 
",1),u]),d])}const k=o(l,[["render",h]]);export{_ as __pageData,k as default}; diff --git a/assets/guide_user-guide_introduction.md.C5p2P9wk.js b/assets/guide_user-guide_introduction.md.Cg1IJE4k.js similarity index 98% rename from assets/guide_user-guide_introduction.md.C5p2P9wk.js rename to assets/guide_user-guide_introduction.md.Cg1IJE4k.js index 9d8699dd..9a1a47ac 100644 --- a/assets/guide_user-guide_introduction.md.C5p2P9wk.js +++ b/assets/guide_user-guide_introduction.md.Cg1IJE4k.js @@ -1 +1 @@ -import{_ as n,c as i,j as e,a as t,t as a,o as s}from"./chunks/framework.CI6U-QuP.js";const r="/images/user-guide/introduction/principal_components_ocr4all_workflow.png",x=JSON.parse('{"title":"Introduction","description":"","frontmatter":{"title":"Introduction"},"headers":[],"relativePath":"guide/user-guide/introduction.md","filePath":"guide/user-guide/introduction.md","lastUpdated":1724226141000}'),l={name:"guide/user-guide/introduction.md"},c={id:"user-guide-–-frontmatter-title",tabindex:"-1"},d=e("a",{class:"header-anchor",href:"#user-guide-–-frontmatter-title","aria-label":'Permalink to "User Guide – {{ $frontmatter.title }}"'},"​",-1),p=e("p",null,"OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software. The workflow established by OCR4all isn’t only easy to understand, but it also allows for an independent use, which makes it particularly suitable for users with no background in computer sciences, in part because it combines different tools into one consistent user interface. 
Constant switching between different software platforms is thereby rendered obsolete.",-1),h=e("p",null,"OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).",-1),u=e("p",null,[e("img",{src:r,alt:"fig. 1. Principal components of the OCR4all workflow."}),t(" fig. 1. Principal components of the OCR4all workflow.")],-1),f=e("p",null,"In part thanks to the possibility of developing and training book-specific recognition models – which can then theoretically be applied to other prints – OCR4all produces very good results when it comes to the digital recognition of about any printed text.",-1),g=e("p",null,"The following guide aims to provide an exhaustive and detailed look into OCR4all’s operation and fields of application concerning the recognition of particularly early prints. While chapter 1 covers the software’s set up and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and leads not only to a visible improvement of the results but facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. 
Finally, chapter 5 takes on the most common user problems currently known.",-1);function m(o,w,_,y,b,k){return s(),i("div",null,[e("h1",c,[t("User Guide – "+a(o.$frontmatter.title)+" ",1),d]),p,h,u,f,g])}const C=n(l,[["render",m]]);export{x as __pageData,C as default}; +import{_ as n,c as i,j as e,a as t,t as a,o as s}from"./chunks/framework.CI6U-QuP.js";const r="/images/user-guide/introduction/principal_components_ocr4all_workflow.png",x=JSON.parse('{"title":"Introduction","description":"","frontmatter":{"title":"Introduction"},"headers":[],"relativePath":"guide/user-guide/introduction.md","filePath":"guide/user-guide/introduction.md","lastUpdated":1724832369000}'),l={name:"guide/user-guide/introduction.md"},c={id:"user-guide-–-frontmatter-title",tabindex:"-1"},d=e("a",{class:"header-anchor",href:"#user-guide-–-frontmatter-title","aria-label":'Permalink to "User Guide – {{ $frontmatter.title }}"'},"​",-1),p=e("p",null,"OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software. The workflow established by OCR4all isn’t only easy to understand, but it also allows for an independent use, which makes it particularly suitable for users with no background in computer sciences, in part because it combines different tools into one consistent user interface. 
Constant switching between different software platforms is thereby rendered obsolete.",-1),h=e("p",null,"OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).",-1),u=e("p",null,[e("img",{src:r,alt:"fig. 1. Principal components of the OCR4all workflow."}),t(" fig. 1. Principal components of the OCR4all workflow.")],-1),f=e("p",null,"In part thanks to the possibility of developing and training book-specific recognition models – which can then theoretically be applied to other prints – OCR4all produces very good results when it comes to the digital recognition of about any printed text.",-1),g=e("p",null,"The following guide aims to provide an exhaustive and detailed look into OCR4all’s operation and fields of application concerning the recognition of particularly early prints. While chapter 1 covers the software’s set up and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and leads not only to a visible improvement of the results but facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. 
Finally, chapter 5 takes on the most common user problems currently known.",-1);function m(o,w,_,y,b,k){return s(),i("div",null,[e("h1",c,[t("User Guide – "+a(o.$frontmatter.title)+" ",1),d]),p,h,u,f,g])}const C=n(l,[["render",m]]);export{x as __pageData,C as default}; diff --git a/assets/guide_user-guide_introduction.md.C5p2P9wk.lean.js b/assets/guide_user-guide_introduction.md.Cg1IJE4k.lean.js similarity index 98% rename from assets/guide_user-guide_introduction.md.C5p2P9wk.lean.js rename to assets/guide_user-guide_introduction.md.Cg1IJE4k.lean.js index 9d8699dd..9a1a47ac 100644 --- a/assets/guide_user-guide_introduction.md.C5p2P9wk.lean.js +++ b/assets/guide_user-guide_introduction.md.Cg1IJE4k.lean.js @@ -1 +1 @@ -import{_ as n,c as i,j as e,a as t,t as a,o as s}from"./chunks/framework.CI6U-QuP.js";const r="/images/user-guide/introduction/principal_components_ocr4all_workflow.png",x=JSON.parse('{"title":"Introduction","description":"","frontmatter":{"title":"Introduction"},"headers":[],"relativePath":"guide/user-guide/introduction.md","filePath":"guide/user-guide/introduction.md","lastUpdated":1724226141000}'),l={name:"guide/user-guide/introduction.md"},c={id:"user-guide-–-frontmatter-title",tabindex:"-1"},d=e("a",{class:"header-anchor",href:"#user-guide-–-frontmatter-title","aria-label":'Permalink to "User Guide – {{ $frontmatter.title }}"'},"​",-1),p=e("p",null,"OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software. The workflow established by OCR4all isn’t only easy to understand, but it also allows for an independent use, which makes it particularly suitable for users with no background in computer sciences, in part because it combines different tools into one consistent user interface. 
Constant switching between different software platforms is thereby rendered obsolete.",-1),h=e("p",null,"OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).",-1),u=e("p",null,[e("img",{src:r,alt:"fig. 1. Principal components of the OCR4all workflow."}),t(" fig. 1. Principal components of the OCR4all workflow.")],-1),f=e("p",null,"In part thanks to the possibility of developing and training book-specific recognition models – which can then theoretically be applied to other prints – OCR4all produces very good results when it comes to the digital recognition of about any printed text.",-1),g=e("p",null,"The following guide aims to provide an exhaustive and detailed look into OCR4all’s operation and fields of application concerning the recognition of particularly early prints. While chapter 1 covers the software’s set up and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and leads not only to a visible improvement of the results but facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. 
Finally, chapter 5 takes on the most common user problems currently known.",-1);function m(o,w,_,y,b,k){return s(),i("div",null,[e("h1",c,[t("User Guide – "+a(o.$frontmatter.title)+" ",1),d]),p,h,u,f,g])}const C=n(l,[["render",m]]);export{x as __pageData,C as default}; +import{_ as n,c as i,j as e,a as t,t as a,o as s}from"./chunks/framework.CI6U-QuP.js";const r="/images/user-guide/introduction/principal_components_ocr4all_workflow.png",x=JSON.parse('{"title":"Introduction","description":"","frontmatter":{"title":"Introduction"},"headers":[],"relativePath":"guide/user-guide/introduction.md","filePath":"guide/user-guide/introduction.md","lastUpdated":1724832369000}'),l={name:"guide/user-guide/introduction.md"},c={id:"user-guide-–-frontmatter-title",tabindex:"-1"},d=e("a",{class:"header-anchor",href:"#user-guide-–-frontmatter-title","aria-label":'Permalink to "User Guide – {{ $frontmatter.title }}"'},"​",-1),p=e("p",null,"OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software. The workflow established by OCR4all isn’t only easy to understand, but it also allows for an independent use, which makes it particularly suitable for users with no background in computer sciences, in part because it combines different tools into one consistent user interface. 
Constant switching between different software platforms is thereby rendered obsolete.",-1),h=e("p",null,"OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).",-1),u=e("p",null,[e("img",{src:r,alt:"fig. 1. Principal components of the OCR4all workflow."}),t(" fig. 1. Principal components of the OCR4all workflow.")],-1),f=e("p",null,"In part thanks to the possibility of developing and training book-specific recognition models – which can then theoretically be applied to other prints – OCR4all produces very good results when it comes to the digital recognition of about any printed text.",-1),g=e("p",null,"The following guide aims to provide an exhaustive and detailed look into OCR4all’s operation and fields of application concerning the recognition of particularly early prints. While chapter 1 covers the software’s set up and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and leads not only to a visible improvement of the results but facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. 
Finally, chapter 5 takes on the most common user problems currently known.",-1);function m(o,w,_,y,b,k){return s(),i("div",null,[e("h1",c,[t("User Guide – "+a(o.$frontmatter.title)+" ",1),d]),p,h,u,f,g])}const C=n(l,[["render",m]]);export{x as __pageData,C as default}; diff --git a/assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.js b/assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.js similarity index 98% rename from assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.js rename to assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.js index 12dda995..badb33f2 100644 --- a/assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.js +++ b/assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.js @@ -1 +1 @@ -import{_ as i,c as o,j as e,a as r,t as a,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const s="/images/user-guide/project_start_and_overview/project_overview_settings.jpg",c="/images/user-guide/project_start_and_overview/data_conversion_request.png",p="/images/user-guide/project_start_and_overview/overview.png",y=JSON.parse('{"title":"Project Start and Overview","description":"","frontmatter":{"title":"Project Start and Overview"},"headers":[],"relativePath":"guide/user-guide/project-start-and-overview.md","filePath":"guide/user-guide/project-start-and-overview.md","lastUpdated":1724226141000}'),d={name:"guide/user-guide/project-start-and-overview.md"},u={id:"frontmatter-title",tabindex:"-1"},g=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l('

Start Docker:

  • Linux: Docker will start automatically after the computer starts
  • Docker for Windows: start Docker by clicking on the Docker icon (in ‘Programs’) – wait until “Docker is running” pops up
  • Docker Toolbox: open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

Start OCR4all:

  • Linux: open the terminal, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Windows 10 (Home, Pro, Enterprise, Education): open Windows PowerShell, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Older Versions of Windows (with Docker Toolbox): open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

After this initial installation, OCR4all can be opened in your browser at one of the following addresses, depending on your setup:

  • Linux, Docker for Windows, macOS: http://localhost:1476/ocr4all/
  • Docker Toolbox: http://192.168.99.100:1476/ocr4all/

Once OCR4all has been opened in a browser, the user will automatically land on the 'Project Overview' starting page. From there, they will be able to access several features:

  • 'Settings': This feature allows for selecting the book set to be worked on, which can be chosen from the dropdown menu found under ‘Project selection’ – the title having been previously saved as a folder under ocr4all/data/book title (see 1.2). Additionally, the ‘gray’ setting must be selected under the menu point ‘Project image type’.

Abb2.jpg

fig. 2: Project Overview settings.

  • After this initial setup, click on ‘load project’ to load the book in question into OCR4all. Because OCR4all only accepts certain file names and formats (e.g. 0001.png), a data conversion may be requested, which can be carried out directly in OCR4all (fig. 3).

  • It is irrelevant whether a PDF or individual images were placed in the 'input' folder. If possible, however, single images are usually preferred.

Abb3.png

fig. 3. Data conversion request (e.g. PDF in 'input' folder).

  • In OCR4all, all data generated during the workflow and for its functioning are saved in a single PAGE XML file per scan page and are no longer kept in the form of many individual files. If project data from previous versions is still available, it is now possible to convert the project automatically into the structure required by OCR4all.

  • The feature "Overview" provides the user with a tabular presentation of the project’s ongoing progress (fig. 4). Each row corresponds to an individual book page, labelled by a page identifier (far left column). The following columns illustrate, from left to right, the workflow’s progression. Once a particular step has been executed, it will appear as completed (green check mark) in that work stage’s specific column.

Abb4.png

fig. 4: Overview.

  • Clicking on an individual page’s identifier enables the user to check on the state of that specific page’s processing, as well as on the data generated by it, at any time during the workflow. To this effect, please use the ‘images’ column, as well as the subsequent options: ‘original’, ‘binary’, ‘gray’ and ‘noise removal’.

  • With the button "Export GT" (top right) all data created in the course of the project can be exported and packed as a zip folder within 'data'.

',17);function f(t,m,w,v,_,b){return n(),o("div",null,[e("h1",u,[r(a(t.$frontmatter.title)+" ",1),g]),h])}const j=i(d,[["render",f]]);export{y as __pageData,j as default}; +import{_ as i,c as o,j as e,a as r,t as a,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const s="/images/user-guide/project_start_and_overview/project_overview_settings.jpg",c="/images/user-guide/project_start_and_overview/data_conversion_request.png",p="/images/user-guide/project_start_and_overview/overview.png",y=JSON.parse('{"title":"Project Start and Overview","description":"","frontmatter":{"title":"Project Start and Overview"},"headers":[],"relativePath":"guide/user-guide/project-start-and-overview.md","filePath":"guide/user-guide/project-start-and-overview.md","lastUpdated":1724832369000}'),d={name:"guide/user-guide/project-start-and-overview.md"},u={id:"frontmatter-title",tabindex:"-1"},g=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l('

Start Docker:

  • Linux: Docker will start automatically after the computer starts
  • Docker for Windows: start Docker by clicking on the Docker icon (in ‘Programs’) – wait until “Docker is running” pops up
  • Docker Toolbox: open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

Start OCR4all:

  • Linux: open the terminal, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Windows 10 (Home, Pro, Enterprise, Education): open Windows PowerShell, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Older Versions of Windows (with Docker Toolbox): open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

After this initial installation, OCR4all can be opened in your browser at one of the following addresses, depending on your setup:

  • Linux, Docker for Windows, macOS: http://localhost:1476/ocr4all/
  • Docker Toolbox: http://192.168.99.100:1476/ocr4all/
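A script can confirm that the server has finished starting by requesting one of the addresses above. The following is a minimal sketch using only the Python standard library. The helper names `ocr4all_url` and `is_reachable` are hypothetical and not part of OCR4all; the host, port and path are simply those listed above.

```python
import urllib.error
import urllib.request

# Port and path as listed in the addresses above.
PORT = 1476
PATH = "/ocr4all/"


def ocr4all_url(host: str = "localhost") -> str:
    """Build the browser address for a given host (hypothetical helper)."""
    return f"http://{host}:{PORT}{PATH}"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers the request with a success status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or malformed URL.
        return False
```

Calling `is_reachable(ocr4all_url())` on Linux, Docker for Windows or macOS, or `is_reachable(ocr4all_url("192.168.99.100"))` with Docker Toolbox, should return `True` only once the container is up and answering requests.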

Once OCR4all has been opened in a browser, the user will automatically land on the 'Project Overview' starting page. From there, they will be able to access several features:

  • 'Settings': This feature allows for selecting the book set to be worked on, which can be chosen from the dropdown menu found under ‘Project selection’ – the title having been previously saved as a folder under ocr4all/data/book title (see 1.2). Additionally, the ‘gray’ setting must be selected under the menu point ‘Project image type’.

Abb2.jpg

fig. 2: Project Overview settings.

  • After this initial setup, click on ‘load project’ to load the book in question into OCR4all. Because OCR4all only accepts certain file names and formats (e.g. 0001.png), a data conversion may be requested, which can be carried out directly in OCR4all (fig. 3).

  • It is irrelevant whether a PDF or individual images were placed in the 'input' folder. If possible, however, single images are usually preferred.

Abb3.png

fig. 3. Data conversion request (e.g. PDF in 'input' folder).

  • In OCR4all, all data generated during the workflow and for its functioning are saved in a single PAGE XML file per scan page and are no longer kept in the form of many individual files. If project data from previous versions is still available, it is now possible to convert the project automatically into the structure required by OCR4all.

  • The feature "Overview" provides the user with a tabular presentation of the project’s ongoing progress (fig. 4). Each row corresponds to an individual book page, labelled by a page identifier (far left column). The following columns illustrate, from left to right, the workflow’s progression. Once a particular step has been executed, it will appear as completed (green check mark) in that work stage’s specific column.

Abb4.png

fig. 4: Overview.

  • Clicking on an individual page’s identifier enables the user to check on the state of that specific page’s processing, as well as on the data generated by it, at any time during the workflow. To this effect, please use the ‘images’ column, as well as the subsequent options: ‘original’, ‘binary’, ‘gray’ and ‘noise removal’.

  • With the button "Export GT" (top right) all data created in the course of the project can be exported and packed as a zip folder within 'data'.
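The file-name requirement mentioned in the list above can be previewed before a project is loaded. The sketch below assumes, based only on the 0001.png example, that accepted names consist of four digits followed by .png; the rule OCR4all actually applies may differ. The function name `files_needing_conversion` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: accepted names look like 0001.png (four digits, .png extension).
ACCEPTED_NAME = re.compile(r"^\d{4}\.png$")


def files_needing_conversion(input_dir: Path) -> list[str]:
    """List file names in input_dir that do not match the assumed pattern."""
    return sorted(
        entry.name
        for entry in input_dir.iterdir()
        if entry.is_file() and not ACCEPTED_NAME.match(entry.name)
    )
```

An empty result suggests that no conversion request is to be expected under this assumption; any listed file (a PDF, for instance) would likely trigger the conversion dialog shown in fig. 3.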

',17);function f(t,m,w,v,_,b){return n(),o("div",null,[e("h1",u,[r(a(t.$frontmatter.title)+" ",1),g]),h])}const j=i(d,[["render",f]]);export{y as __pageData,j as default}; diff --git a/assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.lean.js b/assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.lean.js similarity index 93% rename from assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.lean.js rename to assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.lean.js index 318d1f2e..a5451263 100644 --- a/assets/guide_user-guide_project-start-and-overview.md.D1LNg9Eo.lean.js +++ b/assets/guide_user-guide_project-start-and-overview.md.Biix-mp8.lean.js @@ -1 +1 @@ -import{_ as i,c as o,j as e,a as r,t as a,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const s="/images/user-guide/project_start_and_overview/project_overview_settings.jpg",c="/images/user-guide/project_start_and_overview/data_conversion_request.png",p="/images/user-guide/project_start_and_overview/overview.png",y=JSON.parse('{"title":"Project Start and Overview","description":"","frontmatter":{"title":"Project Start and Overview"},"headers":[],"relativePath":"guide/user-guide/project-start-and-overview.md","filePath":"guide/user-guide/project-start-and-overview.md","lastUpdated":1724226141000}'),d={name:"guide/user-guide/project-start-and-overview.md"},u={id:"frontmatter-title",tabindex:"-1"},g=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l("",17);function f(t,m,w,v,_,b){return n(),o("div",null,[e("h1",u,[r(a(t.$frontmatter.title)+" ",1),g]),h])}const j=i(d,[["render",f]]);export{y as __pageData,j as default}; +import{_ as i,c as o,j as e,a as r,t as a,a4 as l,o as n}from"./chunks/framework.CI6U-QuP.js";const 
s="/images/user-guide/project_start_and_overview/project_overview_settings.jpg",c="/images/user-guide/project_start_and_overview/data_conversion_request.png",p="/images/user-guide/project_start_and_overview/overview.png",y=JSON.parse('{"title":"Project Start and Overview","description":"","frontmatter":{"title":"Project Start and Overview"},"headers":[],"relativePath":"guide/user-guide/project-start-and-overview.md","filePath":"guide/user-guide/project-start-and-overview.md","lastUpdated":1724832369000}'),d={name:"guide/user-guide/project-start-and-overview.md"},u={id:"frontmatter-title",tabindex:"-1"},g=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),h=l("",17);function f(t,m,w,v,_,b){return n(),o("div",null,[e("h1",u,[r(a(t.$frontmatter.title)+" ",1),g]),h])}const j=i(d,[["render",f]]);export{y as __pageData,j as default}; diff --git a/assets/guide_user-guide_scan-preparation.md.BsXxbD_4.js b/assets/guide_user-guide_scan-preparation.md.DpWU4ver.js similarity index 97% rename from assets/guide_user-guide_scan-preparation.md.BsXxbD_4.js rename to assets/guide_user-guide_scan-preparation.md.DpWU4ver.js index 7969356d..163e4820 100644 --- a/assets/guide_user-guide_scan-preparation.md.BsXxbD_4.js +++ b/assets/guide_user-guide_scan-preparation.md.DpWU4ver.js @@ -1 +1 @@ -import{_ as a,c as o,j as e,a as i,t as n,o as s}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Scan and Image Preparation (ScanTailor)","description":"","frontmatter":{"title":"Scan and Image Preparation (ScanTailor)"},"headers":[],"relativePath":"guide/user-guide/scan-preparation.md","filePath":"guide/user-guide/scan-preparation.md","lastUpdated":1724226141000}'),r={name:"guide/user-guide/scan-preparation.md"},l={id:"frontmatter-title",tabindex:"-1"},h=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=e("p",null,"When it comes to early 
modern prints, the available material exists often solely in the form of facsimilia. Although they generally exhibit a good if not very good quality, their overall condition makes them rather unsuited for a direct export in OCR4all. This is the case when the image file contains, aside the mere text, pictures of the book cover or printing surface. Were those images to be binarized during the workflow, black lines will often occur which are due to contrast differences in the original image and will impair both the OCR and the segmentation. Scan rotation and the display of two book pages on the same scan are other, frequent problems.",-1),d=e("p",null,"However, these complications can be easily avoided through the appropriate preparation of the image files. Therefore, scans destined to be processed with OCR4all should ideally only feature the content of each single page meant for the recognition process. At the time, the ideal scan should also contain enough blank page space so as not to impede further steps, such as segmentation. Thus, only the page content deemed irrelevant to the recognition process should be removed while taking care to leave as much of the original scanned page as possible (concretely, this means page margins shouldn’t be entirely removed).",-1),p=e("p",null,"Theoretically, most image editors are suitable (GIMP, Adobe Photoshop, etc.). If you have a PDF available it's also possible to cut and rotate the pages with Adobe Acrobat DC (Batch Processing). However, we advise the user to employ ScanTailor which sustains a considerable data quantity and processes images quickly, efficiently and in a standardized manner. Detailed instructions can be found here.",-1),u=e("p",null,"This step is completely optional and not part of the OCR4all workflow, which is why no support will be provided here. 
Each user has to decide for himself whether additional preprocessing of this kind would be profitable for his work or even necessary.",-1);function m(t,f,g,b,y,w){return s(),o("div",null,[e("h1",l,[i(n(t.$frontmatter.title)+" ",1),h]),c,d,p,u])}const k=a(r,[["render",m]]);export{v as __pageData,k as default}; +import{_ as a,c as o,j as e,a as i,t as n,o as s}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Scan and Image Preparation (ScanTailor)","description":"","frontmatter":{"title":"Scan and Image Preparation (ScanTailor)"},"headers":[],"relativePath":"guide/user-guide/scan-preparation.md","filePath":"guide/user-guide/scan-preparation.md","lastUpdated":1724832369000}'),r={name:"guide/user-guide/scan-preparation.md"},l={id:"frontmatter-title",tabindex:"-1"},h=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=e("p",null,"When it comes to early modern prints, the available material exists often solely in the form of facsimilia. Although they generally exhibit a good if not very good quality, their overall condition makes them rather unsuited for a direct export in OCR4all. This is the case when the image file contains, aside the mere text, pictures of the book cover or printing surface. Were those images to be binarized during the workflow, black lines will often occur which are due to contrast differences in the original image and will impair both the OCR and the segmentation. Scan rotation and the display of two book pages on the same scan are other, frequent problems.",-1),d=e("p",null,"However, these complications can be easily avoided through the appropriate preparation of the image files. Therefore, scans destined to be processed with OCR4all should ideally only feature the content of each single page meant for the recognition process. At the time, the ideal scan should also contain enough blank page space so as not to impede further steps, such as segmentation. 
Thus, only the page content deemed irrelevant to the recognition process should be removed while taking care to leave as much of the original scanned page as possible (concretely, this means page margins shouldn’t be entirely removed).",-1),p=e("p",null,"Theoretically, most image editors are suitable (GIMP, Adobe Photoshop, etc.). If you have a PDF available it's also possible to cut and rotate the pages with Adobe Acrobat DC (Batch Processing). However, we advise the user to employ ScanTailor which sustains a considerable data quantity and processes images quickly, efficiently and in a standardized manner. Detailed instructions can be found here.",-1),u=e("p",null,"This step is completely optional and not part of the OCR4all workflow, which is why no support will be provided here. Each user has to decide for himself whether additional preprocessing of this kind would be profitable for his work or even necessary.",-1);function m(t,f,g,b,y,w){return s(),o("div",null,[e("h1",l,[i(n(t.$frontmatter.title)+" ",1),h]),c,d,p,u])}const k=a(r,[["render",m]]);export{v as __pageData,k as default}; diff --git a/assets/guide_user-guide_scan-preparation.md.BsXxbD_4.lean.js b/assets/guide_user-guide_scan-preparation.md.DpWU4ver.lean.js similarity index 97% rename from assets/guide_user-guide_scan-preparation.md.BsXxbD_4.lean.js rename to assets/guide_user-guide_scan-preparation.md.DpWU4ver.lean.js index 7969356d..163e4820 100644 --- a/assets/guide_user-guide_scan-preparation.md.BsXxbD_4.lean.js +++ b/assets/guide_user-guide_scan-preparation.md.DpWU4ver.lean.js @@ -1 +1 @@ -import{_ as a,c as o,j as e,a as i,t as n,o as s}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Scan and Image Preparation (ScanTailor)","description":"","frontmatter":{"title":"Scan and Image Preparation 
(ScanTailor)"},"headers":[],"relativePath":"guide/user-guide/scan-preparation.md","filePath":"guide/user-guide/scan-preparation.md","lastUpdated":1724226141000}'),r={name:"guide/user-guide/scan-preparation.md"},l={id:"frontmatter-title",tabindex:"-1"},h=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=e("p",null,"When it comes to early modern prints, the available material exists often solely in the form of facsimilia. Although they generally exhibit a good if not very good quality, their overall condition makes them rather unsuited for a direct export in OCR4all. This is the case when the image file contains, aside the mere text, pictures of the book cover or printing surface. Were those images to be binarized during the workflow, black lines will often occur which are due to contrast differences in the original image and will impair both the OCR and the segmentation. Scan rotation and the display of two book pages on the same scan are other, frequent problems.",-1),d=e("p",null,"However, these complications can be easily avoided through the appropriate preparation of the image files. Therefore, scans destined to be processed with OCR4all should ideally only feature the content of each single page meant for the recognition process. At the time, the ideal scan should also contain enough blank page space so as not to impede further steps, such as segmentation. Thus, only the page content deemed irrelevant to the recognition process should be removed while taking care to leave as much of the original scanned page as possible (concretely, this means page margins shouldn’t be entirely removed).",-1),p=e("p",null,"Theoretically, most image editors are suitable (GIMP, Adobe Photoshop, etc.). If you have a PDF available it's also possible to cut and rotate the pages with Adobe Acrobat DC (Batch Processing). 
However, we advise the user to employ ScanTailor which sustains a considerable data quantity and processes images quickly, efficiently and in a standardized manner. Detailed instructions can be found here.",-1),u=e("p",null,"This step is completely optional and not part of the OCR4all workflow, which is why no support will be provided here. Each user has to decide for himself whether additional preprocessing of this kind would be profitable for his work or even necessary.",-1);function m(t,f,g,b,y,w){return s(),o("div",null,[e("h1",l,[i(n(t.$frontmatter.title)+" ",1),h]),c,d,p,u])}const k=a(r,[["render",m]]);export{v as __pageData,k as default}; +import{_ as a,c as o,j as e,a as i,t as n,o as s}from"./chunks/framework.CI6U-QuP.js";const v=JSON.parse('{"title":"Scan and Image Preparation (ScanTailor)","description":"","frontmatter":{"title":"Scan and Image Preparation (ScanTailor)"},"headers":[],"relativePath":"guide/user-guide/scan-preparation.md","filePath":"guide/user-guide/scan-preparation.md","lastUpdated":1724832369000}'),r={name:"guide/user-guide/scan-preparation.md"},l={id:"frontmatter-title",tabindex:"-1"},h=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),c=e("p",null,"When it comes to early modern prints, the available material exists often solely in the form of facsimilia. Although they generally exhibit a good if not very good quality, their overall condition makes them rather unsuited for a direct export in OCR4all. This is the case when the image file contains, aside the mere text, pictures of the book cover or printing surface. Were those images to be binarized during the workflow, black lines will often occur which are due to contrast differences in the original image and will impair both the OCR and the segmentation. 
Scan rotation and the display of two book pages on the same scan are other, frequent problems.",-1),d=e("p",null,"However, these complications can be easily avoided through the appropriate preparation of the image files. Therefore, scans destined to be processed with OCR4all should ideally only feature the content of each single page meant for the recognition process. At the time, the ideal scan should also contain enough blank page space so as not to impede further steps, such as segmentation. Thus, only the page content deemed irrelevant to the recognition process should be removed while taking care to leave as much of the original scanned page as possible (concretely, this means page margins shouldn’t be entirely removed).",-1),p=e("p",null,"Theoretically, most image editors are suitable (GIMP, Adobe Photoshop, etc.). If you have a PDF available it's also possible to cut and rotate the pages with Adobe Acrobat DC (Batch Processing). However, we advise the user to employ ScanTailor which sustains a considerable data quantity and processes images quickly, efficiently and in a standardized manner. Detailed instructions can be found here.",-1),u=e("p",null,"This step is completely optional and not part of the OCR4all workflow, which is why no support will be provided here. 
Each user has to decide for himself whether additional preprocessing of this kind would be profitable for his work or even necessary.",-1);function m(t,f,g,b,y,w){return s(),o("div",null,[e("h1",l,[i(n(t.$frontmatter.title)+" ",1),h]),c,d,p,u])}const k=a(r,[["render",m]]);export{v as __pageData,k as default}; diff --git a/assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.js b/assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.js similarity index 94% rename from assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.js rename to assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.js index 692499ee..8ac6292b 100644 --- a/assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.js +++ b/assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.js @@ -1 +1 @@ -import{_ as a,c as o,j as e,a as s,t as d,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Set up and folder structure","description":"","frontmatter":{"title":"Set up and folder structure"},"headers":[],"relativePath":"guide/user-guide/setup-and-folder-structure.md","filePath":"guide/user-guide/setup-and-folder-structure.md","lastUpdated":1724226141000}'),i={name:"guide/user-guide/setup-and-folder-structure.md"},n={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=r("

Once OCR4all has been successfully installed, the ‘ocr4all’ folder and its two subfolders, data and models, provide the user with the basic and indispensable folder structure for the processing of printed texts.

data contains all the data the user intends to work on with OCR4all as well as all automatically generated data produced with OCR4all during the workflow. In order to complete the structure, data must contain a title folder, whose name can be freely chosen (whereby umlauts and blanks should be avoided) and which itself contains another subfolder titled input in which the original scans or images must be deposited. As the OCR4all workflow progresses, a processing folder will be automatically generated on the same system level, to which images corresponding to the processing stages of the user’s scans and PAGE XML files will be added.

Additionally, the user can save mixed recognition models in the ‘models’ folder (you will find a selection here). This folder will also contain book-specific models generated with OCR4all, which will be saved in sub folders named after the relevant book/work titles. Once a particular training starts, the generated models will be saved in such models/work_title folders and numbered accordingly, starting with 0.

",3);function h(t,f,p,m,_,g){return l(),o("div",null,[e("h1",n,[s(d(t.$frontmatter.title)+" ",1),c]),u])}const k=a(i,[["render",h]]);export{b as __pageData,k as default}; +import{_ as a,c as o,j as e,a as s,t as d,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Set up and folder structure","description":"","frontmatter":{"title":"Set up and folder structure"},"headers":[],"relativePath":"guide/user-guide/setup-and-folder-structure.md","filePath":"guide/user-guide/setup-and-folder-structure.md","lastUpdated":1724832369000}'),i={name:"guide/user-guide/setup-and-folder-structure.md"},n={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=r("

Once OCR4all has been successfully installed, the ‘ocr4all’ folder and its two subfolders, data and models, provide the user with the basic and indispensable folder structure for the processing of printed texts.

data contains all the data the user intends to work on with OCR4all as well as all automatically generated data produced with OCR4all during the workflow. In order to complete the structure, data must contain a title folder, whose name can be freely chosen (whereby umlauts and blanks should be avoided) and which itself contains another subfolder titled input in which the original scans or images must be deposited. As the OCR4all workflow progresses, a processing folder will be automatically generated on the same system level, to which images corresponding to the processing stages of the user’s scans and PAGE XML files will be added.

Additionally, the user can save mixed recognition models in the ‘models’ folder (you will find a selection here). This folder will also contain book-specific models generated with OCR4all, which will be saved in sub folders named after the relevant book/work titles. Once a particular training starts, the generated models will be saved in such models/work_title folders and numbered accordingly, starting with 0.
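The folder layout described above can also be prepared by a short script. This is a minimal sketch, not part of OCR4all: the function name `create_project_folders` is hypothetical, and the name check is an assumed, stricter reading of the advice to avoid umlauts and blanks (it rejects any non-ASCII character or whitespace).

```python
from pathlib import Path


def create_project_folders(ocr4all_root: Path, title: str) -> Path:
    """Create data/<title>/input and models under ocr4all_root.

    Returns the path of the input folder, where the scans belong.
    """
    # Assumed rule: ASCII only and no whitespace in the title folder name.
    if not title or not title.isascii() or any(ch.isspace() for ch in title):
        raise ValueError(f"unsuitable title folder name: {title!r}")
    input_dir = ocr4all_root / "data" / title / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    (ocr4all_root / "models").mkdir(parents=True, exist_ok=True)
    return input_dir
```

The processing folder is deliberately not created here, since OCR4all generates it automatically as the workflow progresses.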

",3);function h(t,f,p,m,_,g){return l(),o("div",null,[e("h1",n,[s(d(t.$frontmatter.title)+" ",1),c]),u])}const k=a(i,[["render",h]]);export{b as __pageData,k as default}; diff --git a/assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.lean.js b/assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.lean.js similarity index 83% rename from assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.lean.js rename to assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.lean.js index 11fc08c4..b75b2965 100644 --- a/assets/guide_user-guide_setup-and-folder-structure.md.DNL9jmhv.lean.js +++ b/assets/guide_user-guide_setup-and-folder-structure.md.Ci0ktCk8.lean.js @@ -1 +1 @@ -import{_ as a,c as o,j as e,a as s,t as d,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Set up and folder structure","description":"","frontmatter":{"title":"Set up and folder structure"},"headers":[],"relativePath":"guide/user-guide/setup-and-folder-structure.md","filePath":"guide/user-guide/setup-and-folder-structure.md","lastUpdated":1724226141000}'),i={name:"guide/user-guide/setup-and-folder-structure.md"},n={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=r("",3);function h(t,f,p,m,_,g){return l(),o("div",null,[e("h1",n,[s(d(t.$frontmatter.title)+" ",1),c]),u])}const k=a(i,[["render",h]]);export{b as __pageData,k as default}; +import{_ as a,c as o,j as e,a as s,t as d,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const b=JSON.parse('{"title":"Set up and folder structure","description":"","frontmatter":{"title":"Set up and folder 
structure"},"headers":[],"relativePath":"guide/user-guide/setup-and-folder-structure.md","filePath":"guide/user-guide/setup-and-folder-structure.md","lastUpdated":1724832369000}'),i={name:"guide/user-guide/setup-and-folder-structure.md"},n={id:"frontmatter-title",tabindex:"-1"},c=e("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),u=r("",3);function h(t,f,p,m,_,g){return l(),o("div",null,[e("h1",n,[s(d(t.$frontmatter.title)+" ",1),c]),u])}const k=a(i,[["render",h]]);export{b as __pageData,k as default}; diff --git a/assets/guide_user-guide_workflow.md.BxdBwB_I.js b/assets/guide_user-guide_workflow.md.CNo_DpnQ.js similarity index 99% rename from assets/guide_user-guide_workflow.md.BxdBwB_I.js rename to assets/guide_user-guide_workflow.md.CNo_DpnQ.js index 8623fed9..d6384a6f 100644 --- a/assets/guide_user-guide_workflow.md.BxdBwB_I.js +++ b/assets/guide_user-guide_workflow.md.CNo_DpnQ.js @@ -1,2 +1,2 @@ -import{_ as o,c as n,j as t,a,t as s,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const 
d="/images/user-guide/workflow/process_flow.png",c="/images/user-guide/workflow/selection_of_an_appropiate_ocr_model.png",g="/images/user-guide/workflow/individual_lines_with_their_corresponding_ocr_results.png",u="/images/user-guide/workflow/pre-processing_settings.png",h="/images/user-guide/workflow/noise_removal_settings.png",p="/images/user-guide/workflow/LAREX_settings.png",m="/images/user-guide/workflow/LAREX_interface_with_automatic_segmentation_results.png",f="/images/user-guide/workflow/toolbar.png",w="/images/user-guide/workflow/sidebar_settings.png",y="/images/user-guide/workflow/range_options_regions.png",b="/images/user-guide/workflow/layout_regions_display_and_template.png",v="/images/user-guide/workflow/defining_new_layout_regions.png",k="/images/user-guide/workflow/parameters_settings.png",_="/images/user-guide/workflow/settings.png",x="/images/user-guide/workflow/toolbar_reading_order.png",R="/images/user-guide/workflow/auto_generated_results.png",T="/images/user-guide/workflow/region_of_interest.png",e="/images/user-guide/workflow/correcting_a_faulty_typification.png",O="/images/user-guide/workflow/drawing_a_line.png",C="/images/user-guide/workflow/correcting_typification.png",q="/images/user-guide/workflow/determining_reading_order.png",A="/images/user-guide/workflow/saving_segmentation_result.png",I="/images/user-guide/workflow/showing_contours.png",S="/images/user-guide/workflow/selecting_contours.png",P="/images/user-guide/workflow/aggregating_selected_items.png",L="/images/user-guide/workflow/typifying_new_layout_element.png",E="/images/user-guide/workflow/layout_element_contours.png",X="/images/user-guide/workflow/line_segmentation_settings.png",z="/images/user-guide/workflow/selection_of_model_package.png",j="/images/user-guide/workflow/ground_truth_production_text_view.png",G="/images/user-guide/workflow/ground_truth_production_page_view.png",D="/images/user-guide/workflow/evaluation.png",W="/images/user-guide/workflow/settings_for_training
.png",U="/images/user-guide/workflow/adjusting_line_based_reading_order.png",M="/images/user-guide/workflow/result_generation.png",ee=JSON.parse('{"title":"Workflow","description":"","frontmatter":{"title":"Workflow"},"headers":[],"relativePath":"guide/user-guide/workflow.md","filePath":"guide/user-guide/workflow.md","lastUpdated":1724226141000}'),N={name:"guide/user-guide/workflow.md"},F={id:"frontmatter-title",tabindex:"-1"},Y=t("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),B=r('

Process Flow

This variant (main menu ☰ → Process Flow) allows for a virtually automated workflow. It only requires selecting the relevant scans (sidebar on the right) and then selecting the individual processing steps to apply to them (fig. 5).

fig. 6. 'Process flow' Subcomponents.

In order to complete the process, choose an appropriate OCR model (or model package, composed of five individual models working simultaneously and in concert – see chapter 4.7). Simply go to ‘setting’ → ‘recognition’ → ‘general’ (as illustrated in fig. 6) and choose from the list of available OCR models (‘line recognition models’ – ‘available’).

fig. 7. Selection of an appropriate OCR model.

Although it is generally possible to choose more than one recognition model, this is only recommended if the scans in question contain more than one printing type.

Finally, start the ‘process flow’ by clicking on ‘execute’. The current stage of the automated processing is shown in the progress bars and can be checked at any time. Once the workflow has completed, the results can be verified under the menu item ‘ground truth production’ (☰).

fig. 8. Individual lines with their corresponding OCR results.

If the OCR’s line-based results immediately meet the desired or required recognition accuracy, final results can be generated (TXT and/or PAGE XML) under menu item ‘result generation’. If they do not meet the user’s requirements, they can be corrected before the final generation (see chapter 4.8).

Aside from this ‘process flow’, OCR4all also provides a sequential workflow, which enables the user to execute the software’s individual submodules (see fig. 1) and their components independently and thus to check the correctness and quality of the data generated at each step. Since these submodules build on one another, the sequential workflow is usually the more suitable choice when working with early modern prints and their intricate, complex layout.

We recommend first-time users execute the sequential workflow at least once (as described below) in order to understand the submodules’ operating principles.

Preprocessing

Input: original image (in colour, greyscale or binarized)
Output: straightened binarized or greyscale image

  • This processing step is meant to produce binarized and normalized greyscale images, a basic requirement for a successful segmentation and recognition.
  • Select the relevant scans (sidebar on the right) and leave the settings (‘settings (general)’ and ‘settings (advanced)’) unchanged. This means that neither the images’ angle nor the automatically determined number of CPUs used by this submodule is altered (the latter applies to all of OCR4all’s subsequent submodules).

fig. 9. Pre-processing settings.

  • Click on ‘execute’ to start binarization. The progress of this step can be tracked on the console, specifically under ‘console output’. Warnings may be issued during binarization (under ‘console error’); these do not affect the binarization results.
  • To check whether binarization succeeded, go to ‘project overview’, click on any page identifier and then on the display option ‘binary’. In addition, all processed pages should be marked with a green check mark in the project overview.

Noise Removal

Input: polluted binarized images
Output: binarized images without (or with very little) pollution

  • The noise removal option helps to get rid of small impurities such as stains or blotches on the scans.

  • Proceed by clicking on ‘noise removal’ (main menu) and selecting the scans you wish to process on the right side of your display. Keep the default settings at first and, after clicking on ‘execute’, check the quality of the results: click on the name of the scan you wish to verify (right sidebar); the ‘image preview’ option will show a side-by-side comparison of the image before and after noise removal. Please note that the elements shown in red will be deleted by this step.

fig. 10. Noise removal settings.

  • If too many interfering elements remain on the image, slightly adjust the ‘maximal size for removing contours’ factor upwards and repeat the step by clicking once again on ‘execute’ and subsequently reviewing the results.
  • If too many elements were removed from the image, readjust the ‘maximal size…’ factor downwards.
  • Try again until the results are satisfactory.
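
The effect of the size threshold can be pictured with a small sketch. It assumes that "size" means the number of connected ink pixels in a blob; the function name `remove_small_components` and this pixel-count interpretation are illustrative assumptions and not OCR4all's actual implementation.

```python
from collections import deque


def remove_small_components(image, max_size):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every 4-connected ink blob of at most max_size pixels is erased."""
    height = len(image)
    width = len(image[0]) if image else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect all pixels of the blob that contains (y, x).
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small blobs are treated as noise and removed.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# A one-pixel speck (top left) next to a four-pixel "letter".
scan = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
cleaned = remove_small_components(scan, max_size=2)
```

With `max_size=2` only the speck disappears; with `max_size=4` the "letter" is erased as well, which mirrors the advice above to raise the factor when noise remains and lower it when too much is removed.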

Segmentation – LAREX

Input: pre-processed images
Output: structural information about the layout regions (type and position) as well as reading order

LAREX is a segmentation tool which structures and classifies a printed page’s layout with regard to its later processing. LAREX is based on the basic assumption that the pages of early modern prints are composed of a recurring array of layout elements whose composition, although always book-specific, is largely consistent. Thus, the user is provided with different tools and resources whose aim it is to adequately structure and segment a printed page in order to catalogue all layout-related information necessary to the workflow’s subsequent steps. Besides the basic distinction between text and non-text (e.g. text vs. image/woodcut) and its further specifications (e.g. text headline, main text, page number etc.), this also includes information about the page’s reading order, i.e. the reading and usage order of the available layout elements.

Initial Settings

  • Menu: click on ‘segmentation’, then on ‘LAREX’
  • Go to ‘Segmentation image type’: select ‘binary’ if you will be working with binarized images, or ‘despeckled’ if the images went through the noise removal process
  • Click on ‘open LAREX’ → LAREX will open in a new tab

fig. 11. LAREX settings.

Once LAREX has opened, the first one of the pre-selected pages will be visible at the centre of your display, including a few initial segmentation results, which are generated by the automatic segmentation each page undergoes when initially opened with LAREX. Please note that these results are not saved. From there, the user will have to adjust the settings, tailoring the initial segmentation results to their particular work’s layout and undertaking a manual post-correction to ensure segmentation accuracy.

fig. 12. LAREX interface with automatic segmentation results.

Overview and toolbar

The left sidebar displays all previously selected scans. Colour-coded markings visible in the bottom right corner indicate the current stage of each scan’s processing:

  • Orange exclamation mark: “there is no segmentation for this page” – no current segmentation results for this page
  • Orange warning sign: “current segmentation may be unsaved”
  • Green floppy disk: “segmentation was saved for this session” – segmentation results have been saved as an XML file
  • Green padlock: “there is a segmentation for this page on the server” – previously saved segmentation results (see the previous item) have been marked as correct after completion of the entire document’s segmentation (see below).

fig. 13. Different display modes.

  • With the buttons '0' and '1' you can switch between the binarized (black and white) and the normalized (greyscale) display mode. The selection is retained for all remaining pages of the project and can be changed again at any time.
  • In the topbar, you will find different tools and tool categories pertaining to navigation and image processing:

fig. 14. Different menu items in the toolbar.

  • Open a different book: No adjustments are necessary here in the LAREX versions integrated in OCR4all.
  • Image Zoom: Enables general settings for image or scan display, such as zoom options. However, these can also be adjusted with your mouse and/or touchpad: shift the displayed page by left click-and-holding and moving your mouse; zoom using mouse wheel or touchpad.
  • Undo and Redo: Undo or redo the last user action. Common key combinations also work (e.g. CTRL + Z to undo the last action).
  • Delete selected items: Delete currently selected region.
  • RoI, Region, Segment, Order: In addition to the right sidebar, these are the different options for processing and segmenting scans. While the options featured in the toolbar generally pertain to the current scan’s processing (see below), the right sidebar features project-wide options across all scans.

fig. 15. Right sidebar’s settings.

However, the latter can be amended, changed or adjusted at any time. In this case, we recommend saving all settings made so far under ‘settings’, whether they relate to recognition parameters (‘parameters’) or to document-specific layout elements (‘regions’) defined by the user. This ensures that these settings can be applied the next time you work with this tool, so that you can continue working with document-specific settings.

Specific settings: ‘regions’, ‘parameters’, ‘reading order’, ‘settings’

  • 'Regions': In accordance with the LAREX concept, each scan (that is, each book page) is composed of several distinct layout elements, e.g. main text (‘paragraph’), title, marginalia, page number, etc. LAREX therefore requires that a corresponding ‘region’ be assigned to each of these layout elements. This assignment must be performed consistently throughout the entire work, both in preparation for further steps and for the actual recognition of the displayed content. Besides a small number of predefined layout regions, for instance ‘image’ (graphics such as woodcuts and ornate initials), ‘paragraph’ (main text) or ‘page_number’, the user can define and add further book-specific layout regions under ‘create’. The user can change a region’s colour and define the minimum size a textual or graphical page element must have to be recognized as such (under ‘minSize’). The layout region thus defined can be added to the book-specific list by clicking on ‘save’.

fig. 16. Range of options under ‘Regions’.

  • Moreover, the ‘regions’ feature enables the user to assign particular layout regions to a fixed and predefined location on the scan which will then be applied to the following scans. Provided a page’s layout is repeated throughout the entire book, the user can generate something of a layout template in order to improve segmentation and reduce the number of necessary corrections later on. In order to adjust the position of these layout regions to a book’s specific layout, simply display the layout region’s current position and adjust it by selecting the scanned page’s regions.

fig. 17. Layout regions display and template.

Once a new region has been defined, its position on the page can be established by clicking on ‘Region’ → ‘Create a region rectangle (Shortcut: 1)’, an option located in the toolbar. This can be undone or changed at any time. Please note that the category ‘images’ can’t be assigned to a layout region on the page.

fig. 18. Defining new layout regions.

All things considered, it isn’t always advisable to assign fixed positions to all layout regions for an entire book; if the position of certain regions such as chapter titles, mottos, page number or signature marks on the different pages is inconsistent, assigning predefined positions will lead to recognition errors. In this case, manually verifying and correcting these layout elements afterwards is the more practical approach. If the user needs to delete a layout region’s position, they can simply select the region in question and press the ‘delete’ key.

  • 'Parameters': Allows you to define overall parameters for image and text recognition. Taking the time to pre-set certain book-specific parameters is recommended when working with an inconsistent layout, particularly that of early modern prints, which often feature great variation in word and line spacing. The ‘text dilation’ feature lets you control the text’s degree of dilation in the x- and y-direction: a lower value helps prevent a narrowly spaced group of lines from being recognised as one cohesive textual element, while a higher value allows widely spaced passages to be recognised as one cohesive element. We recommend experimenting to find the settings best suited to a particular book.

fig. 19. Parameters settings.
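
To make the effect of dilation in the x- and y-direction more tangible, here is a minimal sketch on a tiny binary grid. It assumes a simple rectangular dilation and 4-connected components; the functions `dilate` and `count_components` are hypothetical helpers written for this illustration and do not reproduce LAREX's internal code.

```python
from collections import deque


def dilate(image, dx, dy):
    """Spread every ink pixel (1) by dx columns and dy rows in each direction."""
    height = len(image)
    width = len(image[0]) if image else 0
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            for ny in range(max(0, y - dy), min(height, y + dy + 1)):
                for nx in range(max(0, x - dx), min(width, x + dx + 1)):
                    result[ny][nx] = 1
    return result


def count_components(image):
    """Count 4-connected groups of ink pixels."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            count += 1
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Two "lines", each made of two "words" separated by a two-pixel gap.
page = [
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
]

words = count_components(page)                    # four separate words
lines = count_components(dilate(page, 1, 0))      # x-dilation joins words into two lines
block = count_components(dilate(page, 1, 1))      # adding y-dilation fuses both lines
```

In this toy page, dilating only in x merges the words of each line while keeping the lines apart, whereas dilating in y as well fuses the two lines into one block. This is the trade-off to keep in mind when tuning the values for narrowly spaced lines.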

  • 'Settings': Under ‘Settings’ you can save the previously selected display and segmentation options and load them again after an interruption in segmentation (buttons ‘save settings’ and ‘load settings’). Saving generates an XML file, which you will need to select when loading the settings (click on ‘load settings’; a new window will open, in which you select and open the file in question). An additional feature enables you to reload a page’s earlier segmentation results if you wish to view them again: go to ‘advanced settings’ and click on ‘load now’. This will load any previously saved XML file containing that page’s segmentation results.

fig. 20. Settings.

  • 'Reading Order': For the correct order of a page’s textual elements to be taken into account in all steps following segmentation, these elements’ ‘reading order’ must be defined beforehand. This can be done automatically, provided the book’s layout is relatively clear and simple. If you are working with a more complex layout structure, we recommend proceeding manually. Select ‘auto generate a reading order’ or ‘set a reading order’ under toolbar item ‘Order’.

fig. 21. Reading order selection in toolbar.

When you click on the auto reading order button, a list of all the page’s textual elements appears in the right sidebar (under ‘reading order’), sorted from top to bottom. If, on the other hand, you wish to establish the reading order manually, click on each of the page’s textual elements in the correct order (see below); the resulting reading order will then appear in the aforementioned list. All elements of the reading order can be rearranged by drag and drop or deleted by clicking on the corresponding recycle bin icon. As with everything in LAREX, the reading order can always be changed before saving the final segmentation results.
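
A top-to-bottom ordering of this kind can be sketched as a simple sort. The `Region` class, its fields and the left-to-right tie-break are assumptions made for this illustration; the guide only states that the automatic order runs from top to bottom.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical layout region, described by its type and top-left corner."""
    kind: str
    x: int
    y: int


def auto_reading_order(regions):
    """Sort regions from top to bottom; ties are broken from left to right."""
    return sorted(regions, key=lambda region: (region.y, region.x))


page_regions = [
    Region("paragraph", x=120, y=300),
    Region("page_number", x=800, y=40),
    Region("heading", x=120, y=150),
]
ordered = [region.kind for region in auto_reading_order(page_regions)]
```

A purely positional sort like this is why a manual order is the safer choice for complex layouts: marginalia or multi-column pages can interleave elements that belong to different text flows.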

Exemplary page segmentation

Each time a page is loaded, LAREX automatically generates segmentation results, which then only need to be corrected. The following exemplary segmentation process uses page 4 of the reference book Cirurgia, which you can download here when downloading the OCR4all folder structure.

Error analysis: Which layout elements were recognised correctly, which incorrectly, and which not at all? Are there any user marks in the margins, borders, spots or elements of text that would influence segmentation but that you do not want to be recognised?

fig. 22. Auto generated results, Cirurgia page 4.

'Region of interest' (RoI): Defining a RoI will help exclude certain sections of your page, situated outside the area later subjected to recognition but which can negatively impact segmentation (such as user marks, impurities, library stamps, etc.). To do so, go to toolbar and click on ‘Set the region of interest’ (under ‘RoI’), then use left click-and-hold to draw a rectangle around the page section you wish to segment.

fig. 23. Defining a 'region of interest'.

Once the RoI has been defined, click on the 'SEGMENT' button (right sidebar): all elements situated outside of the RoI are now excluded from any further steps. The RoI is automatically transposed to all of the book's scans. However, due to a wide array of factors, the page sections relevant to segmentation can shift from scan to scan, so the user will likely need to adjust the RoI from time to time as processing progresses. To do so, click on any RoI section and shift it using the mouse. Independently of the RoI, the 'Create an ignore rectangle' option creates an 'ignore region', which allows certain small sections of a scan to be ignored and thus excluded from segmentation.

Correcting layout recognition flaws: Incorrectly recognized layout elements can be assigned a new type manually: a right click on the element opens a pop-up menu from which you can choose the correct designation.

fig. 24. Correcting a faulty typification.

Should you need to separate a title from another textual element with which it is fused, there are three ways to proceed. First, you can draw a rectangle around the section you wish to classify: go to the toolbar, click on ‘Segment’ and select ‘Create a fixed segment rectangle’ (shortcut: 3); using the mouse, draw a rectangle around the relevant section. A pop-up menu will appear from which its correct designation/type can be chosen. Second, you can use a polygon shape instead. This option is particularly suited to more complex or chaotic layouts and/or those comprising angled edges, rounded pictures and woodcuts, or ornate initials inside the text block. Go to the toolbar, click on ‘Segment’ and this time select the ‘Create a fixed segment polygon’ option (shortcut: 4). Using the mouse, draw a dotted line around the entire relevant section; once the line’s end has been joined to its starting point, creating a polygon, the aforementioned pop-up menu will appear to allow for designation. Third, you can separate a text block that was initially recognized as one paragraph into a title and main text using a cut line: go to the toolbar, click on ‘Segment’ and select the ‘Create a cut line’ option (shortcut: 5).

fig. 25. Toolbar: selecting cut line option.

Using the left mouse button, draw a line through the element you wish to separate, clicking along its path to adjust it as needed; end the line with a double click.

fig. 26. Drawing a line between two layout elements to be separated.

Click on 'Segment' to trigger the separation. Afterwards, the title element can be assigned its correct type using a right click and the pop-up menu (as shown below).

fig. 27. Correcting typification of separated sections.

If at any time you wish to delete layout components, inaccurate cut lines or polygons, etc., simply click on the relevant element and use the ‘Delete’ key or the ‘Delete selected items’ option in the toolbar.

Determining 'Reading Order' (see below):

fig. 28. Determining reading order.

Saving current scan’s segmentation results: Save your segmentation results by clicking on ‘Save results’ or with Ctrl + S. This will automatically generate an XML file containing those results inside the OCR4all folder structure.

fig. 29. Saving segmentation results.

Afterwards, you can proceed to the next scan (left sidebar). If you wish to redo or change a scan’s segmentation, you can do so at any time: simply save the new results, and the previous XML file will automatically be replaced with the new version.

Additional processing options

OCR4all also provides the following scan processing options:

  • When deleting layout elements or joining separate ones to form a single region, you can select all relevant elements simultaneously by pressing and holding the 'Shift' key and drawing a rectangle around the entire region using your mouse. The relevant regions must be located entirely inside the rectangle. Once done, the selected regions will be surrounded by a blue frame.
  • 'Select contours to combine (with 'C') to segments (see function combine)' (shortcut: 6): this tool helps achieve optimal segmentation results even when working with scans featuring a densely packed and detailed print layout. The basic idea is that layout elements are delimited only by the contours of the individual letters/pictures they are composed of, which avoids problems created by manual segmentation, such as excessively broad margins that can in turn hamper OCR performance. To use this feature, click on the relevant button (toolbar) or use shortcut 6. All components of the scan recognized as layout elements will be coloured blue.

fig. 30. Showing contours.

Select individual letters or even parts of letters by clicking on them.

fig. 31. Selecting contours.

You can also apply your selection to a group of letters, entire words or text lines, or sections of a layout element (see above: ‘Shift’ + selection with rectangle). Use shortcut C after selection to include all selected items, be they letters, words, lines, etc., in one new layout element, regardless of the layout region they previously belonged to. This new element’s edges will be far more precise than those of an automatically generated one, enabling a particularly accurate segmentation superior to that of standardised tools.

fig. 32. Aggregating selected items to create new element.

Save the new element by clicking on ‘Segment’. The new element can be renamed as described above.

fig. 33. Typifying new layout element.

  • 'Combine selected segments or contours' (shortcut: C): To combine several distinct layout elements into a new element, select the entire region in question (see above) and click on the corresponding button (toolbar) or use shortcut C.
  • 'Fix/unfix segments, for it to persist a new auto segmentation' (Shortcut: F): This function enables you to fix an element in place so that it persists through subsequent segmentation rounds. Mark the element in question by clicking on it, then use shortcut F or the corresponding button in the toolbar. Fixed, i.e. pinned, elements appear surrounded by a dotted line. To cancel the fixation, simply repeat the operation.
  • Zoom: Use mouse wheel to zoom in and out of display. Use space key to reset display to its original size.
  • When working with a very complex and intricate layout, targeted interventions can help increase the precision and quality of segmentation results. The contours of all layout elements (recognized as such) consist in fact of many individual lines, separated by dots.

fig. 34. Layout element contours.

  • These tiny dots can be moved, individually or in groups, e.g. to avoid collision between different layout elements in a dense setting. Use a left click-and-hold to move a dot, click on the line to create a new dot, use 'delete' key to delete a dot.
  • Load results: a scan’s existing segmentation results will be loaded from the OCR4all folder structure directly into LAREX.

Final steps with LAREX

Once a document’s entire segmentation has been completed with LAREX (i.e. once segmentation results have been saved for all pages), results can be found in the OCR4all folder structure. In order to make sure that results were correctly saved, simply go to menu item ‘post correction’, in the ‘segments’ bar (see below).

Line Segmentation

Input: pre-processed images and segmentation information (in the form of PAGE XML files)
Output: extracted text lines saved in those PAGE XML files

  • This step directly prepares the OCR process: all previously defined and classified layout elements are split into separate text lines (this is a necessary step, as the OCR is based on line recognition). All results are then automatically saved in the corresponding PAGE XML files.

fig. 35. Line segmentation settings.
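
To check outside the interface which text lines ended up in a PAGE XML file, the file can be read with a few lines of Python. The sketch below parses a small hand-written sample; the region and line ids, the image file name and the schema version in the namespace are assumptions for illustration, and the files written by OCR4all may differ in detail.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the PAGE XML format (ids and coordinates are made up).
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0004.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r0" type="paragraph">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r0_l0"><Coords points="100,100 900,100 900,200 100,200"/></TextLine>
      <TextLine id="r0_l1"><Coords points="100,200 900,200 900,300 100,300"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def lines_per_region(page_xml):
    """Map each TextRegion id to the ids of the TextLine elements it contains."""
    root = ET.fromstring(page_xml)
    result = {}
    # "{*}" matches the element in any namespace (Python 3.8+), so the
    # sketch does not depend on one particular PAGE schema version.
    for region in root.findall(".//{*}TextRegion"):
        result[region.get("id")] = [line.get("id") for line in region.findall("{*}TextLine")]
    return result


line_ids = lines_per_region(SAMPLE_PAGE_XML)
```

A region whose list of lines is empty after line segmentation is a quick hint that the step needs to be repeated with adjusted settings for that page.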

  • Generally speaking, all existing settings can be retained. There are, however, a few restrictions when it comes to page layout: if you are working with pages featuring two or more text columns (and if those have been previously defined as separate, individual main text blocks in LAREX), you will need to change the ‘maximum # of whitespace column separators’ which is pre-set at -1.
    • 'Whitespace column separators' are the white columns devoid of text found around text blocks.
    • When working with a two-column layout whose text is continuous (i.e. where the first lines of both columns do not form a semantic unit), you will need to set the ‘maximum # of whitespace column separators’ to 3. This number corresponds to the whitespace on both sides of the columns and to the whitespace situated between them.
    • When working with a three-column layout, set the 'whitespace' number to 4, and so on.
  • Once all desired settings are chosen, click on ‘execute’. Afterwards, check the generated results under ‘Project Overview’.
  • Using the ‘settings (advanced)’ option is especially useful when working with line segmentation, particularly if/when errors are reported (and shown on the interface). For instance, small letters will often fall short of the default minimal line width. You can adjust this minimal width by reducing the ‘minimum scale permitted’, which can be found under menu item ‘limits’. This will enable you to correctly re-do the line segmentation.
  • You can generally check the accuracy of line segmentation by clicking on the ‘lines’ button (under menu item ‘post correction’).
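
The rule for the ‘maximum # of whitespace column separators’ described above (3 for two columns, 4 for three, and so on) amounts to "number of columns plus one". The helper below is a hypothetical illustration of that arithmetic; keeping the pre-set value of -1 for single-column pages is an assumption based on the advice that the default settings can generally be retained.

```python
def max_whitespace_column_separators(text_columns):
    """Suggested setting for a page with the given number of text columns."""
    if text_columns < 1:
        raise ValueError("a page needs at least one text column")
    if text_columns == 1:
        # Single-column pages: keep the pre-set default.
        return -1
    # One whitespace strip on each outer side plus one between each pair of columns.
    return text_columns + 1


suggested = {columns: max_whitespace_column_separators(columns) for columns in (1, 2, 3)}
```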

Recognition

Input: Text lines and one or more OCR models
Output: OCR-output in the form of text for each of the PAGE XML files at hand

  • This step is where the actual text recognition takes place based on the individual lines and textual layout elements identified during line segmentation (see above).
  • Select menu item 'Recognition': the right sidebar lists only those scans (or rather printed pages) of your document for which all OCR pre-processing steps have been completed, i.e. all previously explained steps except 'noise removal'. Select the scans for which you wish to produce an OCR text.
  • Go to 'line recognition models' (under 'available') and select all models or model packages relevant to the typographical recognition of your text (e.g. early modern/historical Gothic type, italic/cursive type, historical Antiqua etc.). We expressly advise the use of a model package, in which five models work simultaneously and interact with each other; this is much preferable to using only one model at a time. You can select all models you wish to add to your package by clicking on each of them; they will automatically be added to the 'selected' category. When dealing with a large number of models, you can find them by using the 'search' function.

fig. 36. Selection of model package for text recognition.

  • You likely won't need to adjust any of the advanced settings.
  • Click on 'execute' and monitor the text recognition progress on the console.
  • Once recognition is finished, you will be able to view all results under menu item 'ground truth production'.
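
This guide does not describe how the five models of a package are combined internally. Purely to illustrate why several models tend to be more robust than one, the toy sketch below lets five hypothetical line predictions vote character by character. It assumes that all predictions have the same length, which real OCR output does not guarantee, and `majority_vote` is an invented name, not an OCR4all function.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine equally long line predictions by picking, at each position,
    the character that most models agree on."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(prediction) != length for prediction in predictions):
        raise ValueError("this toy example requires predictions of equal length")
    voted = []
    for characters in zip(*predictions):
        voted.append(Counter(characters).most_common(1)[0][0])
    return "".join(voted)


# Five hypothetical model outputs for the same text line.
outputs = ["vnd", "und", "vnd", "vnb", "vnd"]
combined = majority_vote(outputs)
```

Here two of the five outputs each contain a different single-character error, and the vote still recovers the reading shared by the majority.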

Ground Truth Production

Input: text line images and their corresponding OCR output when available
Output: line based ground truth

  • Under menu item 'ground truth production' you can view the texts generated during 'recognition', correct them and save them as training data. This is the so-called 'ground truth'.
  • The correction tool used in this step is divided into two parts. On the left-hand side are the (selectable) scans. In the middle, you will find the segmented text line images (see above for workflow) as well as their corresponding OCR text lines, placed directly underneath. We call this standard display 'text view'.

fig. 37. Ground truth production with 'text view'.

Clicking on the 'Switch to page view' button will bring you to the so-called 'page view' display, in which you can work on all text lines while they are displayed in relation to the entire page layout. By clicking 'switch to text view', you will return to the initial 'text view' display.

fig. 38. Ground truth production with 'page view'.

  • On the right-hand side of the display, you will find the virtual keyboard, with which you can insert special characters such as ligatures, abbreviations, diacritical signs etc. Simply place your cursor where you wish to insert a special character and then click on that character in the virtual keyboard. To add new characters to the virtual keyboard, click on the plus icon, paste the character into the blank field and click on 'save'. If you wish to delete a character from the virtual keyboard, drag and drop it onto the recycle bin icon. Once all necessary/desired changes have been made, click on 'save' and 'lock'. The buttons 'load' and 'save' enable you to save different virtual keyboards specific to particular documents. Once a virtual keyboard has been saved, it can be reloaded at any time, which is particularly useful when you need to interrupt correction or want to use the keyboard for another document for which it is suited.
  • In order to correct individual lines in 'text view' mode, click on the line in question: you can now correct and edit it. (When working with 'page view', you will need to click on the line you wish to edit first, after which a text field will appear in which you can make your corrections/edits.) Use the 'tabulator' key to go to the next line, and so on. All following steps are identical in both views. Once a text line has been completely and satisfactorily corrected, press the 'enter' key. The line will be coloured green, meaning it will automatically be saved as 'ground truth' in OCR4all once the entire page has been completed and saved (by clicking on 'save result' or using the shortcut Ctrl + S). Once a line has been identified as ground truth, it can be used as a basis for OCR training as well as to evaluate the OCR model you used.
  • If there are erroneously recognised text line images among your pairs of text line images and corresponding OCR text lines, please leave the corresponding OCR text lines empty so that they do not cause problems during OCR model training.
  • If you conclude, while working on ground truth production, that the quality of the text recognition achieved with mixed models is not satisfactory, you can train a model targeted at the specific kind of document you are working on, using your corrected lines as a basis. This step will generally increase the recognition quality.

Evaluation

Input: line based OCR texts and corresponding ground truth
Output: error statistics

  • Under menu item 'evaluation', users can check the recognition rate of the model(s) currently in use.

  • In order to generate an evaluation, go to right sidebar and select all the scans recognized with the help of said model and subsequently corrected during 'ground truth production'.

  • Click on 'execute': a table will appear in the console. At the top, you will see the percentage of errors as well as the total error count ('errs'). All identified errors are listed underneath, in a table comparing the initially recognized text ('PRED', right-hand column) with the results of ground truth production ('GT', left-hand column). Next to each error item, you will see the frequency of that particular type of error as well as its percentage of the entire error count.
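
The kind of statistics shown in this table can be reproduced in a few lines for a handful of GT/PRED line pairs. The sketch below aligns each pair with a standard edit-distance computation and counts the differing character pairs; the function names and the sample lines are made up for illustration, and OCR4all's own evaluation may count errors differently.

```python
from collections import Counter


def align(gt, pred):
    """Return a minimal-edit alignment as (gt_char, pred_char) pairs;
    an empty string marks a missing character on that side."""
    n, m = len(gt), len(pred)
    # dist[i][j] is the edit distance between gt[:i] and pred[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt[i - 1] == pred[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + cost)
    # Walk back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (gt[i - 1] != pred[j - 1]):
            pairs.append((gt[i - 1], pred[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((gt[i - 1], ""))
            i -= 1
        else:
            pairs.append(("", pred[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def evaluate(line_pairs):
    """Return (error rate, error count, ten most frequent GT/PRED confusions)."""
    errors = Counter()
    total_chars = 0
    for gt, pred in line_pairs:
        total_chars += len(gt)
        for gt_char, pred_char in align(gt, pred):
            if gt_char != pred_char:
                errors[(gt_char, pred_char)] += 1
    error_count = sum(errors.values())
    rate = error_count / total_chars if total_chars else 0.0
    return rate, error_count, errors.most_common(10)


# Made-up (GT, PRED) pairs: one substitution each, plus one missing character.
sample = [("vnd", "und"), ("daſs", "dass"), ("haus", "hau")]
rate, errs, top_errors = evaluate(sample)
```

For the three sample lines this yields 3 errors over 11 ground truth characters; each entry of `top_errors` pairs a GT character with what was predicted instead, together with its frequency.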

Evaluation results with general error rate, ten most frequent errors as well as their percentage
Process Flow

This variant (main menu ☰ → Process Flow) allows for a virtually automated workflow. It merely requires an initial selection of the intended scans (sidebar on the right) and a subsequent selection of the individual processing steps the user wishes to apply to the chosen data (fig. 6).

'Process flow' Subcomponents. fig. 6. 'Process flow' Subcomponents.

In order to complete the process, choose an appropriate OCR model (or model package, composed of five individual models working simultaneously and in concert – see chapter 4.7). Simply go to ‘setting’ → ‘recognition’ → ‘general’ (as illustrated in fig. 7) and choose from the list of available OCR models (‘line recognition models’ – ‘available’).

Selection of an appropriate OCR model. fig. 7. Selection of an appropriate OCR model.

Although it is generally possible to choose more than one recognition model, this is only recommended if the scans in question contain more than one printing type.

Finally, start the ‘process flow’ by clicking on ‘execute’. The current stage of this automated processing is shown in the progress bars and can be reviewed at any time. After the workflow’s completion, the results can be verified under the menu item ‘ground truth production’ (☰).

Individual lines with their corresponding OCR results. fig. 8. Individual lines with their corresponding OCR results.

If the OCR’s line-based results immediately meet the desired or required accuracy of recognition, final results can be generated (TXT and / or PAGE XML) under menu item ‘result generation’. Were those results not to meet the user’s requirements, they can be once more corrected before the final generation (see chapter 4.8).

Aside from this ‘process flow’, OCR4all additionally provides the option of a sequential workflow which enables the user to independently execute the software’s individual submodules (see fig. 1) and their components, thus ensuring the correctness and quality of the generated data. Considering that these submodules build on one another, the sequential workflow is usually the most adequate choice when working with early modern prints and their intricate, complex layout.

We recommend first-time users execute the sequential workflow at least once (as described below) in order to understand the submodules’ operating principles.

Preprocessing

Input: original image (in colour, greyscale or binarized)
Output: straightened binarized or greyscale image

  • This processing step is meant to produce binarized and normalized greyscale images, a basic requirement for a successful segmentation and recognition.
  • Proceed by selecting the relevant scans (sidebar on the right). Leave the settings (‘settings (general)’ and ‘settings (advanced)’) unchanged: neither the images’ angle nor the automatically determined number of CPUs used by this particular submodule needs to be adjusted (the latter applies to all of OCR4all’s subsequent submodules).

Pre-processing settings. fig. 9. Pre-processing settings.

  • Click on ‘execute’ to start binarization. The progress of this work stage can be tracked on the console, more precisely in the ‘console output’. Warnings might be issued during the binarization process (in ‘console error’); these have no effect on the binarization results.
  • In order to check the binarization’s success, simply go to ‘project overview’ and click on any page identifier then on the display option ‘binary’. In addition, all processed pages should be marked with a green check mark in the project overview.

Noise Removal

Input: polluted binarized images
Output: binarized images without (or with very little) pollution

The noise removal option helps to get rid of small impurities such as stains or blotches on the scans.

  • Proceed by clicking on ‘noise removal’ (main menu) and selecting the scans you wish to process on the right side of your display. You should initially conserve the default settings and, after clicking on ‘execute’, check the initial quality of the results: simply click on the designation of the scan you wish to verify (right sidebar); the ‘image preview’ option will provide you with a side by side comparison of the image before and after the noise removal. Please note that red elements will be deleted by this step.

Noise removal settings. fig. 10. Noise removal settings.

  • If too many interfering elements remain on the image, slightly adjust the ‘maximal size for removing contours’ factor upwards and repeat the step by clicking once again on ‘execute’ and subsequently reviewing the results.
  • If too many elements were removed from the image, readjust the ‘maximal size…’ factor downwards.
  • Try again until the results are satisfactory.
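
The effect of the ‘maximal size for removing contours’ setting can be illustrated with a small sketch. This is not OCR4all's implementation: the function name `remove_small_components`, the nested-list image representation and the assumption that size is counted in pixels are all hypothetical and serve only to show why raising the threshold removes more elements and lowering it removes fewer.

```python
from collections import deque


def remove_small_components(image, max_size):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every 4-connected ink component of at most `max_size` pixels is erased.

    Hypothetical illustration only; the real setting may measure size differently.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected component with a breadth-first search.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Small components are treated as noise and erased.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


page = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cleaned = remove_small_components(page, max_size=1)
```

With `max_size=1` only the isolated pixel in the top-left corner is erased; with `max_size=4` the four-pixel block would be removed as well, which corresponds to the "too many elements were removed" case described above.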

Segmentation – LAREX

Input: pre-processed images
Output: structural information about the layout regions (type and position) as well as reading order

LAREX is a segmentation tool which structures and classifies a printed page’s layout with regard to its later processing. LAREX is based on the basic assumption that the pages of early modern prints are composed of a recurring array of layout elements whose composition, although always book-specific, is largely consistent. Thus, the user is provided with different tools and resources whose aim it is to adequately structure and segment a printed page in order to catalogue all layout-related information necessary to the workflow’s subsequent steps. Besides the basic distinction between text and non-text (e.g. text vs. image/woodcut) and its further specifications (e.g. text headline, main text, page number etc.), this also includes information about the page’s reading order, i.e. the reading and usage order of the available layout elements.

Initial Settings

  • Menu: click on ‘segmentation’, then on ‘LAREX’
  • Go to ‘Segmentation image type’: select ‘binary’ if you will be working with binarized images, or ‘despeckled’ if the images went through the noise removal process
  • Click on ‘open LAREX’ → LAREX will open in a new tab

LAREX settings. fig. 11: LAREX settings.

Once LAREX has opened, the first one of the pre-selected pages will be visible at the centre of your display, including a few initial segmentation results, which are generated by the automatic segmentation each page undergoes when initially opened with LAREX. Please note that these results are not saved. From there, the user will have to adjust the settings, tailoring the initial segmentation results to their particular work’s layout and undertaking a manual post-correction to ensure segmentation accuracy.

LAREX interface with automatic segmentation results. fig. 12. LAREX interface with automatic segmentation results.

Overview and toolbar

The left sidebar displays all previously selected scans. Colour-coded markings visible in the bottom right corner indicate the current stage of each scan’s processing:

  • Orange exclamation mark: “there is no segmentation for this page” – no current segmentation results for this page
  • Orange warning sign: “current segmentation may be unsaved”
  • Green floppy disk: “segmentation was saved for this session” – segmentation results have been saved as an XML file
  • Green padlock: “there is a segmentation for this page on the server” – individual previously saved segmentation results (see the green floppy disk above) have been marked as correct after completion of the entire document’s segmentation (see below).

fig. 13. Different display modes.

  • With the buttons '0' and '1' it is possible to switch between the binarized (black and white) and the normalized (grayscale) display mode. This selection is noted for all remaining pages of the project. It is possible to change the display mode again at any time.
  • In the topbar, you will find different tools and tool categories pertaining to navigation and image processing:

Different menu items in the toolbar. fig. 14. Different menu items in the toolbar.

  • Open a different book: No settings adjustments necessary for all LAREX versions as integrated in OCR4all.
  • Image Zoom: Enables general settings for image or scan display, such as zoom options. However, these can also be adjusted with your mouse and/or touchpad: shift the displayed page by left click-and-holding and moving your mouse; zoom using mouse wheel or touchpad.
  • Undo and Redo: Undo or redo the last user action. Common key combinations also work (e.g. CTRL + Z to undo the last action).
  • Delete selected items: Delete currently selected region.
  • RoI, Region, Segment, Order: In addition to the right sidebar, these are the different options for processing and segmenting scans. While the options featured in the toolbar generally pertain to the current scan’s processing (see below), the right sidebar features project-wide options across all scans.

Right sidebar’s settings. fig. 15. Right sidebar’s settings.

However, the latter can be amended, changed or adjusted at any time. In this case, we recommend saving all previously carried-out settings, whether they be related to recognition parameters (‘parameters’) or to document-specific layout elements (‘regions’) previously determined by the user, in ‘settings’. This will ensure these particular settings are applied the next time you work with this tool, enabling you to work with document-specific settings from then on.

Specific settings: ‘regions’, ‘parameters’, ‘reading order’, ‘settings’

  • 'Regions': In accordance with the LAREX concept, each scan (that is, each book page) is composed of several, distinct layout elements, e.g. main text (‘paragraph’), title, marginalia, page number, etc. Thus, LAREX requires that corresponding ‘regions’ be assigned to each of these layout elements. This assigning task must be consistently performed throughout the entire work, in preparation for further steps as well as for the actual recognition of the displayed content! Besides a small number of pre-set and predefined layout regions – for instance ‘image’ (graphics such as woodcuts and ornate initials), ‘paragraph’ (main text) or ‘page_number’ – the user can define and add further book-specific layout regions under ‘create’. Not only can the user change a region’s colour, but they can also define the minimum size of a textual/graphical page element which they wish to recognize as such (under ‘minSize’). The layout region thus defined can be added to the book-specific list by clicking on ‘save’.

Range of options under ‘Regions’. fig. 16. Range of options under ‘Regions’.

  • Moreover, the ‘regions’ feature enables the user to assign particular layout regions to a fixed and predefined location on the scan which will then be applied to the following scans. Provided a page’s layout is repeated throughout the entire book, the user can generate something of a layout template in order to improve segmentation and reduce the number of necessary corrections later on. In order to adjust the position of these layout regions to a book’s specific layout, simply display the layout region’s current position and adjust it by selecting the scanned page’s regions.

Layout regions display and template. fig. 17. Layout regions display and template.

Once a new region has been defined, its position on the page can be established by clicking on ‘Region’ → ‘Create a region rectangle (Shortcut: 1)’, an option located in the toolbar. This can be undone or changed at any time. Please note that the category ‘images’ can’t be assigned to a layout region on the page.

Defining new layout regions. fig. 18. Defining new layout regions.

All things considered, it isn’t always advisable to assign fixed positions to all layout regions for an entire book; if the position of certain regions such as chapter titles, mottos, page number or signature marks on the different pages is inconsistent, assigning predefined positions will lead to recognition errors. In this case, manually verifying and correcting these layout elements afterwards is the more practical approach. If the user needs to delete a layout region’s position, they can simply select the region in question and press the ‘delete’ key.

  • 'Parameters': Allows you to define overall parameters of image and text recognition. Taking the time to pre-set certain book-specific parameters is recommended when working with an inconsistent layout, particularly that of early modern prints. These often feature great divergences of word and line spacing. To prevent a narrowly spaced group of lines from being recognised as one cohesive textual element, the ‘text dilation’ feature enables you to control and define the text’s degree of dilation in the x- and y-direction. This will enable the software to correctly handle word/line spacing that is originally too close, or to recognise widely spaced passages as one cohesive element. We recommend experimenting in order to find the settings best suited to a particular book.

Parameters settings. fig. 19. Parameters settings.

  • 'Settings': Under ‘Settings’ you can save the previously selected displaying and segmentation options as well as loading them anew after an interruption in segmentation (buttons ‘save settings’ and ‘load settings’). Saving will generate an XML file which you will need to select when loading the settings (click on ‘load settings’, a new window will open; select file in question and open it). An additional feature will enable you to re-load previous pages’ segmentation results if you wish to view them again: simply go to ‘advanced settings’ and click on ‘load now’. This will load any previously saved XML file containing that page’s segmentation results.

Settings. fig. 20: Settings.

  • 'Reading Order': In order for the correct order of a page’s textual elements to be taken into account in all steps following segmentation, it is necessary to define these elements’ ‘reading order’ beforehand. This can be done automatically provided a book’s layout be relatively clear and simple. However, should you be working with a more complex layout structure, we recommend you proceed manually. Simply select ‘auto generate a reading order’ or ‘set a reading order’ under toolbar item ‘Order’.

Reading order selection in toolbar fig. 21. Reading order selection in toolbar

By clicking on the auto reading order button, a list of all the page’s textual elements will appear in the right sidebar (under ‘reading order’), sorted from top to bottom. On the other hand, if you wish to manually establish reading order, you will need to click on each of the page’s textual elements, in the correct order (see below), after which this reading order will appear in the aforementioned list. All elements of the reading order can be rearranged with a drag-and-drop or deleted by clicking on the corresponding recycle bin icon. As with everything in LAREX, the reading order can always be changed before saving the final segmentation results.
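
The top-to-bottom ordering described above can be sketched as follows. This is only an illustration, not LAREX's actual code: the dictionary keys `id`, `top` and `left` as well as the left-to-right tie-break are assumptions made for this example.

```python
def auto_reading_order(regions):
    """Return region ids sorted from top to bottom.

    `regions` is a list of dicts with the hypothetical keys 'id', 'top' and
    'left' (pixel coordinates of the region's upper-left corner). Regions that
    start at the same height are ordered left to right, which is an assumption.
    """
    ordered = sorted(regions, key=lambda region: (region["top"], region["left"]))
    return [region["id"] for region in ordered]


regions = [
    {"id": "paragraph_1", "top": 120, "left": 40},
    {"id": "page_number", "top": 900, "left": 300},
    {"id": "heading", "top": 30, "left": 40},
]
order = auto_reading_order(regions)
```

For multi-column or otherwise complex layouts such a simple sort will interleave elements incorrectly, which is why the manual reading order is recommended in those cases.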

Exemplary page segmentation

With each page loading, LAREX automatically generates segmentation results – these only need to be subsequently corrected. The following, exemplary segmentation process uses page 4 of reference book Cirurgia, which you can download here when downloading the OCR4all folder structure.

Error analysis: Which layout elements were correctly recognised, which incorrectly, and which weren’t recognised at all? Are there any user marks in the margins, borders, spots or elements of text which will influence segmentation but which you do not wish to be recognised?

Auto generated results, Cirurgia page 4. fig. 22. Auto generated results, Cirurgia page 4.

'Region of interest' (RoI): Defining a RoI will help exclude certain sections of your page, situated outside the area later subjected to recognition but which can negatively impact segmentation (such as user marks, impurities, library stamps, etc.). To do so, go to toolbar and click on ‘Set the region of interest’ (under ‘RoI’), then use left click-and-hold to draw a rectangle around the page section you wish to segment.

Defining a 'region of interest'. fig. 23. Defining a 'region of interest'.

Once the RoI has been defined, click on the 'SEGMENT' button (right sidebar) – all elements situated outside of the RoI are now excluded from any further steps. The RoI will also be automatically transposed to all the book's scans. However, due to a wide array of factors, the page sections relevant to segmentation can shift from scan to scan. Therefore, as processing progresses, the user will likely need to adjust the RoI from time to time. To do so, simply click on any RoI section and shift it using the mouse. Independently of the RoI, the 'Create an ignore rectangle' option creates an 'ignore region' which allows for certain small sections of a scan to be ignored and thus excluded from segmentation.

Correcting layout recognition flaws: Incorrectly recognized layout elements can be assigned a new typification manually: a right-hand click on said element will open a pop-up menu from which you can choose the correct designation.

Correcting a faulty typification. fig. 24. Correcting a faulty typification.

Should you need to separate a title from another textual element with which it is fused, there are three ways to proceed. First, you can draw a rectangle around the section you wish to classify: proceed to the toolbar, click on ‘Segment’ and select ‘Create a fixed segment rectangle’ (shortcut: 3); using the mouse, draw a rectangle around the relevant section – a pop-up menu will appear from which its correct designation/type can be chosen. Second, you can instead choose to use a polygon shape. This option is particularly suited to more complex or chaotic layouts and/or those comprising angled edges, rounded pictures and woodcuts, or ornate initials inside the text block. Proceed to the toolbar, click on ‘Segment’, this time selecting the ‘Create a fixed segment polygon’ option (shortcut: 4). Using the mouse, draw a dotted line to encompass the entire relevant section – once the line’s end has been joined to its starting point, creating a polygon, the aforementioned pop-up menu will appear to allow for designation. Finally, you can also separate a text block – initially recognized as one paragraph – into a title and main text using a cutting line: simply go to the toolbar and ‘Segment’, and select the ‘Create a cut line’ option (shortcut: 5).

Correcting a faulty typification. fig. 25. Toolbar: selecting cut line option.

Using left mouse key, create a line through the element you wish to separate, clicking along its path to adjust it as needed; end line with a double click.

Drawing a line between two layout elements to be separated. fig. 26. Drawing a line between two layout elements to be separated.

Click on 'Segment' in order to prompt separation. Afterwards, title element can be correctly renamed, using right-hand click and pop-up menu (as shown below).

Correcting typification of separated sections. fig. 27. Correcting typification of separated sections.

If at any time you wish to delete layout components, inaccurate cutting lines or polygons, etc., simply click on the relevant element and use the ‘Delete’ key or the ‘Delete selected items’ option in the toolbar.

Determining 'Reading Order' (see below):

Determining reading order. fig. 28. Determining reading order.

Saving current scan’s segmentation results: Save your segmentation results by clicking on ‘Save results’ or with Ctrl + S. This will automatically generate an XML file containing those results inside the OCR4all folder structure.

Saving segmentation results. fig. 29. Saving segmentation results.

Afterwards, you can proceed to the next scan (left sidebar). If you wish to redo or change a scan’s segmentation, you can do as much at any time: simply save the new results – the previous XML file will be automatically deleted and replaced with a new version.

Additional processing options

OCR4all also provides the following scan processing options:

  • While deleting layout elements or joining separate ones to form one, single region, you can select all relevant elements simultaneously by pressing and holding 'Shift' key and drawing a rectangle around the entire region using your mouse. Relevant regions must be located entirely inside the rectangle. Once done, selected region will be surrounded by a blue frame.
  • 'Select contours to combine (with 'C') to segments (see function combine)' (shortcut: 6): this tool is perfect for reaching optimal segmentation results even when working with scans featuring a densely packed and detailed print layout. The basic idea is that layout elements only be delimited by the contours of the individual letters/pictures they are composed of, thus solving the problems created by manual segmentation such as excessively broad margins, which can in turn hamper the OCR performance. To use this feature, click on the relevant button (toolbar) or use shortcut 6. All components of the scan recognized as layout elements will be coloured blue.

Showing contours. fig. 30. Showing contours.

Select individual letters or even parts of letter by clicking on them.

Selecting contours. fig. 31. Selecting contours.

You can also apply your selection to an individual group of letters, entire words or text lines, or sections of a layout element (see above: ‘Shift’ + selection with rectangle). Use shortcut C after selection in order to include all selected items – be they letters, words, lines, etc. – in one new layout element, regardless of the layout region they had previously belonged to. This new element’s edges will be far more precise than those of an automatically generated one, enabling a particularly accurate segmentation superior to that of standardised tools.

Aggregating selected items to create new element. fig. 32. Aggregating selected items to create new element.

Save new element by clicking on ‘Segment’. New element can be renamed as described above.

Typifying new layout element. fig. 33. Typifying new layout element.

  • 'Combine selected segments or contours' (shortcut: C): In order to combine several, distinct layout elements into a new element, select the entire region in question (see above) and click on corresponding button (toolbar) or use shortcut C.
  • 'Fix/unfix segments, for it to persist a new auto segmentation' (Shortcut: F): This function enables you to fix an element in one place beyond your next segmentation rounds. Mark element in question by clicking on it, then use shortcut F or corresponding button in toolbar. Fixed, i. e. pinned elements will appear surrounded by a dotted line. If you wish to cancel fixation, simply repeat the operation.
  • Zoom: Use mouse wheel to zoom in and out of display. Use space key to reset display to its original size.
  • When working with a very complex and intricate layout, targeted interventions can help increase the precision and quality of segmentation results. The contours of all layout elements (recognized as such) consist in fact of many individual lines, separated by dots.

Layout element contours. fig. 34. Layout element contours.

  • These tiny dots can be moved, individually or in groups, e.g. to avoid collision between different layout elements in a dense setting. Use a left click-and-hold to move a dot, click on the line to create a new dot, use 'delete' key to delete a dot.
  • Load results: a scan’s existing segmentation results will be sourced from OCR4all folder structure and directly loaded to LAREX.

Final steps with LAREX

Once a document’s entire segmentation has been completed with LAREX (i.e. once segmentation results have been saved for all pages), results can be found in the OCR4all folder structure. In order to make sure that results were correctly saved, simply go to menu item ‘post correction’, in the ‘segments’ bar (see below).

Line Segmentation

Input: pre-processed images and segmentation information (in the form of PAGE XML files)
Output: extracted text lines saved in those PAGE XML files

  • This step constitutes a direct preparation for the OCR process and features the dissection of all previously defined and classified layout elements into separate text lines (this is a necessary step as the OCR is based on line recognition). All results are then automatically saved in the corresponding PAGE XML files.

Line segmentation settings. fig. 35. Line segmentation settings.

  • Generally speaking, all existing settings can be retained. There are, however, a few restrictions when it comes to page layout: if you are working with pages featuring two or more text columns (and if those have been previously defined as separate, individual main text blocks in LAREX), you will need to change the ‘maximum # of whitespace column separators’ which is pre-set at -1.
    • 'Whitespace column separators' are the white columns devoid of text found around text blocks.
    • When working with a two-column layout whose text is continuous (i.e. where the first line of both columns don’t form a semantic unit), you will need to set the ‘maximum # of whitespace column separators’ at 3. This number corresponds to the whitespace on both sides of the columns and to the whitespace situated between them.
    • When working with a three-column layout, set the 'whitespace' number to 4, and so on.
  • Once all desired settings are chosen, click on ‘execute’. Afterwards, check the generated results under ‘Project Overview’.
  • Using the ‘settings (advanced)’ option is especially useful when working with line segmentation, particularly if/when errors are reported (and shown on the interface). For instance, small letters will often fall short of the default minimal line width. You can adjust this minimal width by reducing the ‘minimum scale permitted’, which can be found under menu item ‘limits’. This will enable you to correctly re-do the line segmentation.
  • You can generally check the accuracy of line segmentation by clicking on the ‘lines’ button (under menu item ‘post correction’).
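
The rule for the ‘maximum # of whitespace column separators’ given above (3 for two columns, 4 for three columns, and so on) amounts to one separator more than the number of columns. A minimal sketch, assuming that single-column pages simply keep the pre-set value of -1; the function name is hypothetical:

```python
def max_whitespace_column_separators(num_columns: int) -> int:
    """Suggest a value for 'maximum # of whitespace column separators'.

    For n >= 2 text columns there are n + 1 whitespace separators: one on each
    outer side of the columns plus one between each pair of neighbouring columns.
    Single-column pages keep the pre-set value of -1 (assumption for this sketch).
    """
    if num_columns < 1:
        raise ValueError("a page needs at least one text column")
    if num_columns == 1:
        return -1
    return num_columns + 1


two_column_value = max_whitespace_column_separators(2)
three_column_value = max_whitespace_column_separators(3)
```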

Recognition

Input: Text lines and one or more OCR models
Output: OCR-output in the form of text for each of the PAGE XML files at hand

  • This step is where the actual text recognition takes place based on the individual lines and textual layout elements identified during line segmentation (see above).
  • Select menu item 'Recognition': in the right sidebar, you will only find your document's scans (or rather printed pages) for which all OCR pre-processing steps have been completed, by which we mean all previously explained steps - bar 'noise removal'. Please select the scans for which you wish to produce an OCR text.
  • Go to 'line recognition models' (under 'available') and select all models or model packages relevant to the typographical recognition of your text (e.g. early modern/historical Gothic type, italic/cursive type, historical Antiqua etc.). We expressly advise the use of a model package, where five models simultaneously work and interact with each other! This is much preferable to using only one model at a time. You can select all models you wish to add to your package by clicking on each of them - they will automatically be added to the 'selected' category. When dealing with a large number of models, you can find them by using the 'search' function.

Selection of model package for text recognition. fig. 36. Selection of model package for text recognition.

  • You likely won't need to adjust any of the advanced settings.
  • Click on 'execute' and oversee the text recognition progress on the console.
  • Once recognition is finished, you will be able to view all results under menu item 'ground truth production'.
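
Since the recognition output is stored as text inside the PAGE XML files, it can also be inspected outside the interface. The following sketch reads the text of each line from a PAGE XML document using Python's standard library. It assumes the usual PAGE structure (`TextLine` elements containing `TextEquiv`/`Unicode`) and simply takes the first `TextEquiv` of each line; the sample document and the function name are made up for illustration and are not taken from OCR4all.

```python
import xml.etree.ElementTree as ET

# Made-up minimal PAGE XML document used only for this illustration.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <TextEquiv><Unicode>Das erste capitel</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2"/>
    </TextRegion>
  </Page>
</PcGts>
"""


def read_line_texts(page_xml: str):
    """Return (line id, text) pairs for every TextLine in a PAGE XML string.

    The '{*}' namespace wildcard requires Python 3.8 or newer and keeps the
    sketch independent of the exact PAGE schema version. Lines without any
    text yield an empty string.
    """
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.findall(".//{*}TextLine"):
        unicode_element = line.find("{*}TextEquiv/{*}Unicode")
        text = ""
        if unicode_element is not None and unicode_element.text:
            text = unicode_element.text
        lines.append((line.get("id"), text))
    return lines


line_texts = read_line_texts(SAMPLE_PAGE_XML)
```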

Ground Truth Production

Input: text line images and their corresponding OCR output when available
Output: line based ground truth

  • Under menu item 'ground truth production' you will be able to view the texts generated during 'recognition', correct them and save them as a training model. This is the so called 'ground truth'.
  • The correction tool used in this step is divided into three parts. On the left-hand side are the (selectable) scans. In the middle, you will find the segmented text line images (see above for workflow) as well as their corresponding OCR text lines, placed directly underneath. We call this standard display 'text view'.

Ground truth production with 'text view'. fig. 37. Ground truth production with 'text view'.

Clicking on the 'Switch to page view' button will bring you to the so called 'page view' display, in which you can work on all text lines while they are displayed in relation to the entire page layout. By clicking 'switch to text view', you will return to the initial 'text view' display.

Ground truth production with 'page view'. fig. 38. Ground truth production with 'page view'.

  • On the right-hand side of the display, you will find the virtual keyboard, with which you can insert special characters such as ligatures, abbreviations, diacritical signs etc. Simply place your cursor where you wish to insert a special character and then click on said character in the virtual keyboard. In order to add new characters to the virtual keyboard, simply click on the plus icon, add the character through copy and paste in the blank field and click on 'save'. If you wish to delete characters from the virtual keyboard, drag and drop said character onto the recycle bin icon. Once all necessary/desired changes have been made, click on 'save' and 'lock'. Using the buttons 'load' and 'save' will ultimately enable you to save different virtual keyboards specific to any particular document. Once a virtual keyboard has been saved as such, it can be re-loaded at any time, which is particularly useful when you need to interrupt correction - or if you want to use this keyboard for another document for which it is suited.
  • In order to correct individual lines in 'text view' mode, click on the line in question: you can now correct and edit it. (When working with 'page view', you will need to click on the line you wish to edit first, after which a text field will appear in which you can make your corrections/edits.) Use the 'tabulator' key to go to the next line, and so on. All following steps are identical in both views. Once a text line has been completely and satisfactorily corrected, press the 'enter' key. The line will be coloured green, meaning it will be automatically saved as 'ground truth' in OCR4all once the entire page has been completed and saved (by clicking on 'save result' or using the shortcut Ctrl + S). Once a line has been identified as ground truth, it can be used as a basis for OCR training as well as to evaluate the OCR model you used.
  • If there are erroneously recognised text line images among your pairs of text line images and corresponding OCR text lines, please leave the corresponding OCR text lines empty so that they do not cause problems during OCR model training.
  • Were you to conclude, while working on ground truth production, that the quality of the text recognition achieved with mixed models wasn't satisfactory, you can always perform a final, manual text correction by employing a training model targeted towards the specific kind of document you are working on. Proceeding to this step will generally increase the recognition quality and percentage.

Evaluation

Input: line based OCR texts and corresponding ground truth
Output: error statistics

  • Under menu item 'evaluation', users can check on the recognition rate of the model(s) currently under use.

  • In order to generate an evaluation, go to right sidebar and select all the scans recognized with the help of said model and subsequently corrected during 'ground truth production'.

  • Click on 'execute': a chart will appear in the console. At the top, you will see the percentage of errors as well as the full count of errors ('errs'). All identified errors are listed underneath, displayed as a chart featuring the comparison between the initially recognized text ('PRED', righthand column) and the results of ground truth production ('GT', lefthand column). Behind each error item, you will see the frequency of that particular type of error as well as its percentage compared to the entire error count.

fig. 39. Evaluation results with general error rate, ten most frequent errors as well as their percentage compared to entire error count.

  • Based on this chart and the resulting recognition rate (100% minus the error rate), users can judge whether a new training round using individual, targeted models is necessary.
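The relationship between error count, error rate and recognition rate can be illustrated with a short sketch. It assumes that errors are counted as character-level edit operations between the ground truth ('GT') and the recognized text ('PRED'), divided by the number of ground truth characters; the exact metric used by the OCR4all evaluation module is not specified here, and the function names `levenshtein` and `evaluate` are hypothetical.

```python
def levenshtein(gt: str, pred: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn `pred` into `gt` (classic dynamic programming)."""
    previous = list(range(len(pred) + 1))
    for i, gt_char in enumerate(gt, start=1):
        current = [i]
        for j, pred_char in enumerate(pred, start=1):
            cost = 0 if gt_char == pred_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def evaluate(pairs):
    """Return (error count, error rate) for a list of (GT, PRED) line pairs.
    Assumption: the rate is normalised by the number of GT characters."""
    errs = sum(levenshtein(gt, pred) for gt, pred in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    rate = errs / total if total else 0.0
    return errs, rate


# Made-up example lines, for illustration only.
pairs = [("Hello World", "Hello Wor1d"), ("historical", "hiftorical")]
errs, rate = evaluate(pairs)
print(f"errs: {errs}, error rate: {rate:.2%}, recognition rate: {1 - rate:.2%}")
```

A recognition rate computed this way drops quickly on short lines, so it is usually more informative over all corrected lines of a document than over a single page.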

Training

Input: text line images with corresponding ground truth (as an option, existing OCR models can be included as well; these serve as so-called 'pre-training' and as a basis for model training)
Output: one or more OCR model(s)

The aim of our software is to produce a text containing as few errors as possible. In that case, why is it even necessary to use the training module and produce models targeted to your document, instead of simply correcting it manually? In fact, the better the recognition model, the shorter the correction time. The idea of continuous model training is to train increasingly better models through continuous corrections, which in turn will reduce the amount of corrections needed for the next pages, and so on.

  • With this training tool, users will be able to train models tailored to their document, based on the lines of ground truth available for this document. In order to begin training, please make the following adjustments in the general settings:
    • Set the 'Number of folds to train' (i.e. the number of models to train) to 5. → Training will occur with a model package containing five individual models.
    • 'Only train a single fold box': please don't fill out this box!
    • Set the 'Number of models to train in parallel' to -1. → All models will be trained simultaneously.
    • If all characters contained in the pretraining model need to be kept in the model you wish to train (i.e. added to its so-called whitelist), please check the 'Keep codec of the loaded model(s)' box.
    • In effect, the 'Whitelist characters to keep in the model' is the exhaustive list of characters used during training and in the subsequently generated model. Any character not contained in the whitelist won't be included in the process.
    • 'Pretraining': Choose either 'Train each model based on different existing models' or 'Train all models based on one existing model'. With the first option, a menu containing five dropdown lists will appear; in each of them, select one of the five models belonging to the model package used as advised earlier. Regardless of the training round (be it the first or the third), always select the five models used since the beginning. Choose the second option if you started training using only one model, and simply select that exact model for each repetition of the training process.
    • 'Data augmentation': Please don't fill out this box! This function describes the data augmentation per line. Users can enter a number, e.g. 5, in order to increase the amount of training material. This can lead to the generation of better performing models. However, this process is more time-consuming than the standard route.
    • 'Skip retraining on real data only': Please don't fill out this box!
  • The advanced settings remain unchanged.

fig. 40. Settings for the training of document-specific models.

  • Click on 'execute' to start training. You will be able to view the training progress at any time in the console. Training time will vary depending on the total amount of ground truth lines.
  • In accordance with the aforementioned settings, a model package (containing five individual models and tailored to your document's exact needs) will be generated through training and automatically saved in folder ocr4all/models/document title/0. Going forward, this model package will be labelled '0'. From this point on, while working on this document and striving towards improving recognition, you will be able to select said package under menu item 'recognition' among other models, when working with new pages from the same document. If you wish to generate a second document-specific model package (e.g. to improve the first one's weaknesses), simply repeat the process as described above. This new model will be labelled '1', and so on.
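The numbering of model packages described above ('0', then '1', and so on) can be sketched as follows. This is only an illustration of the naming scheme, assuming each package lives in a numbered subfolder of the document's model folder; it is not the code OCR4all itself uses, and `next_model_label` is a hypothetical helper.

```python
from pathlib import Path


def next_model_label(document_models_dir: Path) -> int:
    """Return the label the next model package would receive, assuming
    existing packages are stored in subfolders named '0', '1', ..."""
    if not document_models_dir.is_dir():
        return 0  # no package trained yet for this document
    labels = [int(entry.name)
              for entry in document_models_dir.iterdir()
              if entry.is_dir() and entry.name.isdigit()]
    return max(labels) + 1 if labels else 0


# Hypothetical location, following the folder layout named in the text.
models_dir = Path("ocr4all/models") / "document title"
print(next_model_label(models_dir))
```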

Post Correction

Input: segmentation information and metadata on pre-processed scans, as well as the corresponding text
Output: corrected/improved segmentation info and text

Under menu item 'post correction', users will be able to manually adjust and correct all segmentation info and text generated through the course of the previous sub-modules. This sub-module is itself divided into three levels:

  • The item 'segment' (i.e. level 1) will enable you to adjust all regions determined during segmentation and their reading order, page after page. You will recognize a few of the tools from working with LAREX (see above). Please note that all changes undertaken at this level will have consequences for the following levels. For example, if you decide to delete a certain region during level 1, you will lose all text lines belonging to this region going forward.
  • The 'lines' item (i.e. level 2) enables you to manually adjust automatic line recognition. You will be able to add lines where there were none, to change their shape or position, or to delete them. The reading order can be adjusted as well, on a line basis.

fig. 41. Adjusting line-based reading order during post correction.

  • Under item 'text' (i.e. level 3), you will find the afore-described ground truth submodule, in which the text content of your lines can be corrected once more.

Result Generation

Input: line-based OCR results, ground truth (optional - only if at hand) and the LAREX-segmentation and line-segmentation data
Output: final text output (lines will be re-grouped into pages and full-text) as well as page-based PAGE XML

fig. 42. Result Generation.

  • Once the user considers all recognition and correction steps to be finalized, results can be generated as TXT or XML files, saved under ocr4all/data/results.
  • You can choose whether you need a text or PAGE XML file under 'settings'. If you opt for a text file, individual TXT files will be generated for each scan as well as an additional one containing your document's entire text.
  • PAGE XML files are also generated on a per-page basis and additionally contain data about the creation date, the last changes in the file, metadata about each page's corresponding scan, the page's size, its layout regions and their exact coordinates, its reading order, its text lines and their text content.
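As a rough illustration of how the text lines in such a PAGE XML file can be read back, the sketch below parses a small, made-up sample using Python's standard library. The element names (`PcGts`, `Page`, `TextRegion`, `TextLine`, `TextEquiv`, `Unicode`) follow the PAGE schema; the sample content, the ids and the helper name `page_text_lines` are assumptions for this example, and files produced by OCR4all will contain considerably more information (coordinates, reading order, metadata).

```python
import xml.etree.ElementTree as ET

# Made-up minimal PAGE XML document, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def page_text_lines(xml_text: str):
    """Return (line id, text) tuples in document order."""
    root = ET.fromstring(xml_text)
    # Read the namespace from the root tag so that other PAGE schema
    # versions are handled as well.
    namespace = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""
    ns = {"pc": namespace}
    lines = []
    for line in root.iterfind(".//pc:TextLine", ns):
        # Only the line-level text is read, not word- or glyph-level text.
        text = line.find("pc:TextEquiv/pc:Unicode", ns)
        lines.append((line.get("id"), text.text or "" if text is not None else ""))
    return lines


for line_id, text in page_text_lines(SAMPLE):
    print(line_id, text)
```

Note that document order is not necessarily the reading order; a complete reader would also consult the reading order information stored in the file.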
',128);function $(i,H,V,Z,J,K){return l(),n("div",null,[t("h1",F,[a(s(i.$frontmatter.title)+" ",1),Y]),B])}const te=o(N,[["render",$]]);export{ee as __pageData,te as default}; diff --git a/assets/guide_user-guide_workflow.md.BxdBwB_I.lean.js b/assets/guide_user-guide_workflow.md.CNo_DpnQ.lean.js similarity index 97% rename from assets/guide_user-guide_workflow.md.BxdBwB_I.lean.js rename to assets/guide_user-guide_workflow.md.CNo_DpnQ.lean.js index 09a7760c..750223d3 100644 --- a/assets/guide_user-guide_workflow.md.BxdBwB_I.lean.js +++ b/assets/guide_user-guide_workflow.md.CNo_DpnQ.lean.js @@ -1 +1 @@ -import{_ as o,c as n,j as t,a,t as s,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const d="/images/user-guide/workflow/process_flow.png",c="/images/user-guide/workflow/selection_of_an_appropiate_ocr_model.png",g="/images/user-guide/workflow/individual_lines_with_their_corresponding_ocr_results.png",u="/images/user-guide/workflow/pre-processing_settings.png",h="/images/user-guide/workflow/noise_removal_settings.png",p="/images/user-guide/workflow/LAREX_settings.png",m="/images/user-guide/workflow/LAREX_interface_with_automatic_segmentation_results.png",f="/images/user-guide/workflow/toolbar.png",w="/images/user-guide/workflow/sidebar_settings.png",y="/images/user-guide/workflow/range_options_regions.png",b="/images/user-guide/workflow/layout_regions_display_and_template.png",v="/images/user-guide/workflow/defining_new_layout_regions.png",k="/images/user-guide/workflow/parameters_settings.png",_="/images/user-guide/workflow/settings.png",x="/images/user-guide/workflow/toolbar_reading_order.png",R="/images/user-guide/workflow/auto_generated_results.png",T="/images/user-guide/workflow/region_of_interest.png",e="/images/user-guide/workflow/correcting_a_faulty_typification.png",O="/images/user-guide/workflow/drawing_a_line.png",C="/images/user-guide/workflow/correcting_typification.png",q="/images/user-guide/workflow/determining_reading_order.png",A="/images/user-guid
e/workflow/saving_segmentation_result.png",I="/images/user-guide/workflow/showing_contours.png",S="/images/user-guide/workflow/selecting_contours.png",P="/images/user-guide/workflow/aggregating_selected_items.png",L="/images/user-guide/workflow/typifying_new_layout_element.png",E="/images/user-guide/workflow/layout_element_contours.png",X="/images/user-guide/workflow/line_segmentation_settings.png",z="/images/user-guide/workflow/selection_of_model_package.png",j="/images/user-guide/workflow/ground_truth_production_text_view.png",G="/images/user-guide/workflow/ground_truth_production_page_view.png",D="/images/user-guide/workflow/evaluation.png",W="/images/user-guide/workflow/settings_for_training.png",U="/images/user-guide/workflow/adjusting_line_based_reading_order.png",M="/images/user-guide/workflow/result_generation.png",ee=JSON.parse('{"title":"Workflow","description":"","frontmatter":{"title":"Workflow"},"headers":[],"relativePath":"guide/user-guide/workflow.md","filePath":"guide/user-guide/workflow.md","lastUpdated":1724226141000}'),N={name:"guide/user-guide/workflow.md"},F={id:"frontmatter-title",tabindex:"-1"},Y=t("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),B=r("",128);function $(i,H,V,Z,J,K){return l(),n("div",null,[t("h1",F,[a(s(i.$frontmatter.title)+" ",1),Y]),B])}const te=o(N,[["render",$]]);export{ee as __pageData,te as default}; +import{_ as o,c as n,j as t,a,t as s,a4 as r,o as l}from"./chunks/framework.CI6U-QuP.js";const 
d="/images/user-guide/workflow/process_flow.png",c="/images/user-guide/workflow/selection_of_an_appropiate_ocr_model.png",g="/images/user-guide/workflow/individual_lines_with_their_corresponding_ocr_results.png",u="/images/user-guide/workflow/pre-processing_settings.png",h="/images/user-guide/workflow/noise_removal_settings.png",p="/images/user-guide/workflow/LAREX_settings.png",m="/images/user-guide/workflow/LAREX_interface_with_automatic_segmentation_results.png",f="/images/user-guide/workflow/toolbar.png",w="/images/user-guide/workflow/sidebar_settings.png",y="/images/user-guide/workflow/range_options_regions.png",b="/images/user-guide/workflow/layout_regions_display_and_template.png",v="/images/user-guide/workflow/defining_new_layout_regions.png",k="/images/user-guide/workflow/parameters_settings.png",_="/images/user-guide/workflow/settings.png",x="/images/user-guide/workflow/toolbar_reading_order.png",R="/images/user-guide/workflow/auto_generated_results.png",T="/images/user-guide/workflow/region_of_interest.png",e="/images/user-guide/workflow/correcting_a_faulty_typification.png",O="/images/user-guide/workflow/drawing_a_line.png",C="/images/user-guide/workflow/correcting_typification.png",q="/images/user-guide/workflow/determining_reading_order.png",A="/images/user-guide/workflow/saving_segmentation_result.png",I="/images/user-guide/workflow/showing_contours.png",S="/images/user-guide/workflow/selecting_contours.png",P="/images/user-guide/workflow/aggregating_selected_items.png",L="/images/user-guide/workflow/typifying_new_layout_element.png",E="/images/user-guide/workflow/layout_element_contours.png",X="/images/user-guide/workflow/line_segmentation_settings.png",z="/images/user-guide/workflow/selection_of_model_package.png",j="/images/user-guide/workflow/ground_truth_production_text_view.png",G="/images/user-guide/workflow/ground_truth_production_page_view.png",D="/images/user-guide/workflow/evaluation.png",W="/images/user-guide/workflow/settings_for_training
.png",U="/images/user-guide/workflow/adjusting_line_based_reading_order.png",M="/images/user-guide/workflow/result_generation.png",ee=JSON.parse('{"title":"Workflow","description":"","frontmatter":{"title":"Workflow"},"headers":[],"relativePath":"guide/user-guide/workflow.md","filePath":"guide/user-guide/workflow.md","lastUpdated":1724832369000}'),N={name:"guide/user-guide/workflow.md"},F={id:"frontmatter-title",tabindex:"-1"},Y=t("a",{class:"header-anchor",href:"#frontmatter-title","aria-label":'Permalink to "{{ $frontmatter.title }}"'},"​",-1),B=r("",128);function $(i,H,V,Z,J,K){return l(),n("div",null,[t("h1",F,[a(s(i.$frontmatter.title)+" ",1),Y]),B])}const te=o(N,[["render",$]]);export{ee as __pageData,te as default}; diff --git a/assets/index.md.BC2x8Von.js b/assets/index.md.BC2x8Von.js new file mode 100644 index 00000000..657eda07 --- /dev/null +++ b/assets/index.md.BC2x8Von.js @@ -0,0 +1 @@ +import{_ as e,c as t,o as a}from"./chunks/framework.CI6U-QuP.js";const m=JSON.parse('{"title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","description":"","frontmatter":{"layout":"home","title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","hero":{"name":"OCR4all","text":"Optical Character Recognition (and more) for everyone","tagline":"Setup guide, user guide, developer documentation and more.","image":{"src":"/assets/brand/logo.svg","alt":"OCR4all"},"actions":[{"theme":"brand","text":"Get Started","link":"/guide/setup-guide/quickstart"},{"theme":"alt","text":"User Guide","link":"/guide/user-guide/introduction"},{"theme":"brand","text":"✨ Beta Release 1.0","link":"/beta"}]},"features":[{"title":"Fully free and open-source","details":"OCR4all is and will stay completely free and open-source. 
No subscriptions, paywalled features or private code."},{"title":"Flexible applicable","details":"From the high-quality processing of challenging manuscripts to the mass full-text recognition of printings"},{"title":"Powerful layout and text annotation included","details":"Manually annotate, correct or compare layout and text elements using the powerful LAREX editor."},{"title":"OCR-D compatible","details":"All future versions of OCR4all are built to be fully compatible with the OCR-D ecosystem"},{"title":"Designed with usability in mind","details":"Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces."},{"title":"Easy cross-platform deployment","details":"Docker and a single command are all it takes to get OCR4all up and running, regardless of your operating system."}]},"headers":[],"relativePath":"index.md","filePath":"index.md","lastUpdated":1724832369000}'),i={name:"index.md"};function l(o,n,r,s,d,u){return a(),t("div")}const p=e(i,[["render",l]]);export{m as __pageData,p as default}; diff --git a/assets/index.md.BC2x8Von.lean.js b/assets/index.md.BC2x8Von.lean.js new file mode 100644 index 00000000..657eda07 --- /dev/null +++ b/assets/index.md.BC2x8Von.lean.js @@ -0,0 +1 @@ +import{_ as e,c as t,o as a}from"./chunks/framework.CI6U-QuP.js";const m=JSON.parse('{"title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","description":"","frontmatter":{"layout":"home","title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","hero":{"name":"OCR4all","text":"Optical Character Recognition (and more) for everyone","tagline":"Setup guide, user guide, developer documentation and more.","image":{"src":"/assets/brand/logo.svg","alt":"OCR4all"},"actions":[{"theme":"brand","text":"Get Started","link":"/guide/setup-guide/quickstart"},{"theme":"alt","text":"User Guide","link":"/guide/user-guide/introduction"},{"theme":"brand","text":"✨ Beta 
Release 1.0","link":"/beta"}]},"features":[{"title":"Fully free and open-source","details":"OCR4all is and will stay completely free and open-source. No subscriptions, paywalled features or private code."},{"title":"Flexible applicable","details":"From the high-quality processing of challenging manuscripts to the mass full-text recognition of printings"},{"title":"Powerful layout and text annotation included","details":"Manually annotate, correct or compare layout and text elements using the powerful LAREX editor."},{"title":"OCR-D compatible","details":"All future versions of OCR4all are built to be fully compatible with the OCR-D ecosystem"},{"title":"Designed with usability in mind","details":"Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces."},{"title":"Easy cross-platform deployment","details":"Docker and a single command are all it takes to get OCR4all up and running, regardless of your operating system."}]},"headers":[],"relativePath":"index.md","filePath":"index.md","lastUpdated":1724832369000}'),i={name:"index.md"};function l(o,n,r,s,d,u){return a(),t("div")}const p=e(i,[["render",l]]);export{m as __pageData,p as default}; diff --git a/assets/index.md.BMIi4gsl.js b/assets/index.md.BMIi4gsl.js deleted file mode 100644 index 86a0b891..00000000 --- a/assets/index.md.BMIi4gsl.js +++ /dev/null @@ -1 +0,0 @@ -import{_ as e,c as t,o as a}from"./chunks/framework.CI6U-QuP.js";const p=JSON.parse('{"title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","description":"","frontmatter":{"layout":"home","title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","hero":{"name":"OCR4all","text":"Optical Character Recognition (and more) for everyone","tagline":"Setup guide, user guide, developer documentation and more.","image":{"src":"/assets/brand/logo.svg","alt":"OCR4all"},"actions":[{"theme":"brand","text":"Get 
Started","link":"/guide/setup-guide/quickstart"},{"theme":"alt","text":"User Guide","link":"/guide/user-guide/introduction"}]},"features":[{"title":"Fully free and open-source","details":"OCR4all is and will stay completely free and open-source. No subscriptions, paywalled features or private code."},{"title":"Flexible applicable","details":"From the high-quality processing of challenging manuscripts to the mass full-text recognition of printings"},{"title":"Powerful layout and text annotation included","details":"Manually annotate, correct or compare layout and text elements using the powerful LAREX editor."},{"title":"OCR-D compatible","details":"All future versions of OCR4all are built to be fully compatible with the OCR-D ecosystem"},{"title":"Designed with usability in mind","details":"Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces."},{"title":"Easy cross-platform deployment","details":"Docker and a single command are all it takes to get OCR4all up and running, regardless of your operating system."}]},"headers":[],"relativePath":"index.md","filePath":"index.md","lastUpdated":1724226141000}'),i={name:"index.md"};function o(l,n,r,s,d,u){return a(),t("div")}const m=e(i,[["render",o]]);export{p as __pageData,m as default}; diff --git a/assets/index.md.BMIi4gsl.lean.js b/assets/index.md.BMIi4gsl.lean.js deleted file mode 100644 index 86a0b891..00000000 --- a/assets/index.md.BMIi4gsl.lean.js +++ /dev/null @@ -1 +0,0 @@ -import{_ as e,c as t,o as a}from"./chunks/framework.CI6U-QuP.js";const p=JSON.parse('{"title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","description":"","frontmatter":{"layout":"home","title":"OCR4all","titleTemplate":"Setup guide, user guide, developer documentation and more.","hero":{"name":"OCR4all","text":"Optical Character Recognition (and more) for everyone","tagline":"Setup guide, user guide, developer documentation and 
more.","image":{"src":"/assets/brand/logo.svg","alt":"OCR4all"},"actions":[{"theme":"brand","text":"Get Started","link":"/guide/setup-guide/quickstart"},{"theme":"alt","text":"User Guide","link":"/guide/user-guide/introduction"}]},"features":[{"title":"Fully free and open-source","details":"OCR4all is and will stay completely free and open-source. No subscriptions, paywalled features or private code."},{"title":"Flexible applicable","details":"From the high-quality processing of challenging manuscripts to the mass full-text recognition of printings"},{"title":"Powerful layout and text annotation included","details":"Manually annotate, correct or compare layout and text elements using the powerful LAREX editor."},{"title":"OCR-D compatible","details":"All future versions of OCR4all are built to be fully compatible with the OCR-D ecosystem"},{"title":"Designed with usability in mind","details":"Create complex OCR workflows through the UI without the need of interacting with code or command line interfaces."},{"title":"Easy cross-platform deployment","details":"Docker and a single command are all it takes to get OCR4all up and running, regardless of your operating system."}]},"headers":[],"relativePath":"index.md","filePath":"index.md","lastUpdated":1724226141000}'),i={name:"index.md"};function o(l,n,r,s,d,u){return a(),t("div")}const m=e(i,[["render",o]]);export{p as __pageData,m as default}; diff --git a/beta/index.html b/beta/index.html new file mode 100644 index 00000000..89c6fd0c --- /dev/null +++ b/beta/index.html @@ -0,0 +1,24 @@ + + + + + + OCR4all 1.0 | OCR4all + + + + + + + + + + + + + +

OCR4all 1.0

Core Features

  • 👥 User and group management. Share all your projects and files with other users and groups on the same instance.
  • ⚙️ Wide array of OCR processors, powered by OCR-D and others.
  • 🗂️ Fully fledged in-app data management. Upload and manage your images, models, workflows and datasets completely through the UI.
  • 📥 Export all uploaded and generated data, import them into other OCR4all instances or use them wherever you want.
  • 👑 Full data sovereignty. No data leaves your instance unless approved by you or the instance administrator.
  • 💪 Generate training data and use it to train or fine-tune models.
  • 🆓 OCR4all is and will always stay free and open-source.
  • and much more...

Next steps

+ + + + \ No newline at end of file diff --git a/beta/introduction.html b/beta/introduction.html new file mode 100644 index 00000000..74710d62 --- /dev/null +++ b/beta/introduction.html @@ -0,0 +1,24 @@ + + + + + + OCR4all 1.0 – Introduction | OCR4all + + + + + + + + + + + + + +

OCR4all 1.0 – Introduction

Motivation and General Idea

  • Availability of Solutions: Numerous high-performance open-source solutions for Automatic Text Recognition (ATR) are already available, with new releases emerging continuously.
  • Diverse Use Cases: The highly heterogeneous nature of use cases necessitates the targeted deployment of specialized ATR solutions.
  • Requirement: There is a need for user-friendly frameworks that facilitate the flexible, integrable, and sustainable combination and application of both existing and future ATR solutions.
  • Objective: Our goal is to empower users to perform ATR independently, achieving high-quality results.
  • Foundation: This framework is built upon freely available tools, enhanced by our in-house developments.

OCR-D and OCR4all

  • OCR-D Initiative: The DFG-funded OCR-D initiative is dedicated to facilitating the mass full-text transformation of historical prints published in the German-speaking world.
  • Focus Areas: OCR-D emphasizes interoperability and connectivity, ensuring a high degree of flexibility and sustainability in its solutions.
  • Integrated Solutions: The initiative combines multiple ATR solutions within a unified framework, enabling precise adaptation to specific materials and use cases.
  • Open Source Commitment: All results from the OCR-D project are released as completely open-source.
  • OCR4all-Libraries Project: The DFG-funded OCR4all-libraries project has two primary goals:
    • Providing a user-friendly interface for OCR-D solutions via OCR4all, enabling independent use by non-technical users.
    • Enhancing the ATR output within OCR4all to offer added value to even the most technically experienced users.

System Architecture

  • Modularity and Interoperability: The framework is designed with a strong focus on modularity and interoperability, ensuring seamless integration and adaptability.
  • Distributed Infrastructure: The architecture features a distributed infrastructure, with a clear separation between the backend and frontend components.
    • Backend: Built with Java and Spring Boot.
    • Frontend: Developed using the Vue.js ecosystem.
  • Component Communication: Components communicate via a REST API, enabling efficient interaction between different parts of the system.
  • Integration of Third-Party Solutions: Service Provider Interfaces (SPIs) allow for the integration of third-party solutions, such as ATR processors.
  • Containerized Setup: The containerized architecture ensures easy distribution and deployment of all integrated components with minimal barriers.
  • Data Sovereignty: Users retain full control over their data, with no data leaving the instance without explicit user or administrator consent.
  • Reproducibility: Every step in the process is fully reproducible. A "transcript of records" feature stores detailed information about the processors and parameters used, ensuring transparency and repeatability.

Modules

Data Management and Processing

  • Separation of Functions: Data management and processing are strictly separated to ensure efficient handling and security.
  • Data Sharing: Data can be shared with different users or user groups as needed.

Processors and NodeFlow

  • Wide Array of Processors: A diverse range of ATR processors is available, including OCR-D and external options.
  • Ease of Integration: New processors can be easily implemented via a well-defined interface, with the user interface generated automatically.
  • NodeFlow: The graphical editor NodeFlow simplifies the creation of workflows, making it convenient for users to design and customize processing sequences.

LAREX

  • Result Correction and Training Data Creation: LAREX allows for the correction of all ATR workflow results and the creation of training data.
  • Visual Workflow Identification: LAREX helps users identify the most suitable workflows as a visual explanation component.

Datasets, Training, and Evaluation

  • Dataset Creation: Datasets can be created with the option to use tagging and import functionalities.
  • Dataset Enrichment: Datasets can be enriched with training data generated and tagged within the application, even across various projects and sources.
  • Model Training: Train models on selected datasets or subsets thereof, with options for in-app usage or exporting both models and associated training data.
  • Model Evaluation: Evaluate both trained and imported models using curated datasets to ensure quality and accuracy.

Working with OCR4all 1.0

One Tool, Two Modes

Base Mode | Pro Mode
Designed for novice users, with reduced complexity and a strongly guided, linear workflow | Tailored for experienced users who require more exploration and complexity
Pre-selected solutions for each processing step | Unrestricted access to all processors, parameters, and features
Pre-filtered parameters and limited access to advanced features | Support for identifying the best workflows and models for specific needs

INFO

Currently only pro mode is available in the beta release. The base mode will be added shortly.

Example Use Cases and Application Scenarios

Fully Automatic Mass Full-Text Digitalization

  • Objective: Maximize throughput with minimal manual effort.
  • Users: Libraries and archives processing large volumes of scanned materials.
  • Approach: Use the pro mode (NodeFlow, LAREX, and datasets) to identify the most suitable workflow.

Flawless Transcription of Source Material

  • Objective: Achieve maximum quality, accepting significant manual effort.
  • Users: Humanist researchers preparing text for a digital edition.
  • Approach: Utilize the base mode for iterative transcription with continually improving accuracy.

Building Corpora for Quantitative Applications

  • Objective: Maximize quality while minimizing manual effort.
  • Users: Researchers constructing corpora for training and evaluating quantitative methods.
  • Approach: Manage data and consistently retrain source-specific or mixed models using datasets and tagging functionalities.
+ + + + \ No newline at end of file diff --git a/beta/setup.html b/beta/setup.html new file mode 100644 index 00000000..941829d2 --- /dev/null +++ b/beta/setup.html @@ -0,0 +1,81 @@ + + + + + + OCR4all 1.0 – Setup | OCR4all + + + + + + + + + + + + + +

OCR4all 1.0 – Setup

If you want to try out the beta version of OCR4all 1.0, you can simply use the following Docker Compose file or download it here.

The prerequisite for this is having both Docker and Docker Compose installed.

A more in-depth installation guide will follow with the stable release of OCR4all 1.0.

WARNING

This will install a beta version of OCR4all 1.0, which may still contain some bugs; many features are yet to come.

version: "3.9"
+
+services:
+  msa-calamari:
+    hostname: msa-calamari
+    build:
+      context: ocr4all-app-calamari-msa
+      dockerfile: Dockerfile
+      args:
+        - TAG=${CALAMARI_TAG:-20240502}
+        - JAVA_VERSION=${CALAMARI_JAVA_VERSION:-17}
+        - APP_VERSION=${CALAMARI_APP_VERSION:-1.0-SNAPSHOT}
+    user: "${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=${CALAMARI_PROFILES:-logging-debug,msa-api,docker}
+    volumes:
+      - ${OCR4ALL_DATA:-~/ocr4all/docker/data}:/srv/ocr4all/data
+      - ${OCR4ALL_ASSEMBLE:-~/ocr4all/docker/assemble}:/srv/ocr4all/assemble
+      - ${OCR4ALL_WORKSPACE_PROJECT:-~/ocr4all/docker/workspace/projects}:/srv/ocr4all/projects
+    ports:
+      - "${CALAMARI_API_PORT:-127.0.0.1:9092}:8080"
+  msa-ocrd:
+    hostname: msa-ocrd
+    build:
+      context: ocr4all-app-ocrd-msa
+      dockerfile: Dockerfile
+      args:
+        - TAG=${OCRD_TAG:-2024-04-29}
+        - JAVA_VERSION=${OCRD_JAVA_VERSION:-17}
+        - APP_VERSION=${OCRD_APP_VERSION:-1.0-SNAPSHOT}
+    user: "${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=${OCRD_PROFILES:-logging-debug,msa-api,docker}
+    volumes:
+      - ${OCR4ALL_WORKSPACE_PROJECT:-~/ocr4all/docker/workspace/projects}:/srv/ocr4all/projects
+      - ${OCR4ALL_RESOURCES_ORCD:-~/ocr4all/docker/opt/ocr-d/resources}:/usr/local/share/ocrd-resources
+    ports:
+      - "${OCRD_API_PORT:-127.0.0.1:9091}:8080"
+  server:
+    build:
+      context: ocr4all-app
+      dockerfile: Dockerfile
+      args:
+        - TAG=${OCR4ALL_TAG:-17-jdk-slim}
+        - APP_VERSION=${OCR4ALL_APP_VERSION:-1.0-SNAPSHOT}
+    user: "${UID:-}"
+    restart: always
+    environment:
+      - SPRING_PROFILES_ACTIVE=${OCR4ALL_PROFILES:-logging-debug,server,api,documentation,docker}
+    volumes:
+      - ${OCR4ALL_HOME:-~/ocr4all/docker}:/srv/ocr4all
+    ports:
+      - "${OCR4ALL_API_PORT:-9090}:8080"
+    depends_on:
+      - msa-calamari
+      - msa-ocrd
+ + + + \ No newline at end of file diff --git a/guide/setup-guide/linux.html b/guide/setup-guide/linux.html index 9e4cee62..8fa7a998 100644 --- a/guide/setup-guide/linux.html +++ b/guide/setup-guide/linux.html @@ -12,12 +12,12 @@ - + -
Skip to content

Setup guide – Linux

Preparation

You have to prepare the following folder structure:

...
+    
Skip to content

Setup guide – Linux

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
@@ -34,8 +34,8 @@
     --name ocr4all \
     -v $PWD/data:/var/ocr4all/data \
     -v $PWD/models:/var/ocr4all/models/custom \
-    -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.
  • Browser access: http://localhost:1476/ocr4all/
  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v $PWD/data:/…) is working properly.
  • Otherwise, it's likely that there was a typo in the “docker run” command, so you will have to create the container again. First, delete the container you just created:
  • Stop the process in the terminal using CTRL+C, then type:
docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the -v $PWD/data:/… lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
- + -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.
  • Browser access: http://localhost:1476/ocr4all/
  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v $PWD/data:/…) is working properly.
  • Otherwise, it's likely that there was a typo in the “docker run” command, so you will have to create the container again. First, delete the container you just created:
  • Stop the process in the terminal using CTRL+C, then type:
docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the -v $PWD/data:/… lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
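The recovery steps above can be sketched as a small shell script. This is only an illustration, assuming the container name, port, user option and mount points of the Quickstart command; the function name `ocr4all_run_cmd` and the variable `OCR4ALL_DIR` are hypothetical and not part of OCR4all. The script prints the commands instead of running them, so the -v paths can be checked before anything is created.

```shell
#!/bin/sh
# Sketch: print the docker commands with absolute paths so that a typo
# in the -v mappings is easy to spot. Nothing is executed by Docker here.
# ocr4all_run_cmd and OCR4ALL_DIR are hypothetical names.

ocr4all_run_cmd() {
    # $1: the ocr4all main folder (the one containing data/ and models/)
    # The -u part is printed literally and is expanded when the line is pasted.
    printf 'sudo docker run -p 1476:8080 -u $(id -u root):$(id -g $USER) --name ocr4all -v %s/data:/var/ocr4all/data -v %s/models:/var/ocr4all/models/custom -it uniwuezpd/ocr4all\n' "$1" "$1"
}

OCR4ALL_DIR=${OCR4ALL_DIR:-$PWD}

# Warn if a folder that is going to be mounted does not exist.
for sub in data models; do
    [ -d "$OCR4ALL_DIR/$sub" ] || echo "warning: $OCR4ALL_DIR/$sub is missing" >&2
done

echo "# remove the old container first:"
echo "docker rm ocr4all"
echo "# then create it again:"
ocr4all_run_cmd "$OCR4ALL_DIR"
```

Paths that contain blanks would need additional quoting, which this sketch does not handle.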
+ \ No newline at end of file diff --git a/guide/setup-guide/macos.html b/guide/setup-guide/macos.html index ba8af259..06d4aa19 100644 --- a/guide/setup-guide/macos.html +++ b/guide/setup-guide/macos.html @@ -12,12 +12,12 @@ - + -
Skip to content

Setup Guide – macOS

Preparation

You have to prepare the following folder structure:

...
+    
Skip to content

Setup Guide – macOS

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
@@ -33,8 +33,8 @@
     --name ocr4all \
     -v $PWD/data:/var/ocr4all/data \
     -v $PWD/models:/var/ocr4all/models/custom \
-    -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.

  • Browser access: http://localhost:1476/ocr4all/

  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v $PWD/data:/…) is working properly.

  • Otherwise, it's likely that there was a typo in the “docker run” command, so you will have to create the container again. First, delete the container you just created:

  • Stop the process in the terminal using CTRL+C, then type:

docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the -v $PWD/data:/… lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
- + -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.

  • Browser access: http://localhost:1476/ocr4all/

  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v $PWD/data:/…) is working properly.

  • Otherwise, it's likely that there was a typo in the “docker run” command, so you will have to create the container again. First, delete the container you just created:

  • Stop the process in the terminal using CTRL+C, then type:

docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the -v $PWD/data:/… lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
+ \ No newline at end of file diff --git a/guide/setup-guide/quickstart.html b/guide/setup-guide/quickstart.html index 50641a74..c4f5f4f0 100644 --- a/guide/setup-guide/quickstart.html +++ b/guide/setup-guide/quickstart.html @@ -12,18 +12,18 @@ - + -
Skip to content

Quickstart

sudo docker run -p 1476:8080 \
+    
Skip to content

Quickstart

sudo docker run -p 1476:8080 \
     -u `id -u root`:`id -g $USER` \
     --name ocr4all \
     -v $PWD/data:/var/ocr4all/data \
     -v $PWD/models:/var/ocr4all/models/custom \
-    -it uniwuezpd/ocr4all
  • Access OCR4all in your browser under http://localhost:1476/ocr4all/

Detailed setup guides

For more detailed instructions, follow one of the setup guides below.

- + -it uniwuezpd/ocr4all
  • Access OCR4all in your browser under http://localhost:1476/ocr4all/

Detailed setup guides

For more detailed instructions, follow one of the setup guides below.

+ \ No newline at end of file diff --git a/guide/setup-guide/windows.html b/guide/setup-guide/windows.html index 3fb8a955..ae5afd36 100644 --- a/guide/setup-guide/windows.html +++ b/guide/setup-guide/windows.html @@ -12,12 +12,12 @@ - + -
Skip to content

Setup guide – Windows

Preparation

You have to prepare the following folder structure:

...
+    
Skip to content

Setup guide – Windows

Preparation

You have to prepare the following folder structure:

...
 ├── ocr4all
 │   ├── data
 │   |   ├── [Your book]
@@ -29,8 +29,8 @@
 │   |   |   ├── input
 │   |   |   |   ...
 │   ├── models
-...

Explanation:

  • ocr4all (main folder)
  • models (folder for the neural network models)
  • data (folder for the documents you want to recognize)
  • [Your book] (folder that contains all data of a single, specific print/book)
  • input (folder for the original colour or grayscale book scans, one image per page)

Choosing the right Docker version

  • You will need the Community Edition (CE) of Docker for installation.
  • Docker for Windows:
    • Available for Windows 10, 64 bit: Pro, Enterprise or Education (Build 14393 or later; check for your version, which can be found in your System Information)
    • https://docs.docker.com/docker-for-windows/release-notes/ (If you do not want to register, do not choose “Download Docker for Windows” right away, but instead use “Download” under the “Stable Releases” section below)

Docker for Windows

Docker Setup

  • Follow the installation guide under https://docs.docker.com/desktop/windows/install/.

  • Make sure to give all needed permissions, install all additional drivers etc.

  • Start Docker.

  • Adjust the Docker settings (right-click on the Docker symbol in the hidden bottom-right toolbar, then choose Settings):

    • Shared Drives: Chosen drive (or partition).
      • You will need at least one. Our recommendation: Simply use C:.
      • Click Apply. (Attention: This requires a valid, non-empty Windows password. Changing or removing the password later results in a silent removal of your Docker privileges!).
    • Advanced: Adjust CPUs (max) and Memory (2GB+) if you want to.

OCR4all Setup

  • Move the OCR4all folder structure detailed above (Preparation) to the shared drive (or partition). In the following example, we use C:\Users\Public\ocr4all\.... We recommend using the same location for the first setup.
  • Inside the OCR4all folder, open PowerShell (Shift + right click inside OCR4all folder -> Open PowerShell window here) and download the OCR4all image using the following command (this will take a few minutes and requires a stable internet connection):
docker pull uniwuezpd/ocr4all
  • Create the OCR4all container using the following command (Note: this works only for the recommended setup, i.e. when the OCR4all folder is located in C:\Users\Public\...):
docker run -p 1476:8080 --name ocr4all -v C:\Users\Public\ocr4all\data:/var/ocr4all/data -v C:\Users\Public\ocr4all\models:/var/ocr4all/models/custom -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

  • Alternatively, you will have to adjust the paths marked in bold print.

    • Use absolute paths!
    • Use auto-completion! (default: Tab key)
    • We recommend not using the print working directory (PWD) in this case.

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.

  • Browser access: http://localhost:1476/ocr4all/

  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v C:\Users\...) is working properly.

  • Otherwise, there might be a typo in the docker run command, so you will have to create the container again. First, delete the container you just created:

  • Stop the process in PowerShell using CTRL + C, then type:

docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the two -v C:\Users\.. lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
- +...

Explanation:

  • ocr4all (main folder)
  • models (folder for the neural network models)
  • data (folder for the documents you want to recognize)
  • [Your book] (folder that contains all data of a single, specific print/book)
  • input (folder for the original colour or grayscale book scans, one image per page)

Choosing the right Docker version

  • You will need the Community Edition (CE) of Docker for installation.
  • Docker for Windows:
    • Available for Windows 10, 64 bit: Pro, Enterprise or Education (Build 14393 or later; check for your version, which can be found in your System Information)
    • https://docs.docker.com/docker-for-windows/release-notes/ (If you do not want to register, do not choose “Download Docker for Windows” right away, but instead use “Download” under the “Stable Releases” section below)

Docker for Windows

Docker Setup

  • Follow the installation guide under https://docs.docker.com/desktop/windows/install/.

  • Make sure to give all needed permissions, install all additional drivers etc.

  • Start Docker.

  • Adjust the Docker settings (right-click on the Docker symbol in the hidden bottom-right toolbar, then choose Settings):

    • Shared Drives: Chosen drive (or partition).
      • You will need at least one. Our recommendation: Simply use C:.
      • Click Apply. (Attention: This requires a valid, non-empty Windows password. Changing or removing the password later results in a silent removal of your Docker privileges!).
    • Advanced: Adjust CPUs (max) and Memory (2GB+) if you want to.

OCR4all Setup

  • Move the OCR4all folder structure detailed above (Preparation) to the shared drive (or partition). In the following example, we use C:\Users\Public\ocr4all\.... We recommend using the same location for the first setup.
  • Inside the OCR4all folder, open PowerShell (Shift + right click inside OCR4all folder -> Open PowerShell window here) and download the OCR4all image using the following command (this will take a few minutes and requires a stable internet connection):
docker pull uniwuezpd/ocr4all
  • Create the OCR4all container using the following command (Note: this works only for the recommended setup, i.e. when the OCR4all folder is located in C:\Users\Public\...):
docker run -p 1476:8080 --name ocr4all -v C:\Users\Public\ocr4all\data:/var/ocr4all/data -v C:\Users\Public\ocr4all\models:/var/ocr4all/models/custom -it uniwuezpd/ocr4all
  • Do not enter line breaks manually!

  • Alternatively, you will have to adjust the paths marked in bold print.

    • Use absolute paths!
    • Use auto-completion! (default: Tab key)
    • We recommend not using the print working directory (PWD) in this case.

Browser access and further use

  • OCR4all is optimized for Chrome/Chromium.

  • Browser access: http://localhost:1476/ocr4all/

  • If you want to check whether the mapping is working correctly you can add the example projects Cirurgia and GNM from here to your data directory. In the browser, check Project Overview -> Project selection: If you can find the two aforementioned books (or any other book that you added), the mapping (-v C:\Users\...) is working properly.

  • Otherwise, there might be a typo in the docker run command, so you will have to create the container again. First, delete the container you just created:

  • Stop the process in PowerShell using CTRL + C, then type:

docker rm ocr4all
  • Check and correct your command (as with most terminals, you can sift through your previous commands using the arrow keys), especially the two -v C:\Users\.. lines, then run it again.
  • If everything is set up properly, you can restart OCR4all in the future by using:
docker start -ia ocr4all
+ \ No newline at end of file diff --git a/guide/user-guide/common-errors.html b/guide/user-guide/common-errors.html index f1ccc32d..69ac2960 100644 --- a/guide/user-guide/common-errors.html +++ b/guide/user-guide/common-errors.html @@ -12,13 +12,13 @@ - + -
Skip to content

Common errors

Warning

This page is still under construction. If you come across any problems please contact us.

Errors, frequent problems and how to avoid them

Problems with the installation and start of Docker:

  • Did you encounter problems while installing or starting Docker? You will find a detailed guide here.
  • Do you have difficulties starting the Docker container for OCR4all, or does the server not start? First, restart Docker (if necessary, pull the OCR4all image again and recreate the corresponding container, following the steps described in the OCR4all setup guide here).
  • Are you using an Apple device with an M1 / M2 chip? We currently don't offer specific images for these systems but are working on it.

Problems selecting works in 'Project Overview':

  • If available works are not displayed in 'Project Overview', check that your folder structure follows the guidelines outlined in chapter 1.2. If there is no problem with your folder structure, delete the OCR4all Docker container and re-execute the docker run... command, following the setup guide here.
  • Are you unable to select a work? Please ensure that your work/document title contains no blanks or umlauts.

Problems with Calamari recognition or training:

  • Are you seeing errors that mention AVX? If you are using an old CPU without AVX, or a virtual machine where AVX passthrough is not enabled, you may run into errors during process execution, because official TensorFlow builds do not support these systems.

We welcome all questions and encourage you to contact us if you have any problems. Please send an email (consultation, guides, and non-technical user support) or contact us on GitHub.

- +
Skip to content

Common errors

Warning

This page is still under construction. If you come across any problems please contact us.

Errors, frequent problems and how to avoid them

Problems with the installation and start of Docker:

  • Did you encounter problems while installing or starting Docker? You will find a detailed guide here.
  • Do you have difficulties starting the Docker container for OCR4all, or does the server not start? First, restart Docker (if necessary, pull the OCR4all image again and recreate the corresponding container, following the steps described in the OCR4all setup guide here).
  • Are you using an Apple device with an M1 / M2 chip? We currently don't offer specific images for these systems but are working on it.

Problems selecting works in 'Project Overview':

  • If available works are not displayed in 'Project Overview', check that your folder structure follows the guidelines outlined in chapter 1.2. If there is no problem with your folder structure, delete the OCR4all Docker container and re-execute the docker run... command, following the setup guide here.
  • Are you unable to select a work? Please ensure that your work/document title contains no blanks or umlauts.
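The naming rule from the last point can be checked before a project is loaded. The following is a minimal sketch, not part of OCR4all: the helper `is_safe_title` is hypothetical, and it deliberately applies a stricter rule than the one stated above (only ASCII letters, digits, dot, underscore and hyphen), which is an assumption made for simplicity.

```shell
#!/bin/sh
# Sketch: list book folders whose names contain blanks, umlauts or other
# characters outside a conservative ASCII set.
# is_safe_title is a hypothetical helper; the allowed set is an assumption.

is_safe_title() (
    # Byte-wise matching, so multi-byte characters such as umlauts are rejected.
    LC_ALL=C
    case $1 in
        '' | *[!A-Za-z0-9._-]*) exit 1 ;;
        *) exit 0 ;;
    esac
)

DATA_DIR=${DATA_DIR:-data}

if [ -d "$DATA_DIR" ]; then
    for dir in "$DATA_DIR"/*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        is_safe_title "$name" || echo "rename before loading: $name"
    done
fi
```

Run it from inside the ocr4all main folder, or set DATA_DIR to the path of your data folder.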

Problems with Calamari recognition or training:

  • Are you seeing errors that mention AVX? If you are using an old CPU without AVX, or a virtual machine where AVX passthrough is not enabled, you may run into errors during process execution, because official TensorFlow builds do not support these systems.
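On Linux, whether the CPU (or the virtual machine) reports AVX can be checked from the flag list in /proc/cpuinfo. This is a hedged sketch: the helper `has_avx` is a hypothetical name, and the check only tells you what the system advertises, not whether a particular TensorFlow build will run.

```shell
#!/bin/sh
# Sketch: report whether a CPU flag list advertises AVX.
# has_avx is a hypothetical helper; /proc/cpuinfo exists on Linux only.

has_avx() {
    # $1: text containing the CPU flags, e.g. the contents of /proc/cpuinfo.
    # Matches "avx" as a whole word, so "avx2" alone does not count.
    printf '%s\n' "$1" | grep -Eq '(^|[[:space:]])avx([[:space:]]|$)'
}

if [ -r /proc/cpuinfo ]; then
    if has_avx "$(cat /proc/cpuinfo)"; then
        echo "AVX is reported by this system"
    else
        echo "AVX is not reported; recognition or training may fail"
    fi
fi
```

Run the check on the machine or virtual machine that hosts Docker, since that is the environment the container sees.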

We welcome all questions and encourage you to contact us if you have any problems. Please send an email (consultation, guides, and non-technical user support) or contact us on GitHub.

+ \ No newline at end of file diff --git a/guide/user-guide/introduction.html b/guide/user-guide/introduction.html index f78479a7..660bf025 100644 --- a/guide/user-guide/introduction.html +++ b/guide/user-guide/introduction.html @@ -12,13 +12,13 @@ - + -
Skip to content

User Guide – Introduction

OCR4all is software geared primarily towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layouts challenge most standard text recognition software. The workflow established by OCR4all is not only easy to understand but can also be used independently, which makes it particularly suitable for users with no background in computer science, in part because it combines different tools in one consistent user interface. Constant switching between different software platforms thereby becomes unnecessary.

OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).

fig. 1. Principal components of the OCR4all workflow.

In part thanks to the possibility of developing and training book-specific recognition models – which can in principle also be applied to other prints – OCR4all produces very good results on almost any printed text.

The following guide aims to provide a detailed look at OCR4all’s operation and fields of application, with a particular focus on the recognition of early prints. While chapter 1 covers the software’s setup and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and not only visibly improves the results but also facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. Finally, chapter 5 addresses the most common user problems currently known.

- +
Skip to content

User Guide – Introduction

OCR4all is software geared primarily towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layouts challenge most standard text recognition software. The workflow established by OCR4all is not only easy to understand but can also be used independently, which makes it particularly suitable for users with no background in computer science, in part because it combines different tools in one consistent user interface. Constant switching between different software platforms thereby becomes unnecessary.

OCR4all contains a complete and exhaustive OCR workflow, starting with the pre-processing of the images in question (Preprocessing), followed by layout segmentation (Region Segmentation, done with LAREX), the extraction of classified layout regions and line segmentation (Line Segmentation), text recognition (Recognition) and ending with the correction of the textual end product (Ground Truth Production) – all the while developing OCR models adapted to specific printed texts (fig. 1).

fig. 1. Principal components of the OCR4all workflow.

In part thanks to the possibility of developing and training book-specific recognition models – which can in principle also be applied to other prints – OCR4all produces very good results on almost any printed text.

The following guide aims to provide a detailed look at OCR4all’s operation and fields of application, with a particular focus on the recognition of early prints. While chapter 1 covers the software’s setup and folder structure, chapter 2 concentrates on the recommended pre-processing of scans and image data, a step which occurs outside OCR4all and not only visibly improves the results but also facilitates the different steps within the OCR4all workflow. Chapter 3 focuses on starting the software and presenting its basic functions. It is followed, in chapter 4, by a detailed, step-by-step description of the different stages of the workflow, introducing the actual processing of prints and generation of the OCR text. Finally, chapter 5 addresses the most common user problems currently known.

+ \ No newline at end of file diff --git a/guide/user-guide/project-start-and-overview.html b/guide/user-guide/project-start-and-overview.html index 833f46bd..139834cb 100644 --- a/guide/user-guide/project-start-and-overview.html +++ b/guide/user-guide/project-start-and-overview.html @@ -12,13 +12,13 @@ - + -
Skip to content

Project Start and Overview

Start Docker:

  • Linux: Docker will start automatically after the computer starts
  • Docker for Windows: start Docker by clicking on the Docker icon (in ‘Programs’) – wait until “Docker is running” pops up
  • Docker Toolbox: open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

Start OCR4all:

  • Linux: open the terminal, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Windows 10 (Home, Pro, Enterprise, Education): open Windows PowerShell, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Older Versions of Windows (with Docker Toolbox): open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

After this initial setup, you can access OCR4all in your browser at the following address:

  • Linux, Docker for Windows, macOS: http://localhost:1476/ocr4all/
  • Docker Toolbox: http://192.168.99.100:1476/ocr4all/

Once OCR4all has been opened in a browser, the user will automatically land on the 'Project Overview' starting page. From there, they will be able to access several features:

  • 'Settings': This feature is used to select the book to be worked on, which is chosen from the dropdown menu under ‘Project selection’ – the title having previously been saved as a folder under ocr4all/data/book title (see 1.2). Additionally, the ‘gray’ setting must be selected under the menu item ‘Project image type’.

fig. 2: Project Overview settings.

  • Following this initial setup, click on ‘load project’ to load the book in question into OCR4all. Because OCR4all only accepts certain file names and formats (e.g. 0001.png), a data conversion may be requested, which can be carried out directly in OCR4all (fig. 3).

  • It is irrelevant whether a PDF or individual images were placed in the 'input' folder. If possible, however, single images are usually preferred.

fig. 3. Data conversion request (e.g. PDF in 'input' folder).

  • In OCR4all, all data generated during the workflow and for its functioning are saved in a single PAGE XML file per scan page and are no longer kept in the form of many individual files. If project data from previous versions is still available, it is now possible to convert the project automatically into the structure required by OCR4all.

  • The feature "Overview" provides the user with a tabular presentation of the project’s ongoing progress (fig. 4). Each row corresponds to an individual book page, labelled by a page identifier (far left column). The following columns illustrate, from left to right, the workflow’s progression. Once a particular step has been executed, it will appear as completed (green check mark) in that work stage’s specific column.

fig. 4: Overview.

  • Clicking on an individual page’s identifier enables the user to check on the state of that specific page’s processing, as well as on the data generated by it, at any time during the workflow. To this effect, please use the ‘images’ column, as well as the subsequent options: ‘original’, ‘binary’, ‘gray’ and ‘noise removal’.

  • With the button "Export GT" (top right) all data created in the course of the project can be exported and packed as a zip folder within 'data'.

- +
Skip to content

Project Start and Overview

Start Docker:

  • Linux: Docker will start automatically after the computer starts
  • Docker for Windows: start Docker by clicking on the Docker icon (in ‘Programs’) – wait until “Docker is running” pops up
  • Docker Toolbox: open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

Start OCR4all:

  • Linux: open the terminal, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Windows 10 (Home, Pro, Enterprise, Education): open Windows PowerShell, type in docker start -ia ocr4all, press 'enter' and wait for the server to start
  • Older Versions of Windows (with Docker Toolbox): open the Docker QuickStart terminal and wait until “Docker is configured to use default machine…” pops up

After this initial setup, you can access OCR4all in your browser at the following address:

  • Linux, Docker for Windows, macOS: http://localhost:1476/ocr4all/
  • Docker Toolbox: http://192.168.99.100:1476/ocr4all/
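Whether the server has finished starting can also be checked from the terminal. The sketch below assumes that curl is installed; the function name `ocr4all_url` and the variable `OCR4ALL_HOST` are hypothetical, while the port and path are the ones listed above.

```shell
#!/bin/sh
# Sketch: build the OCR4all address and test whether the server answers.
# ocr4all_url and OCR4ALL_HOST are hypothetical names.

ocr4all_url() {
    # $1: host name or IP address under which Docker is reachable
    printf 'http://%s:1476/ocr4all/\n' "$1"
}

url=$(ocr4all_url "${OCR4ALL_HOST:-localhost}")

# -f: treat HTTP errors as failure, -s: no progress output,
# --max-time 5: give up after five seconds
if command -v curl >/dev/null 2>&1 &&
   curl -fs --max-time 5 -o /dev/null "$url"; then
    echo "OCR4all answers at $url"
else
    echo "no answer at $url (is the container still starting?)"
fi
```

For Docker Toolbox, set OCR4ALL_HOST=192.168.99.100 before running the script.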

Once OCR4all has been opened in a browser, the user will automatically land on the 'Project Overview' starting page. From there, they will be able to access several features:

  • 'Settings': This feature is used to select the book to be worked on, which is chosen from the dropdown menu under ‘Project selection’ – the title having previously been saved as a folder under ocr4all/data/book title (see 1.2). Additionally, the ‘gray’ setting must be selected under the menu item ‘Project image type’.

fig. 2: Project Overview settings.

  • Following this initial setup, click on ‘load project’ to load the book in question into OCR4all. Because OCR4all only accepts certain file names and formats (e.g. 0001.png), a data conversion may be requested, which can be carried out directly in OCR4all (fig. 3).

  • It is irrelevant whether a PDF or individual images were placed in the 'input' folder. If possible, however, single images are usually preferred.

fig. 3. Data conversion request (e.g. PDF in 'input' folder).
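To illustrate the naming pattern mentioned above, the following sketch prints how PNG files in an 'input' folder would map to zero-padded names such as 0001.png. It is not the conversion OCR4all performs: the four-digit width is inferred from the example name, the helper `page_name` and the variable `INPUT_DIR` are hypothetical, and nothing is renamed or converted.

```shell
#!/bin/sh
# Sketch: show a zero-padded target name for each PNG in the input folder.
# page_name and INPUT_DIR are hypothetical; this is a dry run only.

page_name() {
    # $1: 1-based page number -> file name such as 0001.png
    printf '%04d.png\n' "$1"
}

INPUT_DIR=${INPUT_DIR:-input}

n=0
for f in "$INPUT_DIR"/*.png; do
    [ -e "$f" ] || continue          # the folder may contain no PNG files
    n=$((n + 1))
    echo "$(basename "$f") -> $(page_name "$n")"
done
```

The files are processed in the shell's alphabetical order, so the original names need to sort in page order for the mapping to be meaningful.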

  • In OCR4all, all data generated during the workflow and for its functioning are saved in a single PAGE XML file per scan page and are no longer kept in the form of many individual files. If project data from previous versions is still available, it is now possible to convert the project automatically into the structure required by OCR4all.

  • The feature "Overview" provides the user with a tabular presentation of the project’s ongoing progress (fig. 4). Each row corresponds to an individual book page, labelled by a page identifier (far left column). The following columns illustrate, from left to right, the workflow’s progression. Once a particular step has been executed, it will appear as completed (green check mark) in that work stage’s specific column.

fig. 4: Overview.

  • Clicking on an individual page’s identifier enables the user to check on the state of that specific page’s processing, as well as on the data generated by it, at any time during the workflow. To this effect, please use the ‘images’ column, as well as the subsequent options: ‘original’, ‘binary’, ‘gray’ and ‘noise removal’.

  • With the button "Export GT" (top right) all data created in the course of the project can be exported and packed as a zip folder within 'data'.

+ \ No newline at end of file diff --git a/guide/user-guide/scan-preparation.html b/guide/user-guide/scan-preparation.html index a70861bd..e372cf4b 100644 --- a/guide/user-guide/scan-preparation.html +++ b/guide/user-guide/scan-preparation.html @@ -12,13 +12,13 @@ - + -
Skip to content

Scan and Image Preparation (ScanTailor)

When it comes to early modern prints, the available material often exists solely in the form of facsimiles. Although their quality is generally good or even very good, their overall condition makes them rather unsuited for direct import into OCR4all. This is the case when the image file contains, aside from the text itself, pictures of the book cover or the printing surface. If such images are binarized during the workflow, black lines often appear because of contrast differences in the original image, and these impair both the OCR and the segmentation. Scan rotation and the display of two book pages on the same scan are other frequent problems.

However, these complications can be easily avoided through the appropriate preparation of the image files. Scans destined to be processed with OCR4all should therefore ideally feature only the content of each single page meant for the recognition process. At the same time, the ideal scan should also contain enough blank page space so as not to impede further steps, such as segmentation. Thus, only the page content deemed irrelevant to the recognition process should be removed, while taking care to leave as much of the original scanned page as possible (concretely, this means page margins shouldn’t be entirely removed).

In principle, most image editors are suitable (GIMP, Adobe Photoshop, etc.). If you have a PDF available, it is also possible to cut and rotate the pages with Adobe Acrobat DC (Batch Processing). However, we advise using ScanTailor, which handles large amounts of data and processes images quickly, efficiently and in a standardized manner. Detailed instructions can be found here.

This step is completely optional and not part of the OCR4all workflow, which is why no support will be provided here. Users have to decide for themselves whether additional preprocessing of this kind would be beneficial or even necessary for their work.

- +
+ \ No newline at end of file diff --git a/guide/user-guide/setup-and-folder-structure.html b/guide/user-guide/setup-and-folder-structure.html index b2d2d6f2..f6469676 100644 --- a/guide/user-guide/setup-and-folder-structure.html +++ b/guide/user-guide/setup-and-folder-structure.html @@ -12,13 +12,13 @@ - + -

Setup and folder structure

Once OCR4all has been successfully installed, the ‘ocr4all’ folder and its two subfolders, data and models, provide the user with the basic and indispensable folder structure for the processing of printed texts.

data contains all the data the user intends to work on with OCR4all as well as all data generated automatically by OCR4all during the workflow. To complete the structure, data must contain a title folder, whose name can be freely chosen (although umlauts and spaces should be avoided) and which itself contains a subfolder named input, in which the original scans or images must be placed. As the OCR4all workflow progresses, a processing folder will be generated automatically at the same level as input; images corresponding to the processing stages of the user’s scans and PAGE XML files will be added to it.

Additionally, the user can save mixed recognition models in the ‘models’ folder (you will find a selection here). This folder will also contain book-specific models generated with OCR4all, which are saved in subfolders named after the relevant book/work titles. Once a particular training starts, the generated models are saved in the corresponding models/work_title folder and numbered consecutively, starting with 0.
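The folder layout described above can be prepared with a few lines of Python. This is only a minimal sketch: the function name `create_project` and the example title `cirurgia` are hypothetical, and the name check is a simple stand-in for the advice to avoid umlauts and spaces (it accepts plain ASCII names without whitespace or path separators).

```python
from pathlib import Path
import tempfile


def create_project(root, title):
    """Create ocr4all/data/<title>/input and ocr4all/models below root."""
    # Assumption: "no umlauts, no blanks" is approximated here by accepting
    # only ASCII names without whitespace or path separators.
    unsuitable = (
        not title
        or not title.isascii()
        or any(ch.isspace() or ch in "/\\" for ch in title)
    )
    if unsuitable:
        raise ValueError(f"unsuitable title folder name: {title!r}")

    base = Path(root) / "ocr4all"
    input_dir = base / "data" / title / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    (base / "models").mkdir(parents=True, exist_ok=True)
    return input_dir


if __name__ == "__main__":
    # Demonstration in a throwaway directory; scans would be copied to the
    # printed input folder.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_project(tmp, "cirurgia"))
```

The processing folder is not created here, since the guide states that OCR4all generates it automatically as the workflow progresses.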

- +
+ \ No newline at end of file diff --git a/guide/user-guide/workflow.html b/guide/user-guide/workflow.html index 6b6a3729..b6f1ec09 100644 --- a/guide/user-guide/workflow.html +++ b/guide/user-guide/workflow.html @@ -12,14 +12,14 @@ - + -

Workflow

Process Flow

This variant (main menu ☰ → Process Flow) allows for a virtually automated workflow. It merely requires an initial selection of the intended scans (sidebar on the right) and a subsequent selection of the individual processing steps the user wishes to apply to the chosen data (fig. 6).

fig. 6. 'Process flow' Subcomponents.

In order to complete the process, choose an appropriate OCR model (or model package, composed of five individual models working simultaneously and in concert – see chapter 4.7). Simply go to ‘settings’ → ‘recognition’ → ‘general’ (as illustrated in fig. 7) and choose from the list of available OCR models (‘line recognition models’ – ‘available’).

fig. 7. Selection of an appropriate OCR model.

Although it is generally possible to choose more than one recognition model, this is only recommended if the scans in question contain more than one printing type.

Finally, start the ‘process flow’ by clicking on ‘execute’. The current stage of this automated processing is shown in the progress bars and can be reviewed at any time. After the workflow’s completion, the results can be verified under the menu item ‘ground truth production’ (☰).

fig. 8. Individual lines with their corresponding OCR results.

If the OCR’s line-based results immediately meet the desired or required recognition accuracy, final results can be generated (TXT and/or PAGE XML) under menu item ‘result generation’. If the results do not meet the user’s requirements, they can be corrected before the final generation (see chapter 4.8).

Aside from this ‘process flow’, OCR4all additionally provides the option of a sequential workflow, which enables the user to execute the software’s individual submodules (see fig. 1) and their components independently and thus to check the correctness and quality of the generated data at each stage. Considering that these submodules build on one another, the sequential workflow is usually the most suitable choice when working with early modern prints and their intricate, complex layout.

We recommend first-time users execute the sequential workflow at least once (as described below) in order to understand the submodules’ operating principles.

Preprocessing

Input: original image (in colour, greyscale or binarized)
Output: straightened binarized or greyscale image

  • This processing step is meant to produce binarized and normalized greyscale images, a basic requirement for a successful segmentation and recognition.
  • Proceed by selecting the relevant scans (sidebar on the right). The settings (‘settings (general)’ and ‘settings (advanced)’) should remain unchanged, meaning that neither the images’ angle nor the automatically determined number of CPUs used by this submodule is altered (the latter applies to all of OCR4all’s subsequent submodules).

fig. 9. Pre-processing settings.

  • Click on ‘execute’ to start binarization. The progress of this step can be tracked on the console, more precisely in the ‘console output’. Warnings may be issued during binarization (in ‘console error’); these do not affect the binarization results.
  • To check whether binarization was successful, go to ‘project overview’, click on any page identifier and then on the display option ‘binary’. In addition, all processed pages should be marked with a green check mark in the project overview.
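To give an idea of what binarization does to a greyscale page, the following sketch applies a single global threshold chosen with Otsu's method. The source does not say which algorithm OCR4all uses, so Otsu's method is swapped in here purely for illustration; the function names and the tiny sample image are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already on the dark side
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image):
    """Map a greyscale image (rows of 0-255 ints) to pure black (0) and white (255)."""
    threshold = otsu_threshold([value for row in image for value in row])
    return [[0 if value <= threshold else 255 for value in row] for row in image]


if __name__ == "__main__":
    page = [[10, 12, 200], [11, 210, 205]]  # dark ink on a light background
    print(binarize(page))
```

A global threshold like this is exactly the kind of operation that turns dark areas outside the page, such as a book cover, into the black lines mentioned in the scan preparation notes.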

Noise Removal

Input: polluted binarized images
Output: binarized images without (or with very little) pollution

  • The noise removal option helps to get rid of small impurities such as stains or blotches on the scans.

  • Proceed by clicking on ‘noise removal’ (main menu) and selecting the scans you wish to process on the right side of your display. You should initially keep the default settings and, after clicking on ‘execute’, check the quality of the results: simply click on the name of the scan you wish to verify (right sidebar); the ‘image preview’ option will provide you with a side-by-side comparison of the image before and after the noise removal. Please note that elements shown in red will be deleted by this step.

fig. 10. Noise removal settings.

  • If too many interfering elements remain on the image, slightly adjust the ‘maximal size for removing contours’ factor upwards and repeat the step by clicking once again on ‘execute’ and subsequently reviewing the results.
  • If too many elements were removed from the image, readjust the ‘maximal size…’ factor downwards.
  • Try again until the results are satisfactory.
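The effect of the ‘maximal size for removing contours’ setting can be pictured with a small sketch that deletes every connected group of ink pixels up to a given size. This is an assumption-laden illustration, not OCR4all's implementation: the function name, the use of pixel counts as the size measure and the 8-neighbour connectivity are all choices made for this example.

```python
from collections import deque


def remove_small_components(image, max_size):
    """Return a copy of a binary image (1 = ink) without ink components of at most max_size pixels."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected component with a breadth-first search.
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as noise and erased.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0],
        [1, 1, 0, 0, 1],  # the lone pixel on the right is a speck
        [0, 0, 0, 0, 0],
    ]
    print(remove_small_components(page, max_size=1))
```

Raising `max_size` removes more (eventually parts of letters such as dots and punctuation), lowering it removes less, which mirrors the adjust-and-review loop described above.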

Segmentation – LAREX

Input: pre-processed images
Output: structural information about the layout regions (type and position) as well as reading order

LAREX is a segmentation tool which structures and classifies a printed page’s layout with regard to its later processing. LAREX is based on the basic assumption that the pages of early modern prints are composed of a recurring array of layout elements whose composition, although always book-specific, is largely consistent. Thus, the user is provided with different tools and resources whose aim it is to adequately structure and segment a printed page in order to catalogue all layout-related information necessary to the workflow’s subsequent steps. Besides the basic distinction between text and non-text (e.g. text vs. image/woodcut) and its further specifications (e.g. text headline, main text, page number etc.), this also includes information about the page’s reading order, i.e. the reading and usage order of the available layout elements.

Initial Settings

  • Menu: click on ‘segmentation’, then on ‘LAREX’
  • Go to ‘Segmentation image type’: select ‘binary’ if you will be working with binarized images, or ‘despeckled’ if the images went through the noise removal process
  • Click on ‘open LAREX’ → LAREX will open in a new tab

fig. 11. LAREX settings.

Once LAREX has opened, the first one of the pre-selected pages will be visible at the centre of your display, including a few initial segmentation results, which are generated by the automatic segmentation each page undergoes when initially opened with LAREX. Please note that these results are not saved. From there, the user will have to adjust the settings, tailoring the initial segmentation results to their particular work’s layout and undertaking a manual post-correction to ensure segmentation accuracy.

fig. 12. LAREX interface with automatic segmentation results.

Overview and toolbar

The left sidebar displays all previously selected scans. Colour-coded markings visible in the bottom right corner indicate the current stage of each scan’s processing:

  • Orange exclamation mark: “there is no segmentation for this page” – no current segmentation results for this page
  • Orange warning sign: “current segmentation may be unsaved”
  • Green floppy disk: “segmentation was saved for this session” – segmentation results have been saved as an XML file
  • Green padlock: “there is a segmentation for this page on the server” – previously saved segmentation results (see the green floppy disk above) have been marked as correct after completion of the entire document’s segmentation (see below).

fig. 13. Different display modes.

  • With the buttons '0' and '1' it is possible to switch between the binarized (black and white) and the normalized (greyscale) display mode. This selection is remembered for all remaining pages of the project, and the display mode can be changed again at any time.
  • In the topbar, you will find different tools and tool categories pertaining to navigation and image processing:

fig. 14. Different menu items in the toolbar.

  • Open a different book: No adjustments are necessary here in any LAREX version integrated in OCR4all.
  • Image Zoom: Enables general settings for image or scan display, such as zoom options. However, these can also be adjusted with your mouse and/or touchpad: shift the displayed page by left-clicking, holding and moving your mouse; zoom using the mouse wheel or touchpad.
  • Undo and Redo: Undo or redo the last user action. Common key combinations also work (e.g. CTRL + Z to undo the last action).
  • Delete selected items: Delete currently selected region.
  • RoI, Region, Segment, Order: In addition to the right sidebar, these are the different options for processing and segmenting scans. While the options featured in the toolbar generally pertain to the current scan’s processing (see below), the right sidebar features project-wide options across all scans.

fig. 15. Right sidebar’s settings.

However, the latter can be amended or adjusted at any time. In this case, we recommend saving all settings made so far under ‘settings’, whether they relate to recognition parameters (‘parameters’) or to document-specific layout elements (‘regions’) previously defined by the user. This ensures that these settings can be applied the next time you work with this tool, enabling you to work with document-specific settings from then on.

Specific settings: ‘regions’, ‘parameters’, ‘reading order’, ‘settings’

  • 'Regions': In accordance with the LAREX concept, each scan (that is, each book page) is composed of several distinct layout elements, e.g. main text (‘paragraph’), title, marginalia, page number, etc. Thus, LAREX requires that corresponding ‘regions’ be assigned to each of these layout elements. This assignment must be performed consistently throughout the entire work, in preparation for further steps as well as for the actual recognition of the displayed content. Besides a small number of predefined layout regions – for instance ‘image’ (graphics such as woodcuts and ornate initials), ‘paragraph’ (main text) or ‘page_number’ – the user can define and add further book-specific layout regions under ‘create’. Not only can the user change a region’s colour, but they can also define the minimum size of a textual/graphical page element which they wish to recognize as such (under ‘minSize’). The layout region thus defined can be added to the book-specific list by clicking on ‘save’.

fig. 16. Range of options under ‘Regions’.

  • Moreover, the ‘regions’ feature enables the user to assign particular layout regions to a fixed and predefined location on the scan which will then be applied to the following scans. Provided a page’s layout is repeated throughout the entire book, the user can generate something of a layout template in order to improve segmentation and reduce the number of necessary corrections later on. In order to adjust the position of these layout regions to a book’s specific layout, simply display the layout region’s current position and adjust it by selecting the scanned page’s regions.

fig. 17. Layout regions display and template.

Once a new region has been defined, its position on the page can be established by clicking on ‘Region’ → ‘Create a region rectangle (Shortcut: 1)’, an option located in the toolbar. This can be undone or changed at any time. Please note that the category ‘images’ can’t be assigned to a layout region on the page.

fig. 18. Defining new layout regions.

All things considered, it isn’t always advisable to assign fixed positions to all layout regions for an entire book; if the position of certain regions such as chapter titles, mottos, page number or signature marks on the different pages is inconsistent, assigning predefined positions will lead to recognition errors. In this case, manually verifying and correcting these layout elements afterwards is the more practical approach. If the user needs to delete a layout region’s position, they can simply select the region in question and press the ‘delete’ key.

  • 'Parameters': Allows you to define overall parameters for image and text recognition. Taking the time to pre-set certain book-specific parameters is recommended when working with an inconsistent layout, particularly that of early modern prints, which often feature great variation in word and line spacing. To prevent a narrowly spaced group of lines from being recognised as one cohesive textual element, the ‘text dilation’ feature lets you control the text’s degree of dilation in the x- and y-direction. This enables the software either to keep apart words or lines that are spaced very closely or to recognise widely spaced passages as one cohesive element. We recommend experimenting in order to find the settings best suited to a particular book.

fig. 19. Parameters settings.
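The influence of dilation in the x- and y-direction can be illustrated with a toy example. The sketch below is not LAREX code: `dilate` and `count_blocks` are hypothetical helpers that spread every ink pixel by a given number of pixels and then count how many separate blocks remain.

```python
def dilate(image, dx, dy):
    """Spread every ink pixel (1) by dx pixels horizontally and dy pixels vertically."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                for ny in range(max(0, y - dy), min(height, y + dy + 1)):
                    for nx in range(max(0, x - dx), min(width, x + dx + 1)):
                        result[ny][nx] = 1
    return result


def count_blocks(image):
    """Count groups of ink pixels that touch horizontally or vertically."""
    height, width = len(image), len(image[0])
    seen, blocks = set(), 0
    for y in range(height):
        for x in range(width):
            if not image[y][x] or (y, x) in seen:
                continue
            blocks += 1
            stack = [(y, x)]
            seen.add((y, x))
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return blocks


if __name__ == "__main__":
    # Two "lines" of two "words" each, separated by one blank row and column.
    text = [
        [1, 1, 0, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 1, 0, 1, 1],
    ]
    for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):
        print(dx, dy, count_blocks(dilate(text, dx, dy)))
```

With x-dilation only, the words of each line merge while the two lines stay apart; with y-dilation only, the lines merge into columns; with both, everything becomes one block. Choosing the values per book is therefore a trade-off between splitting too much and merging too much.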

  • 'Settings': Under ‘Settings’ you can save the previously selected display and segmentation options and load them again after an interruption in segmentation (buttons ‘save settings’ and ‘load settings’). Saving will generate an XML file which you will need to select when loading the settings (click on ‘load settings’ and a new window will open; select the file in question and open it). An additional feature enables you to reload previous pages’ segmentation results if you wish to view them again: simply go to ‘advanced settings’ and click on ‘load now’. This will load any previously saved XML file containing that page’s segmentation results.

fig. 20. Settings.

  • 'Reading Order': In order for the correct order of a page’s textual elements to be taken into account in all steps following segmentation, it is necessary to define these elements’ ‘reading order’ beforehand. This can be done automatically, provided a book’s layout is relatively clear and simple. However, should you be working with a more complex layout structure, we recommend you proceed manually. Simply select ‘auto generate a reading order’ or ‘set a reading order’ under toolbar item ‘Order’.

fig. 21. Reading order selection in toolbar.

By clicking on the auto reading order button, a list of all the page’s textual elements will appear in the right sidebar (under ‘reading order’), sorted from top to bottom. On the other hand, if you wish to manually establish reading order, you will need to click on each of the page’s textual elements, in the correct order (see below), after which this reading order will appear in the aforementioned list. All elements of the reading order can be rearranged with a drag-and-drop or deleted by clicking on the corresponding recycle bin icon. As with everything in LAREX, the reading order can always be changed before saving the final segmentation results.
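As a rough picture of what the automatic option produces, the following sketch sorts regions from top to bottom. The `Region` type, its fields and the left-to-right tie-break are assumptions made for this example; the guide only states that the generated list is sorted from top to bottom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    region_id: str
    x: int  # left edge of the region, in pixels
    y: int  # top edge of the region, in pixels


def auto_reading_order(regions):
    """Return region ids sorted top to bottom (ties broken left to right)."""
    return [r.region_id for r in sorted(regions, key=lambda r: (r.y, r.x))]


if __name__ == "__main__":
    page = [
        Region("main_text", x=50, y=200),
        Region("title", x=50, y=40),
        Region("page_number", x=400, y=10),
    ]
    print(auto_reading_order(page))
```

A purely positional sort like this places a page number at the top of the page first and interleaves the lines of multi-column layouts, which is why manual ordering is recommended for complex pages.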

Exemplary page segmentation

With each page loading, LAREX automatically generates segmentation results, which then only need to be corrected. The following exemplary segmentation process uses page 4 of the reference book Cirurgia, which you can download here when downloading the OCR4all folder structure.

Error analysis: Which layout elements were recognised correctly, which incorrectly, and which not at all? Are there any user marks in the margins, borders, spots or text elements which will influence segmentation but which you do not want to be recognised?

fig. 22. Auto generated results, Cirurgia page 4.

'Region of interest' (RoI): Defining a RoI will help exclude certain sections of your page, situated outside the area later subjected to recognition but which can negatively impact segmentation (such as user marks, impurities, library stamps, etc.). To do so, go to toolbar and click on ‘Set the region of interest’ (under ‘RoI’), then use left click-and-hold to draw a rectangle around the page section you wish to segment.

fig. 23. Defining a 'region of interest'.

Once the RoI has been defined, click on the 'SEGMENT' button (right sidebar): all elements situated outside the RoI are now excluded from any further steps. The RoI is also automatically transposed to all of the book's scans. However, due to a wide array of factors, the page sections relevant to segmentation can shift from scan to scan. Therefore, as processing progresses, the user will likely need to adjust the RoI from time to time. To do so, simply click on any RoI section and shift it using the mouse. Independently of the RoI, the 'Create an ignore rectangle' option creates an 'ignore region' which allows certain small sections of a scan to be ignored and thus excluded from segmentation.
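Conceptually, the RoI acts as a filter on what is passed to segmentation. The sketch below illustrates this with plain rectangles; the `Rect` type and the rule that an element must lie entirely inside the RoI to be kept are assumptions for this example, and how LAREX treats elements that only partly overlap the RoI is not stated in the guide.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def inside(inner, outer):
    """True if inner lies completely within outer."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


def apply_roi(elements, roi):
    """Keep only the elements located entirely inside the region of interest."""
    return [element for element in elements if inside(element, roi)]


if __name__ == "__main__":
    roi = Rect(8, 8, 100, 100)
    elements = [
        Rect(10, 10, 50, 20),     # text block inside the RoI
        Rect(0, 0, 5, 5),         # library stamp in the corner
        Rect(90, 90, 120, 120),   # user mark reaching beyond the RoI
    ]
    print(apply_roi(elements, roi))
```

Because the same rectangle is reused for every scan, a page that is shifted within its scan can fall partly outside it, which is why the RoI occasionally needs to be moved.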

Correcting layout recognition flaws: Incorrectly recognized layout elements can be assigned a new type manually: a right click on the element in question opens a pop-up menu from which you can choose the correct designation.

fig. 24. Correcting a faulty typification.

Should you need to separate a title from another textual element with which it is fused, there are three ways to proceed. First, you can draw a rectangle around the section you wish to classify: go to the toolbar, click on ‘Segment’ and select ‘Create a fixed segment rectangle’ (shortcut: 3); using the mouse, draw a rectangle around the relevant section – a pop-up menu will appear from which its correct designation/type can be chosen. Second, you can instead use a polygon shape. This option is particularly suited to more complex or chaotic layouts and/or those comprising angled edges, rounded pictures and woodcuts, or ornate initials inside the text block. Go to the toolbar, click on ‘Segment’ and this time select the ‘Create a fixed segment polygon’ option (shortcut: 4). Using the mouse, draw a dotted line around the entire relevant section – once the line’s end has been joined to its starting point, creating a polygon, the aforementioned pop-up menu will appear to allow for designation. Finally, you can also separate a text block – initially recognized as one paragraph – into a title and main text using a cut line: simply go to the toolbar, click on ‘Segment’ and select the ‘Create a cut line’ option (shortcut: 5).

fig. 25. Toolbar: selecting cut line option.

Using the left mouse button, draw a line through the element you wish to separate, clicking along its path to adjust it as needed; end the line with a double click.

fig. 26. Drawing a line between two layout elements to be separated.

Click on 'Segment' to trigger the separation. Afterwards, the title element can be given its correct type using a right click and the pop-up menu (as shown below).

fig. 27. Correcting typification of separated sections.

If at any time you wish to delete layout components, inaccurate cut lines or polygons, etc., simply click on the relevant element and use the ‘Delete’ key or the ‘Delete selected items’ option in the toolbar.

Determining 'Reading Order' (see below):

fig. 28. Determining reading order.

Saving current scan’s segmentation results: Save your segmentation results by clicking on ‘Save results’ or with Ctrl + S. This will automatically generate an XML file containing those results inside the OCR4all folder structure.

fig. 29. Saving segmentation results.

Afterwards, you can proceed to the next scan (left sidebar). If you wish to redo or change a scan’s segmentation, you can do so at any time: simply save the new results, and the previous XML file will be automatically deleted and replaced with the new version.

Additional processing options

OCR4all also provides the following scan processing options:

  • When deleting layout elements or joining separate ones to form a single region, you can select all relevant elements simultaneously by pressing and holding the 'Shift' key and drawing a rectangle around the entire region using your mouse. The relevant regions must be located entirely inside the rectangle. Once done, the selected region will be surrounded by a blue frame.
  • 'Select contours to combine (with 'C') to segments (see function combine)' (shortcut: 6): this tool is perfect for reaching optimal segmentation results even when working with scans featuring a densely packed and detailed print layout. The basic idea is that layout elements only be delimited by the contours of the individual letters/pictures they are composed of, thus solving the problems created by manual segmentation such as excessively broad margins, which can in turn hamper the OCR performance. To use this feature, click on the relevant button (toolbar) or use shortcut 6. All components of the scan recognized as layout elements will be coloured blue.

fig. 30. Showing contours.

Select individual letters or even parts of letters by clicking on them.

fig. 31. Selecting contours.

You can also apply your selection to an individual group of letters, entire words or text lines, or sections of a layout element (see above: ‘Shift’ + selection with rectangle). Use shortcut C after selection in order to include all selected items – be they letters, words, lines, etc. – in one new layout element, regardless of the layout region they previously belonged to. This new element’s edges will be far more precise than those of an automatically generated one, enabling a particularly accurate segmentation superior to that of standardised tools.

fig. 32. Aggregating selected items to create new element.

Save the new element by clicking on ‘Segment’. The new element can be renamed as described above.

fig. 33. Typifying new layout element.

  • 'Combine selected segments or contours' (shortcut: C): In order to combine several, distinct layout elements into a new element, select the entire region in question (see above) and click on corresponding button (toolbar) or use shortcut C.
  • 'Fix/unfix segments, for it to persist a new auto segmentation' (Shortcut: F): This function enables you to keep an element in place through subsequent segmentation rounds. Mark the element in question by clicking on it, then use shortcut F or the corresponding button in the toolbar. Fixed (i.e. pinned) elements will appear surrounded by a dotted line. If you wish to undo the fixation, simply repeat the operation.
  • Zoom: Use mouse wheel to zoom in and out of display. Use space key to reset display to its original size.
  • When working with a very complex and intricate layout, targeted interventions can help increase the precision and quality of segmentation results. The contours of all layout elements (recognized as such) consist in fact of many individual lines, separated by dots.

fig. 34. Layout element contours.

  • These tiny dots can be moved, individually or in groups, e.g. to avoid collision between different layout elements in a dense setting. Use a left click-and-hold to move a dot, click on the line to create a new dot, use 'delete' key to delete a dot.
  • Load results: a scan’s existing segmentation results will be sourced from OCR4all folder structure and directly loaded to LAREX.

Final steps with LAREX

Once a document’s entire segmentation has been completed with LAREX (i.e. once segmentation results have been saved for all pages), results can be found in the OCR4all folder structure. In order to make sure that results were correctly saved, simply go to menu item ‘post correction’, in the ‘segments’ bar (see below).

Line Segmentation

Input: pre-processed images and segmentation information (in the form of PAGE XML files)
Output: extracted text lines saved in those PAGE XML files

  • This step directly prepares the OCR process by splitting all previously defined and classified layout elements into separate text lines (this is a necessary step, as the OCR is based on line recognition). All results are then automatically saved in the corresponding PAGE XML files.

fig. 35. Line segmentation settings.

  • Generally speaking, all existing settings can be retained. There are, however, a few restrictions when it comes to page layout: if you are working with pages featuring two or more text columns (and if those have been previously defined as separate, individual main text blocks in LAREX), you will need to change the ‘maximum # of whitespace column separators’ which is pre-set at -1.
    • 'Whitespace column separators' are the white columns devoid of text found around text blocks.
    • When working with a two-column layout whose text is continuous (i.e. where the first lines of the two columns do not form a semantic unit), you will need to set the ‘maximum # of whitespace column separators’ to 3. This number corresponds to the whitespace on both sides of the columns and to the whitespace situated between them.
    • When working with a three-column layout, set the 'whitespace' number to 4, and so on.
  • Once all desired settings are chosen, click on ‘execute’. Afterwards, check the generated results under ‘Project Overview’.
  • Using the ‘settings (advanced)’ option is especially useful when working with line segmentation, particularly if errors are reported (and shown on the interface). For instance, small letters will often fall short of the default minimal line width. You can adjust this minimal width by reducing the ‘minimum scale permitted’, which can be found under menu item ‘limits’. This will enable you to redo the line segmentation correctly.
  • You can generally check the accuracy of line segmentation by clicking on the ‘lines’ button (under menu item ‘post correction’).
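The rule for the ‘maximum # of whitespace column separators’ given above (3 for two columns, 4 for three, and so on) amounts to one separator more than the number of text columns. A minimal sketch, in which the function name is hypothetical and the single-column case simply returns the pre-set value of -1 mentioned in the guide:

```python
DEFAULT_SEPARATORS = -1  # pre-set value, kept for single-column pages


def max_whitespace_column_separators(text_columns):
    """Suggest the setting for a page with the given number of text columns."""
    if text_columns < 1:
        raise ValueError("a page needs at least one text column")
    if text_columns == 1:
        return DEFAULT_SEPARATORS
    # One whitespace strip on each outer side plus one between neighbouring columns.
    return text_columns + 1


if __name__ == "__main__":
    for columns in (1, 2, 3):
        print(columns, max_whitespace_column_separators(columns))
```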

Recognition

Input: Text lines and one or more OCR models
Output: OCR-output in the form of text for each of the PAGE XML files at hand

  • This step is where the actual text recognition takes place based on the individual lines and textual layout elements identified during line segmentation (see above).
  • Select menu item 'Recognition': in the right sidebar, you will only find your document's scans (or rather printed pages) for which all OCR pre-processing steps have been completed, by which we mean all previously explained steps - bar 'noise removal'. Please select the scans for which you wish to produce an OCR text.
  • Go to 'line recognition models' (under 'available') and select all models or model packages relevant to the typographical recognition of your text (e.g. early modern/historical Gothic type, italic/cursive type, historical Antiqua etc.). We expressly advise the use of a model package, where five models simultaneously work and interact with each other! This is much preferable to using only one model at a time. You can select all models you wish to add to your package by clicking on each of them - they will automatically be added to the 'selected' category. When dealing with a large number of models, you can find them by using the 'search' function.

fig. 36. Selection of model package for text recognition.

  • You likely won't need to adjust any of the advanced settings.
  • Click on 'execute' and oversee the text recognition progress on the console.
  • Once recognition is finished, you will be able to view all results under menu item 'ground truth production'.

Ground Truth Production

Input: text line images and their corresponding OCR output when available
Output: line based ground truth

  • Under menu item 'ground truth production' you will be able to view the texts generated during 'recognition', correct them and save them as training material. This corrected text is the so-called 'ground truth'.
  • The correction tool used in this step is divided into two parts. On the left-hand side are the (selectable) scans. In the middle, you will find the segmented text line images (see above for workflow) as well as their corresponding OCR text lines, placed directly underneath. We call this standard display 'text view'.

fig. 37. Ground truth production with 'text view'.

Clicking on the 'Switch to page view' button will bring you to the so-called 'page view' display, in which you can work on all text lines while they are displayed in relation to the entire page layout. By clicking 'switch to text view', you will return to the initial 'text view' display.

fig. 38. Ground truth production with 'page view'.

  • On the right-hand side of the display, you will find the virtual keyboard, with which you can insert special characters such as ligatures, abbreviations, diacritical signs etc. Simply place your cursor where you wish to insert a special character and then click on said character in the virtual keyboard. In order to add new characters to the virtual keyboard, click on the plus icon, paste the character into the blank field and click on 'save'. If you wish to delete characters from the virtual keyboard, drag and drop said character onto the recycle bin icon. Once all necessary/desired changes have been made, click on 'save' and 'lock'. Using the 'load' and 'save' buttons, you can save different virtual keyboards specific to any particular document. Once a virtual keyboard has been saved, it can be re-loaded at any time, which is particularly useful when you need to interrupt correction, or if you want to use this keyboard for another document for which it is suited.
  • In order to correct individual lines in 'text view' mode, click on the line in question: you can now correct and edit it. (When working with 'page view', you will need to click on the line you wish to edit first, after which a text field will appear in which you can make corrections/edits as well.) Use the 'tabulator' key to go to the next line, and so on. All following steps are identical in both views. Once a text line has been completely and satisfactorily corrected, press the 'enter' key. The line will be coloured green, meaning it will be automatically saved as 'ground truth' in OCR4all once the entire page has been completed and saved (by clicking on 'save result' or using the shortcut Ctrl + S). Once a line has been identified as ground truth, it can be used as a basis for OCR training as well as to evaluate the OCR model you used.
  • If there are erroneously recognised text line images among your pairs of text line images and corresponding OCR text lines, please leave the corresponding OCR text lines empty so that they do not cause problems during OCR model training.
  • If you conclude, while working on ground truth production, that the quality of the text recognition achieved with mixed models is not satisfactory, you can train a model targeted towards the specific kind of document you are working on before carrying out the final, manual text correction. This step will generally increase the recognition quality.

Evaluation

Input: line based OCR texts and corresponding ground truth
Output: error statistics

  • Under menu item 'evaluation', users can check the recognition rate of the model(s) currently in use.

  • In order to generate an evaluation, go to right sidebar and select all the scans recognized with the help of said model and subsequently corrected during 'ground truth production'.

  • Click on 'execute': a table will appear in the console. At the top, you will see the percentage of errors as well as the full count of errors ('errs'). All identified errors are listed underneath, displayed as a table comparing the initially recognized text ('PRED', right-hand column) with the results of ground truth production ('GT', left-hand column). Behind each error item, you will see the frequency of that particular type of error as well as its percentage compared to the entire error count.

fig. 39. Evaluation results with general error rate, ten most frequent errors as well as their percentage compared to entire error count.

  • Based on this table and the resulting recognition rate (100% minus the error rate), users can evaluate whether a new training using individual, targeted models is necessary.
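To make the figures reported by the evaluation more concrete, the sketch below computes a character error rate and a list of confusions from pairs of ground truth and predicted lines. It is an illustration only: the function names are hypothetical, the error count uses plain Levenshtein distance, and the confusions are taken from the alignment produced by Python's difflib, which may differ from the way the evaluation in OCR4all counts and groups errors.

```python
from collections import Counter
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(pairs) -> float:
    """Total edit distance divided by the number of ground truth characters."""
    errs = sum(levenshtein(gt, pred) for gt, pred in pairs)
    chars = sum(len(gt) for gt, _ in pairs)
    return errs / chars if chars else 0.0


def confusions(pairs) -> Counter:
    """Count (GT, PRED) fragments that differ according to difflib's alignment."""
    counts = Counter()
    for gt, pred in pairs:
        matcher = SequenceMatcher(None, gt, pred, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(gt[i1:i2], pred[j1:j2])] += 1
    return counts


# Made-up example lines: (ground truth, OCR prediction)
pairs = [("Wasser", "Waffer"), ("und", "vnd")]
rate = error_rate(pairs)
print(f"error rate: {rate:.2%}, recognition rate: {1 - rate:.2%}")
for (gt, pred), count in confusions(pairs).most_common(10):
    print(f"GT {gt!r} -> PRED {pred!r}: {count}")
```

The recognition rate printed here is simply 100% minus the error rate, as described in the last step above.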

Training

Input: text line images with corresponding ground truth (optionally, existing OCR models can be included as well; these serve as so-called 'pre-training' and as the basis for model training)
Output: one or more OCR model(s)

The aim of our software is to produce a text containing as few errors as possible. In that case, why is it even necessary to use the training module and produce models targeted to your document, instead of simply correcting it manually? In fact, the better the recognition model, the shorter the correction time. The idea of continuous model training is to train increasingly better models through continuous corrections, which in turn will reduce the amount of corrections needed for the next pages, and so on.

  • With this training tool, users will be able to train models tailored to their document, based on the lines of ground truth available for this document. In order to begin training, please proceed to the following adjustments in general settings:
    • Set the 'Number of folds to train' (i.e. the number of models to train) to 5. → Training will occur with a model package containing five individual models.
    • 'Only train a single fold box': please don't fill out this box!
    • Set the 'Number of models to train in parallel' at -1. → All training models will be trained simultaneously.
    • If all characters contained in the pretraining model need to be kept in the model you wish to train (i.e. added to its so called whitelist), please check the 'Keep codec of the loaded model(s)' box.
    • In effect, the 'Whitelist characters to keep in the model' is the exhaustive list of characters used during training and in the subsequently generated model. Any character not contained in the whitelist won't be included in the process.
    • 'Pretraining': Either 'Train each model based on different existing models' (a menu will appear containing five dropdown lists. Inside each of them, enter one of the five models belonging to the model package used as advised earlier. Regardless of the training step (be it the first round or the third), always enter the five models used since the beginning) or 'Train all models based on one existing model' (click on this setting if you started training using only one model. Simply select that exact training model for each repetition of the training process).
    • 'Data augmentation': Please don't fill out this box! This function describes the data augmentation per line. Users can enter a number, e.g. 5, in order to increase the amount of training material. This can lead to the generation of better performing models. However, this process is more time-costly than the standard route.
    • 'Skip retraining on real data only': Please don't fill out this box!
  • The advanced settings remain unchanged.
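As a rough illustration of what 'Number of folds to train' means, the sketch below splits the available ground truth lines into five folds and builds one training/validation split per model. It assumes the usual cross-fold scheme, in which each model holds out a different fold for validation; the function name and the round-robin assignment of lines to folds are illustrative assumptions, not OCR4all's actual code.

```python
def make_folds(lines, n_folds=5):
    """Return one (training lines, validation lines) pair per fold."""
    if n_folds < 2:
        raise ValueError("cross-fold training needs at least two folds")
    # Assumption: lines are dealt out round-robin, so all folds have similar size.
    folds = [lines[i::n_folds] for i in range(n_folds)]
    splits = []
    for held_out in range(n_folds):
        train = [line for i, fold in enumerate(folds) if i != held_out for line in fold]
        splits.append((train, folds[held_out]))
    return splits


# Hypothetical ground truth line identifiers
gt_lines = [f"line_{i:03d}" for i in range(10)]
for index, (train, val) in enumerate(make_folds(gt_lines)):
    print(f"model {index}: {len(train)} training lines, {len(val)} validation lines")
```

Under this scheme every ground truth line is used for training by four of the five models and for validation by exactly one.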

fig. 40. Settings for the training of document-specific models.

  • Click on 'execute' to start training. You will be able to view the training progress at any time in the console. Training time will vary depending on the total amount of ground truth lines.
  • In accordance with the aforementioned settings, a model package (containing five individual models and tailored to your document's exact needs) will be generated through training and automatically saved in folder ocr4all/models/document title/0. Going forward, this model package will be labelled '0'. From this point on, while working on this document and striving towards improving recognition, you will be able to select said package under menu item 'recognition' among other models, when working with new pages from the same document. If you wish to generate a second document-specific model package (e.g. to improve the first one's weaknesses), simply repeat the process as described above. This new model will be labelled '1', and so on.

Post Correction

Input: segmentation information and metadata on pre-processed scans, as well as the corresponding text
Output: corrected/improved segmentation info and text

Under menu item 'post correction', users will be able to manually adjust and correct all segmentation info and text generated in the course of the previous sub-modules. This sub-module is itself divided into three levels:

  • The item 'segment' (i.e. level 1) will enable you to adjust all regions determined during segmentation and their reading order, page after page. You will recognize a few of the tools from working with LAREX (see above). Please note that all changes undertaken at this level will have consequences for the following levels. For example, if you decide to delete a certain region during level 1, you will lose all text lines belonging to this region going forward.
  • The 'lines' item (i.e. level 2) enables you to manually adjust automatic line recognition. You will be able to add lines where there were none, to change their shape or position, or to delete them. The reading order can be adjusted as well, on a line basis.

fig. 41. Adjusting line-based reading order during post correction.

  • Under item 'text' (i.e. level 3), you will find the afore-described ground truth submodule, in which the text content of your lines can be corrected once more.

Result Generation

Input: line-based OCR results, ground truth (optional - only if at hand) and the LAREX-segmentation and line-segmentation data
Output: final text output (lines will be re-grouped into pages and full-text) as well as page based PAGE XML

fig. 42: Result Generation.

  • Once the user considers all recognition and correction steps to be finalized, results can be generated as TXT or XML files, saved under ocr4all/data/results.
  • You can choose whether you need a text or PAGE XML file under 'settings'. If you opt for a text file, individual TXT files will be generated for each scan as well as an additional one containing your document's entire text.
  • PAGE XML files are also generated on a page-base and additionally contain data about creation date, last changes in the file, metadata about each page's corresponding scan, about the page's size, its layout regions and their exact coordinates, its reading order, its text lines and their text content.
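For readers who want to process the generated PAGE XML further, the following sketch collects the text of every text line on a page using only Python's standard library. It assumes the common PAGE layout in which each TextLine element has a TextEquiv child containing a Unicode element; it ignores the XML namespace (which depends on the PAGE schema version) and, if a line carries several TextEquiv entries, simply takes the first one. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def page_text_lines(xml_string: str) -> list:
    """Return the text of all TextLine elements in document order."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only look at direct children, so region- or word-level text is skipped.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for sub in child:
                    if local_name(sub.tag) == "Unicode":
                        lines.append(sub.text or "")
                        break
                break  # assumption: the first TextEquiv of a line is the one we want
    return lines


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png" imageWidth="100" imageHeight="200">
    <TextRegion id="r0" type="paragraph">
      <TextLine id="r0_l0"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="r0_l1"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>region text is skipped</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

print("\n".join(page_text_lines(SAMPLE)))
```

To read a file from the results folder instead of the sample string, pass the file's contents to `page_text_lines`.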

Workflow

Process Flow

This variant (main menu ☰ → Process Flow) allows for a virtually automated workflow. It merely requires the initial pick of the intended scans (sidebar on the right) and subsequent selection of the individual processing steps the user wishes to apply to the chosen data (fig. 5).

fig. 6. 'Process flow' Subcomponents.

In order to complete the process, choose an appropriate OCR model (or model package, composed of five individual models working simultaneously and in concert – see chapter 4.7). Simply go to ‘setting’ → ‘recognition’ → ‘general’ (as illustrated in fig. 6) and choose from the list of available OCR models (‘line recognition models’ – ‘available’).

fig. 7. Selection of an appropriate OCR model.

Although it is generally possible to choose more than one recognition model, this is only recommended if the scans in question contain more than one printing type.

Finally, start the ‘process flow’ by clicking on ‘execute’. The current stage of this automated processing is shown in the progress bars and can be reviewed at any time. After the workflow’s completion, the results can be verified under the menu item ‘ground truth production’ (☰).

fig. 8. Individual lines with their corresponding OCR results.

If the OCR’s line-based results immediately meet the desired or required accuracy of recognition, final results can be generated (TXT and / or PAGE XML) under menu item ‘result generation’. Were those results not to meet the user’s requirements, they can be once more corrected before the final generation (see chapter 4.8).

Aside from this ‘process flow’, OCR4all additionally provides the option of a sequential workflow which enables the user to independently execute the software’s individual submodules (see fig. 1) and their components, thus ensuring the correctness and quality of the generated data. Considering that these submodules build on one another, the sequential workflow is usually the most suitable choice when working with early modern prints and their intricate, complex layout.

We recommend first-time users execute the sequential workflow at least once (as described below) in order to understand the submodules’ operating principles.

Preprocessing

Input: original image (in colour, greyscale or binarized)
Output: straightened binarized or greyscale image

  • This processing step is meant to produce binarized and normalized greyscale images, a basic requirement for a successful segmentation and recognition.
  • Proceed by selecting the relevant scans (sidebar on the right) – the settings must remain unchanged (‘settings (general)’ and ‘settings (advanced)’), meaning that the images’ angle as well as the automatically generated number of CPUs used by this particular submodule don’t vary either (the latter pertains to all of OCR4all’s subsequent submodules).

fig. 9. Pre-processing settings.

  • Click on ‘execute’ to start binarization. The progress of this step can be tracked on the console, more precisely in the ‘console output’. Warnings might be issued during the binarization process (in ‘console error’); these have no effect on the binarization results.
  • In order to check the binarization’s success, simply go to ‘project overview’ and click on any page identifier then on the display option ‘binary’. In addition, all processed pages should be marked with a green check mark in the project overview.

Noise Removal

Input: polluted binarized images
Output: binarized images without (or with very little) pollution

The noise removal option helps to get rid of small impurities such as stains or blotches on the scans.

  • Proceed by clicking on ‘noise removal’ (main menu) and selecting the scans you wish to process on the right side of your display. You should initially conserve the default settings and, after clicking on ‘execute’, check the initial quality of the results: simply click on the designation of the scan you wish to verify (right sidebar); the ‘image preview’ option will provide you with a side by side comparison of the image before and after the noise removal. Please note that red elements will be deleted by this step.

fig. 10. Noise removal settings.

  • If too many interfering elements remain on the image, slightly adjust the ‘maximal size for removing contours’ factor upwards and repeat the step by clicking once again on ‘execute’ and subsequently reviewing the results.
  • If too many elements were removed from the image, readjust the ‘maximal size…’ factor downwards.
  • Try again until the results are satisfactory.
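The effect of the ‘maximal size for removing contours’ setting can be pictured as deleting every connected group of black pixels that is no larger than a threshold. The sketch below does this on a tiny binary image represented as nested lists. Measuring size as a pixel count with 4-connectivity is an assumption made for illustration; the actual noise removal step may measure contours differently.

```python
from collections import deque


def remove_small_components(image, max_size):
    """Return a copy of `image` (rows of 0/1, 1 = ink) without small ink blobs."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width \
                            and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0  # these pixels would be shown in red and removed
    return result


scan = [
    [1, 0, 0, 0],  # a single speck in the top-left corner
    [0, 0, 1, 1],
    [0, 0, 1, 1],  # a 2x2 block standing in for a letter
]
for row in remove_small_components(scan, max_size=1):
    print(row)
```

Raising the threshold removes more (here, a value of 4 would also delete the 2x2 block), which mirrors the advice to adjust the factor upwards when noise remains and downwards when too much is removed.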

Segmentation – LAREX

Input: pre-processed images
Output: structural information about the layout regions (type and position) as well as reading order

LAREX is a segmentation tool which structures and classifies a printed page’s layout with regard to its later processing. LAREX is based on the basic assumption that the pages of early modern prints are composed of a recurring array of layout elements whose composition, although always book-specific, is largely consistent. Thus, the user is provided with different tools and resources whose aim it is to adequately structure and segment a printed page in order to catalogue all layout-related information necessary to the workflow’s subsequent steps. Besides the basic distinction between text and non-text (e.g. text vs. image/woodcut) and its further specifications (e.g. text headline, main text, page number etc.), this also includes information about the page’s reading order, i.e. the reading and usage order of the available layout elements.

Initial Settings

  • Menu: click on ‘segmentation’, then on ‘LAREX’
  • Go to ‘Segmentation image type’: select ‘binary’ if you will be working with binarized images, or ‘despeckled’ if the images went through the noise removal process
  • Click on ‘open LAREX’ → LAREX will open in a new tab

fig. 11: LAREX settings.

Once LAREX has opened, the first one of the pre-selected pages will be visible at the centre of your display, including a few initial segmentation results, which are generated by the automatic segmentation each page undergoes when initially opened with LAREX. Please note that these results are not saved. From there, the user will have to adjust the settings, tailoring the initial segmentation results to their particular work’s layout and undertaking a manual post-correction to ensure segmentation accuracy.

fig. 12. LAREX interface with automatic segmentation results.

Overview and toolbar

The left sidebar displays all previously selected scans. Colour-coded markings visible in the bottom right corner indicate the current stage of each scan’s processing:

  • Orange exclamation mark: “there is no segmentation for this page” – no current segmentation results for this page
  • Orange warning sign: “current segmentation may be unsaved”
  • Green floppy disk: “segmentation was saved for this session” – segmentation results have been saved as an XML file
  • Green padlock: “there is a segmentation for this page on the server” – previously saved segmentation results (see the previous item) have been marked as correct after completion of the entire document’s segmentation (see below).

fig. 13. Different display modes.

  • With the buttons '0' and '1' it is possible to switch between the binarized (black and white) and the normalized (grayscale) display mode. This selection is noted for all remaining pages of the project. It is possible to change the display mode again at any time.
  • In the topbar, you will find different tools and tool categories pertaining to navigation and image processing:

fig. 14. Different menu items in the toolbar.

  • Open a different book: No settings adjustments necessary for all LAREX versions as integrated in OCR4all.
  • Image Zoom: Enables general settings for image or scan display, such as zoom options. However, these can also be adjusted with your mouse and/or touchpad: shift the displayed page by left click-and-holding and moving your mouse; zoom using mouse wheel or touchpad.
  • Undo and Redo: Undo or redo the last user action. Common key combinations also work (e.g. CTRL + Z to undo the last action).
  • Delete selected items: Delete currently selected region.
  • RoI, Region, Segment, Order: In addition to the right sidebar, these are the different options for processing and segmenting scans. While the options featured in the toolbar generally pertain to the current scan’s processing (see below), the right sidebar features project-wide options across all scans.

fig. 15. Right sidebar’s settings.

However, the latter can be amended, changed or adjusted at any time. In this case, we recommend saving all previously carried-out settings, whether they be related to recognition parameters (‘parameters’) or to document-specific layout elements (‘regions’) previously determined by the user, in ‘settings’. This will ensure these particular settings are applied the next time you work with this tool, enabling you to work with document-specific settings from then on.

Specific settings: ‘regions’, ‘parameters’, ‘reading order’, ‘settings’

  • 'Regions': In accordance with the LAREX concept, each scan (that is, each book page) is composed of several, distinct layout elements, e.g. main text (‘paragraph’), title, marginalia, page number, etc. Thus, LAREX requires that corresponding ‘regions’ be assigned to each of these layout elements. This assigning task must be consistently performed throughout the entire work, in preparation for further steps as well as for the actual recognition of the displayed content! Besides a small number of pre-set and predefined layout regions – for instance ‘image’ (graphics such as woodcuts and ornate initials), ‘paragraph’ (main text) or ‘page_number’ – the user can define and add further book-specific layout regions under ‘create’. Not only can the user change a region’s colour, but they can also define the minimum size of a textual/graphical page element which they wish to recognize as such (under ‘minSize’). The layout region thus defined can be added to the book-specific list by clicking on ‘save’.

fig. 16. Range of options under ‘Regions’.

  • Moreover, the ‘regions’ feature enables the user to assign particular layout regions to a fixed and predefined location on the scan which will then be applied to the following scans. Provided a page’s layout is repeated throughout the entire book, the user can generate something of a layout template in order to improve segmentation and reduce the number of necessary corrections later on. In order to adjust the position of these layout regions to a book’s specific layout, simply display the layout region’s current position and adjust it by selecting the scanned page’s regions.

fig. 17. Layout regions display and template.

Once a new region has been defined, its position on the page can be established by clicking on ‘Region’ → ‘Create a region rectangle (Shortcut: 1)’, an option located in the toolbar. This can be undone or changed at any time. Please note that the category ‘images’ can’t be assigned to a layout region on the page.

fig. 18. Defining new layout regions.

All things considered, it isn’t always advisable to assign fixed positions to all layout regions for an entire book; if the position of certain regions such as chapter titles, mottos, page number or signature marks on the different pages is inconsistent, assigning predefined positions will lead to recognition errors. In this case, manually verifying and correcting these layout elements afterwards is the more practical approach. If the user needs to delete a layout region’s position, they can simply select the region in question and press the ‘delete’ key.

  • 'Parameters': Allows you to define overall parameters of image and text recognition. Taking the time to pre-set certain book-specific parameters is recommended when working with an inconsistent layout, particularly that of early modern prints. These often feature great variation in word and line spacing. To prevent a narrowly spaced group of lines from being recognised as one cohesive textual element, the ‘text dilation’ feature enables you to control the text’s degree of dilation in the x- and y-direction. Depending on the values chosen, the software will either keep narrowly spaced words/lines apart or recognise widely spaced passages as one cohesive element. We recommend trying and testing in order to find the settings best suited to a particular book.

fig. 19. Parameters settings.
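The idea behind text dilation can be illustrated with a small sketch: every black pixel is grown by a given number of pixels in the x- and y-direction, so that glyphs or lines lying close enough together fuse into one block. The function below is a plain-Python illustration on nested lists; how its parameters `dx` and `dy` correspond to the ‘text dilation’ values in LAREX is an assumption.

```python
def dilate(image, dx, dy):
    """Grow every ink pixel (value 1) by dx pixels horizontally and dy vertically."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                # Paint the rectangle around the pixel, clipped to the image borders.
                for ny in range(max(0, y - dy), min(height, y + dy + 1)):
                    for nx in range(max(0, x - dx), min(width, x + dx + 1)):
                        out[ny][nx] = 1
    return out


line = [[1, 0, 0, 1]]             # two glyphs separated by a gap of two pixels
print(dilate(line, dx=0, dy=0))   # no dilation: the gap stays open
print(dilate(line, dx=1, dy=0))   # each glyph grows by one pixel, closing the gap
```

A larger x-value therefore tends to merge widely spaced words into one element, while a smaller y-value helps keep tightly set lines apart.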

  • 'Settings': Under ‘Settings’ you can save the previously selected display and segmentation options and load them again after an interruption in segmentation (buttons ‘save settings’ and ‘load settings’). Saving will generate an XML file which you will need to select when loading the settings (click on ‘load settings’, a new window will open; select the file in question and open it). An additional feature will enable you to re-load previous pages’ segmentation results if you wish to view them again: simply go to ‘advanced settings’ and click on ‘load now’. This will load any previously saved XML file containing that page’s segmentation results.

fig. 20: Settings.

  • 'Reading Order': In order for the correct order of a page’s textual elements to be taken into account in all steps following segmentation, it is necessary to define these elements’ ‘reading order’ beforehand. This can be done automatically provided a book’s layout be relatively clear and simple. However, should you be working with a more complex layout structure, we recommend you proceed manually. Simply select ‘auto generate a reading order’ or ‘set a reading order’ under toolbar item ‘Order’.

fig. 21. Reading order selection in toolbar.

By clicking on the auto reading order button, a list of all the page’s textual elements will appear in the right sidebar (under ‘reading order’), sorted from top to bottom. On the other hand, if you wish to manually establish reading order, you will need to click on each of the page’s textual elements, in the correct order (see below), after which this reading order will appear in the aforementioned list. All elements of the reading order can be rearranged with a drag-and-drop or deleted by clicking on the corresponding recycle bin icon. As with everything in LAREX, the reading order can always be changed before saving the final segmentation results.
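The automatic ordering described above, which sorts elements from top to bottom, can be sketched as a sort on each region's upper edge. The `Region` class and its fields are hypothetical and exist only for this illustration; using the left edge as a tie-breaker is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def auto_reading_order(regions):
    """Sort regions from top to bottom; ties are broken from left to right."""
    return sorted(regions, key=lambda region: (region.y, region.x))


page = [
    Region("column 2, top", x=600, y=100),
    Region("column 1, bottom", x=50, y=700),
    Region("column 1, top", x=50, y=100),
    Region("heading", x=50, y=20),
]
for position, region in enumerate(auto_reading_order(page), start=1):
    print(position, region.name)
```

On this two-column example the sort places the top of column 2 before the bottom of column 1, which illustrates why setting the reading order manually is recommended for more complex layouts.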

Exemplary page segmentation

With each page loading, LAREX automatically generates segmentation results – these only need to be subsequently corrected. The following, exemplary segmentation process uses page 4 of reference book Cirurgia, which you can download here when downloading the OCR4all folder structure.

Error analysis: Which layout elements were recognised correctly, which incorrectly, and which not at all? Are there any user marks in the margins, borders, spots or elements of text which will influence segmentation but which you do not want to be recognised?

fig. 22. Auto generated results, Cirurgia page 4.

'Region of interest' (RoI): Defining a RoI will help exclude certain sections of your page, situated outside the area later subjected to recognition but which can negatively impact segmentation (such as user marks, impurities, library stamps, etc.). To do so, go to toolbar and click on ‘Set the region of interest’ (under ‘RoI’), then use left click-and-hold to draw a rectangle around the page section you wish to segment.

fig. 23. Defining a 'region of interest'.

Once the RoI has been defined, click on the 'SEGMENT' button (right sidebar) – all elements situated outside the RoI are now excluded from any further steps. Once defined, the RoI will be automatically transposed to all the book's scans. However, due to a wide array of factors, the page sections relevant to segmentation can shift from scan to scan. Therefore, as processing progresses, the user will likely need to adjust the RoI from time to time. To do so, simply click on any RoI section and shift it using the mouse. Independently of the RoI, the 'Create an ignore rectangle' option creates an 'ignore region' which allows for certain, small sections of a scan to be ignored and thus excluded from segmentation.
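The RoI behaviour can be pictured as a containment test on bounding boxes. The sketch below keeps only the elements lying entirely inside the region of interest; the `Box` type and the element names are hypothetical, and treating partially overlapping elements as excluded is a simplifying assumption rather than a description of how LAREX handles them.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def is_inside(inner: Box, outer: Box) -> bool:
    """True if `inner` lies completely within `outer` (shared edges allowed)."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


def apply_roi(elements: dict, roi: Box) -> dict:
    """Keep only the elements whose bounding box is entirely inside the RoI."""
    return {name: box for name, box in elements.items() if is_inside(box, roi)}


roi = Box(left=100, top=100, right=900, bottom=1400)
elements = {
    "main text": Box(150, 200, 850, 1300),
    "library stamp": Box(20, 30, 90, 90),      # outside the RoI
    "marginal note": Box(60, 500, 140, 560),   # straddles the RoI border
}
print(sorted(apply_roi(elements, roi)))
```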

Correcting layout recognition flaws: Incorrectly recognized layout elements can be assigned a new typification manually: a right-hand click on said element will open a pop-up menu from which you can choose the correct designation.

fig. 24. Correcting a faulty typification.

Should you need to separate a title from another textual element with which it is fused, there are three ways to proceed. First, you can draw a rectangle around the section you wish to classify: proceed to the toolbar, click on ‘Segment’ and select ‘Create a fixed segment rectangle’ (shortcut: 3); using the mouse, draw a rectangle around the relevant section – a pop-up menu will appear from which its correct designation/type can be chosen. Second, you can instead choose to use a polygon shape. This option is particularly suited to more complex or chaotic layouts and/or those comprising angled edges, rounded pictures and woodcuts, or ornate initials inside the text block. Proceed to the toolbar, click on ‘Segment’, this time selecting the ‘Create a fixed segment polygon’ option (shortcut: 4). Using the mouse, generate a dotted line to go around and encompass the entire relevant section – once the line’s end has been joined to its starting point, creating a polygon, the aforementioned pop-up menu will appear to allow for designation. Finally, you can also separate a text block – initially recognized as one paragraph – into a title and main text using a cutting line: simply go to the toolbar and ‘Segment’, and select the ‘Create a cut line’ option (shortcut: 5).

fig. 25. Toolbar: selecting cut line option.

Using left mouse key, create a line through the element you wish to separate, clicking along its path to adjust it as needed; end line with a double click.

fig. 26. Drawing a line between two layout elements to be separated.

Click on 'Segment' in order to prompt separation. Afterwards, title element can be correctly renamed, using right-hand click and pop-up menu (as shown below).

fig. 27. Correcting typification of separated sections.

If at any time you wish to delete layout components, inaccurate cutting lines or polygons, etc., simply click on the relevant element and use the ‘Delete’ key or the ‘Delete selected items’ option in the toolbar.

Determining 'Reading Order' (see below):

fig. 28. Determining reading order.

Saving current scan’s segmentation results: Save your segmentation results by clicking on ‘Save results’ or with Ctrl + S. This will automatically generate an XML file containing those results inside the OCR4all folder structure.

fig. 29. Saving segmentation results.

Afterwards, you can proceed to the next scan (left sidebar). If you wish to redo or change a scan’s segmentation, you can do as much at any time: simply save the new results – the previous XML file will be automatically deleted and replaced with a new version.

Additional processing options

OCR4all also provides the following scan processing options:

  • While deleting layout elements or joining separate ones to form one, single region, you can select all relevant elements simultaneously by pressing and holding 'Shift' key and drawing a rectangle around the entire region using your mouse. Relevant regions must be located entirely inside the rectangle. Once done, selected region will be surrounded by a blue frame.
  • 'Select contours to combine (with 'C') to segments (see function combine)' (shortcut: 6): this tool is perfect for reaching optimal segmentation results even when working with scans featuring a densely packed and detailed print layout. The basic idea is that layout elements only be delimited by the contours of the individual letters/pictures they are composed of, thus solving the problems created by manual segmentation such as excessively broad margins, which can in turn hamper the OCR performance. To use this feature, click on the relevant button (toolbar) or use shortcut 6. All components of the scan recognized as layout elements will be coloured blue.

Showing contours. fig. 30. Showing contours.

Select individual letters or even parts of letters by clicking on them.

Selecting contours. fig. 31. Selecting contours.

You can also apply your selection to an individual group of letters, entire words or text lines, or sections of a layout element (see above: ‘Shift’ + selection with rectangle). Use shortcut C after selection in order to include all selected items (be they letters, words, lines, etc.) in one new layout element, regardless of the layout region they previously belonged to. This new element’s edges will be far more precise than those of an automatically generated one, enabling a particularly accurate segmentation superior to that of standardised tools.

Aggregating selected items to create new element. fig. 32. Aggregating selected items to create new element.

Save new element by clicking on ‘Segment’. New element can be renamed as described above.

Typifying new layout element. fig. 33. Typifying new layout element.

  • 'Combine selected segments or contours' (shortcut: C): In order to combine several, distinct layout elements into a new element, select the entire region in question (see above) and click on corresponding button (toolbar) or use shortcut C.
  • 'Fix/unfix segments, for it to persist a new auto segmentation' (shortcut: F): This function enables you to keep an element in place across subsequent automatic segmentation rounds. Mark the element in question by clicking on it, then use shortcut F or the corresponding button in the toolbar. Fixed, i.e. pinned, elements will appear surrounded by a dotted line. If you wish to unfix an element, simply repeat the operation.
  • Zoom: Use mouse wheel to zoom in and out of display. Use space key to reset display to its original size.
  • When working with a very complex and intricate layout, targeted interventions can help increase the precision and quality of segmentation results. The contours of all layout elements (recognized as such) consist in fact of many individual lines, separated by dots.

Layout element contours. fig. 34. Layout element contours.

  • These tiny dots can be moved, individually or in groups, e.g. to avoid collision between different layout elements in a dense setting. Use a left click-and-hold to move a dot, click on the line to create a new dot, use 'delete' key to delete a dot.
  • Load results: a scan’s existing segmentation results will be sourced from OCR4all folder structure and directly loaded to LAREX.

Final steps with LAREX

Once a document’s entire segmentation has been completed with LAREX (i.e. once segmentation results have been saved for all pages), results can be found in the OCR4all folder structure. In order to make sure that results were correctly saved, simply go to menu item ‘post correction’, in the ‘segments’ bar (see below).

Line Segmentation

Input: pre-processed images and segmentation information (in the form of PAGE XML files)
Output: extracted text lines saved in those PAGE XML files

  • This step directly prepares the OCR process by splitting all previously defined and classified layout elements into separate text lines (this is a necessary step, as the OCR is based on line recognition). All results are then automatically saved in the corresponding PAGE XML files.

Line segmentation settings. fig. 35. Line segmentation settings.

  • Generally speaking, all existing settings can be retained. There are, however, a few restrictions when it comes to page layout: if you are working with pages featuring two or more text columns (and if those have been previously defined as separate, individual main text blocks in LAREX), you will need to change the ‘maximum # of whitespace column separators’ which is pre-set at -1.
    • 'Whitespace column separators' are the white columns devoid of text found around text blocks.
    • When working with a two-column layout whose text is continuous (i.e. where the first lines of both columns don’t form a semantic unit), you will need to set the ‘maximum # of whitespace column separators’ to 3. This number corresponds to the whitespace on both sides of the columns and to the whitespace situated between them.
    • When working with a three-column layout, set the 'whitespace' number to 4, and so on.
  • Once all desired settings are chosen, click on ‘execute’. Afterwards, check the generated results under ‘Project Overview’.
  • Using the ‘settings (advanced)’ option is especially useful when working with line segmentation, particularly if/when errors are reported (and shown on the interface). For instance, small letters will often fall short of the default minimum line width. You can adjust this minimum width by reducing the ‘minimum scale permitted’, which can be found under menu item ‘limits’. This will enable you to correctly redo the line segmentation.
  • You can generally check the accuracy of line segmentation by clicking on the ‘lines’ button (under menu item ‘post correction’).
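
The column rule described above boils down to simple arithmetic: one whitespace strip on each outer side of the text, plus one between each pair of neighbouring columns. The following Python sketch only illustrates that count; the function name is hypothetical and not part of OCR4all.

```python
def max_whitespace_column_separators(num_columns: int) -> int:
    """Return the separator count suggested above for a multi-column page.

    Counts the whitespace strip on the left, the one on the right, and one
    between each pair of neighbouring columns: num_columns + 1.
    """
    if num_columns < 2:
        # Single-column pages keep the pre-set value, so they are not covered here.
        raise ValueError("the rule only covers layouts with two or more columns")
    return num_columns + 1


print(max_whitespace_column_separators(2))  # two-column layout -> 3
print(max_whitespace_column_separators(3))  # three-column layout -> 4
```

The resulting value is what you would enter for ‘maximum # of whitespace column separators’ in the line segmentation settings.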

Recognition

Input: Text lines and one or more OCR models
Output: OCR-output in the form of text for each of the PAGE XML files at hand

  • This step is where the actual text recognition takes place based on the individual lines and textual layout elements identified during line segmentation (see above).
  • Select menu item 'Recognition': in the right sidebar, you will only find your document's scans (or rather printed pages) for which all OCR pre-processing steps have been completed, by which we mean all previously explained steps - bar 'noise removal'. Please select the scans for which you wish to produce an OCR text.
  • Go to 'line recognition models' (under 'available') and select all models or model packages relevant to the typographical recognition of your text (e.g. early modern/historical Gothic type, italic/cursive type, historical Antiqua etc.). We expressly advise the use of a model package, in which five models work simultaneously and interact with each other. This is much preferable to using only one model at a time. You can select all models you wish to add to your package by clicking on each of them; they will automatically be added to the 'selected' category. When dealing with a large number of models, you can find them by using the 'search' function.

Selection of model package for text recognition. fig. 36. Selection of model package for text recognition.

  • You likely won't need to adjust any of the advanced settings.
  • Click on 'execute' and oversee the text recognition progress on the console.
  • Once recognition is finished, you will be able to view all results under menu item 'ground truth production'.

Ground Truth Production

Input: text line images and their corresponding OCR output when available
Output: line based ground truth

  • Under menu item 'ground truth production' you will be able to view the texts generated during 'recognition', correct them and save them as training data. This is the so-called 'ground truth'.
  • The correction tool used in this step is divided into two parts. On the left-hand side are the (selectable) scans. In the middle, you will find the segmented text line images (see above for workflow) as well as their corresponding OCR text lines, placed directly underneath. We call this standard display 'text view'.

Ground truth production with 'text view'. fig. 37. Ground truth production with 'text view'.

Clicking on the 'Switch to page view' button will bring you to the so called 'page view' display, in which you can work on all text lines while they are displayed in relation to the entire page layout. By clicking 'switch to text view', you will return to the initial 'text view' display.

Ground truth production with 'page view'. fig. 38. Ground truth production with 'page view'.

  • On the right-hand side of the display, you will find the virtual keyboard, with which you can insert special characters such as ligatures, abbreviations, diacritical signs etc. Simply place your cursor where you wish to insert a special character and then click on said character in the virtual keyboard. In order to add new characters to the virtual keyboard, simply click on the plus icon, add the character through copy and paste in the blank field and click on 'save'. If you wish to delete characters from the virtual keyboard, drag and drop said character on the recycle bin icon. Once all necessary/desired changes have been made, click on 'save' and 'lock'. Using the 'load' and 'save' buttons will ultimately enable you to save different virtual keyboards specific to any particular document. Once a virtual keyboard has been saved as such, it can be re-loaded at any time, which is particularly useful when you need to interrupt correction, or if you want to use this keyboard for another document for which it is suited.
  • In order to correct individual lines in 'text view' mode, click on the line in question: you can now correct and edit it. (When working with 'page view', you will need to click on the line you wish to edit first, after which a text field will appear in which you can make your corrections/edits as well.) Use the 'tabulator' key to go to the next line, and so on. All following steps are identical in both views. Once a text line has been completely and satisfactorily corrected, press the 'enter' key. The line will be coloured green, meaning it will be automatically saved as 'ground truth' in OCR4all once the entire page has been completed and saved (by clicking on 'save result' or using shortcut Ctrl + S). Once a line has been identified as ground truth, it can be used as a basis for OCR training as well as a tool to evaluate the OCR model you used.
  • If your pairs of text line images and corresponding OCR text lines include erroneously recognised text line images, please leave the corresponding OCR text lines empty so as not to cause problems during OCR model training.
  • Were you to conclude, while working on ground truth production, that the quality of the text recognition achieved with mixed models wasn't satisfactory, you can always perform a final, manual text correction by employing a training model targeted towards the specific kind of document you are working on. Proceeding to this step will generally increase the recognition quality and percentage.

Evaluation

Input: line based OCR texts and corresponding ground truth
Output: error statistics

  • Under menu item 'evaluation', users can check on the recognition rate of the model(s) currently under use.

  • In order to generate an evaluation, go to right sidebar and select all the scans recognized with the help of said model and subsequently corrected during 'ground truth production'.

  • Click on 'execute': a chart will appear in the console. At the top, you will see the percentage of errors as well as the full count of errors ('errs'). All identified errors are listed underneath, displayed as a chart comparing the initially recognized text ('PRED', right-hand column) with the results of ground truth production ('GT', left-hand column). Next to each error item, you will see the frequency of that particular type of error as well as its percentage of the entire error count.

Evaluation results with general error rate, ten most frequent errors as well as their percentage
+compared to entire error count. fig. 39. Evaluation results with general error rate, ten most frequent errors as well as their percentage compared to entire error count.

  • Thanks to the spreadsheet and its display (100% - error rate), users can evaluate whether a new training using individual, targeted models is necessary.
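
To make the relationship between error count, error rate and recognition rate (100% - error rate) more concrete, here is a minimal Python sketch. It assumes that errors are counted as character-level edit operations between ground truth and prediction and normalised by the ground truth length; the evaluation built into OCR4all may count or normalise differently. The function names and the two sample line pairs are purely illustrative.

```python
def edit_distance(gt: str, pred: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(pred) + 1))
    for i, gt_char in enumerate(gt, start=1):
        current = [i]
        for j, pred_char in enumerate(pred, start=1):
            cost = 0 if gt_char == pred_char else 1
            current.append(min(
                previous[j] + 1,         # character missing in the prediction
                current[j - 1] + 1,      # extra character in the prediction
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(pairs) -> float:
    """Character error rate in percent over (ground truth, prediction) pairs."""
    errors = sum(edit_distance(gt, pred) for gt, pred in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    return 100.0 * errors / total if total else 0.0


# Hypothetical (GT, PRED) line pairs, each containing one substitution.
lines = [("Vnnd", "Vund"), ("ſein", "fein")]
rate = error_rate(lines)
print(f"error rate: {rate:.1f}%, recognition rate: {100 - rate:.1f}%")
```

With these two sample lines there are 2 errors over 8 ground truth characters, i.e. an error rate of 25% and a recognition rate of 75%.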

Training

Input: text line images with corresponding ground truth (optionally, existing OCR models can be included as well; these serve as so-called 'pre-training' and as the basis for model training)
Output: one or more OCR model(s)

The aim of our software is to produce a text containing as few errors as possible. In that case, why is it even necessary to use the training module and produce models targeted to your document, instead of simply correcting it manually? In fact, the better the recognition model, the shorter the correction time. The idea of continuous model training is to train increasingly better models through continuous corrections, which in turn will reduce the amount of corrections needed for the next pages, and so on.

  • With this training tool, users will be able to train models tailored to their document, based on the lines of ground truth available for this document. In order to begin training, please proceed to the following adjustments in general settings:
    • Set the 'Number of folds to train' (i.e. the number of models to train) to 5. → Training will occur with a model package containing five individual models.
    • 'Only train a single fold box': please don't fill out this box!
    • Set the 'Number of models to train in parallel' at -1. → All training models will be trained simultaneously.
    • If all characters contained in the pretraining model need to be kept in the model you wish to train (i.e. added to its so called whitelist), please check the 'Keep codec of the loaded model(s)' box.
    • In effect, the 'Whitelist characters to keep in the model' is the exhaustive list of characters used during training and in the subsequently generated model. Any character not contained in the whitelist won't be included in the process.
    • 'Pretraining': Either 'Train each model based on different existing models' (a menu will appear containing five dropdown lists. Inside each of them, enter one of the five models belonging to the model package used as advised earlier. Regardless of the training step (be it the first round or the third), always enter the five models used since the beginning) or 'Train all models based on one existing model' (click on this setting if you started training using only one model. Simply select that exact training model for each repetition of the training process).
    • 'Data augmentation': Please don't fill out this box! This function describes the data augmentation per line. Users can enter a number, e.g. 5, in order to increase the amount of training material. This can lead to the generation of better performing models. However, this process is more time-consuming than the standard route.
    • 'Skip retraining on real data only': Please don't fill out this box!
  • The advanced settings remain unchanged.

Settings for the training of document-specific models. fig. 40. Settings for the training of document-specific models.

  • Click on 'execute' to start training. You will be able to view the training progress at any time in the console. Training time will vary depending on the total amount of ground truth lines.
  • In accordance with the aforementioned settings, a model package (containing five individual models and tailored to your document's exact needs) will be generated through training and automatically saved in folder ocr4all/models/document title/0. Going forward, this model package will be labelled '0'. From this point on, while working on this document and striving towards improving recognition, you will be able to select said package under menu item 'recognition' among other models, when working with new pages from the same document. If you wish to generate a second document-specific model package (e.g. to improve the first one's weaknesses), simply repeat the process as described above. This new model will be labelled '1', and so on.
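
The 'Number of folds to train' setting above can be pictured with the following Python sketch. It assumes conventional cross-fold behaviour, in which the ground truth lines are split into five folds and each of the five models holds out a different fold for validation while training on the remaining four. How the training backend actually distributes lines is not described in this guide, so treat the split strategy, the function names and the line IDs as assumptions.

```python
def split_into_folds(lines, n_folds=5):
    """Distribute ground truth lines round-robin over n_folds folds."""
    return [lines[i::n_folds] for i in range(n_folds)]


def training_plans(lines, n_folds=5):
    """For each model, hold out one fold for validation and train on the rest."""
    folds = split_into_folds(lines, n_folds)
    plans = []
    for held_out in range(n_folds):
        train = [
            line
            for index, fold in enumerate(folds)
            if index != held_out
            for line in fold
        ]
        plans.append({"model": held_out, "train": train, "validate": folds[held_out]})
    return plans


ground_truth = [f"line_{i:03d}" for i in range(10)]  # hypothetical line IDs
for plan in training_plans(ground_truth):
    print(plan["model"], len(plan["train"]), len(plan["validate"]))
```

Under this assumption every ground truth line is used for training by four of the five models, which is one reason why a package of five models can be more robust than a single model trained on the same lines.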

Post Correction

Input: segmentation information and metadata on pre-processed scans, as well as the corresponding text
Output: corrected/improved segmentation info and text

Under menu item 'post correction', users will be able to manually adjust and correct all segmentation info and text generated through the course of the previous sub-modules. This sub-module is itself divided into three levels:

  • The item 'segment' (i.e. level 1) will enable you to adjust all regions determined during segmentation and their reading order, page after page. You will recognize a few of the tools from working with LAREX (see above). Please note that all changes undertaken at this level will have consequences for the following levels. For example, if you decide to delete a certain region during level 1, you will lose all text lines belonging to this region going forward.
  • The 'lines' item (i.e. level 2) enables you to manually adjust automatic line recognition. You will be able to add lines where there were none, to change their shape or position, or to delete them. The reading order can be adjusted as well, on a line basis.

Adjusting line-based reading order during post correction. fig. 41. Adjusting line-based reading order during post correction.

  • Under item 'text' (i.e. level 3), you will find the afore-described ground truth submodule, in which the text content of your lines can be corrected once more.

Result Generation

Input: line-based OCR results, ground truth (optional - only if at hand) and the LAREX-segmentation and line-segmentation data
Output: final text output (lines will be re-grouped into pages and full-text) as well as page based PAGE XML

Result Generation. fig. 42: Result Generation.

  • Once the user considers all recognition and correction steps to be finalized, results can be generated as TXT or XML files, saved under ocr4all/data/results.
  • You can choose whether you need a text or PAGE XML file under 'settings'. If you opt for a text file, individual TXT files will be generated for each scan as well as an additional one containing your document's entire text.
  • PAGE XML files are also generated on a per-page basis and additionally contain data about the creation date, the last changes to the file, metadata about each page's corresponding scan, the page's size, its layout regions and their exact coordinates, its reading order, and its text lines and their text content.
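
For readers who want to process the generated PAGE XML files further, the following Python sketch shows one way to list regions and their text lines using only the standard library. The element names (TextRegion, TextLine, TextEquiv, Unicode) come from the PAGE schema; the embedded sample document, its IDs and the helper names are made up for illustration. The sketch simply takes the first text entry of each line, which may not match how OCR4all distinguishes ground truth from OCR output.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample standing in for a generated PAGE XML file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r0" type="heading">
      <Coords points="10,10 900,10 900,100 10,100"/>
      <TextLine id="r0_l0">
        <Coords points="10,10 900,10 900,100 10,100"/>
        <TextEquiv><Unicode>Das erste Capitel</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r1" type="paragraph">
      <Coords points="10,120 900,120 900,300 10,300"/>
      <TextLine id="r1_l0">
        <Coords points="10,120 900,120 900,200 10,200"/>
        <TextEquiv><Unicode>Es war einmal</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag: str) -> str:
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def first_text(line) -> str:
    """Return the content of the first TextEquiv/Unicode child of a text line."""
    for equiv in line:
        if local(equiv.tag) != "TextEquiv":
            continue
        for child in equiv:
            if local(child.tag) == "Unicode":
                return child.text or ""
    return ""


def read_lines(xml_text: str):
    """Return (region id, region type, line id, text) tuples in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for region in root.iter():
        if local(region.tag) != "TextRegion":
            continue
        for line in region:
            if local(line.tag) == "TextLine":
                rows.append((region.get("id"), region.get("type"), line.get("id"), first_text(line)))
    return rows


for row in read_lines(SAMPLE):
    print(row)
```

To apply this to a real result file, read its content from disk and pass it to read_lines; note that document order is not necessarily the reading order stored in the file.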
+ \ No newline at end of file diff --git a/hashmap.json b/hashmap.json index 12457174..9e84bafa 100644 --- a/hashmap.json +++ b/hashmap.json @@ -1 +1 @@ -{"about_activities.md":"D4dF5vOZ","about_ocr4all.md":"KnWC4XWQ","about_projects.md":"Cd7nA-A5","about_team.md":"DJARzRrq","guide_setup-guide_linux.md":"Dw_nI6Bl","guide_setup-guide_macos.md":"3kOmZ-Mw","guide_setup-guide_quickstart.md":"DYx-lcYS","guide_setup-guide_windows.md":"DWUfcSce","guide_user-guide_common-errors.md":"CnzWhStX","guide_user-guide_introduction.md":"C5p2P9wk","guide_user-guide_project-start-and-overview.md":"D1LNg9Eo","guide_user-guide_scan-preparation.md":"BsXxbD_4","guide_user-guide_setup-and-folder-structure.md":"DNL9jmhv","guide_user-guide_workflow.md":"BxdBwB_I","index.md":"BMIi4gsl"} +{"about_activities.md":"D1i_SJGN","about_ocr4all.md":"OICOy45v","about_projects.md":"BN7d1s9v","about_team.md":"S7gO9Vzt","beta_index.md":"CMmb8dlA","beta_introduction.md":"CVgK8JtF","beta_setup.md":"CyQ7iTlu","guide_setup-guide_linux.md":"BkF1FTXB","guide_setup-guide_macos.md":"CZq3J_KB","guide_setup-guide_quickstart.md":"Bl7ugQhE","guide_setup-guide_windows.md":"UyjaupDu","guide_user-guide_common-errors.md":"BlhiEKoZ","guide_user-guide_introduction.md":"Cg1IJE4k","guide_user-guide_project-start-and-overview.md":"Biix-mp8","guide_user-guide_scan-preparation.md":"DpWU4ver","guide_user-guide_setup-and-folder-structure.md":"Ci0ktCk8","guide_user-guide_workflow.md":"CNo_DpnQ","index.md":"BC2x8Von"} diff --git a/index.html b/index.html index f5ec8273..220e129f 100644 --- a/index.html +++ b/index.html @@ -12,13 +12,13 @@ - + -
Skip to content

OCR4all

Optical Character Recognition (and more) for everyone

Setup guide, user guide, developer documentation and more.

OCR4all
- +
Skip to content

OCR4all

Optical Character Recognition (and more) for everyone

Setup guide, user guide, developer documentation and more.

OCR4all
+ \ No newline at end of file