Integration of ALTO metadata #12

mikegerber · 2022-04-07T16:42:45Z

cneud · 2022-04-07T16:46:00Z

Or the tool needs a new name? Let's discuss what makes sense, e.g. libraries for parsing vs. integration

Some initial ideas on what would be relevant in ALTO:

* Processing provenance (which software was used to produce ALTO)
* OCR confidences
* count of words, lines
* presence of particular elements (e.g. graphical elements)
* distribution of ALTO elements over a document

mikegerber · 2022-04-07T16:50:53Z

The other thing we should consider: ALTO and images (+ their metadata) concern document pages, not documents. So this is either

* to be aggregated over the whole document
* or treated as a separate data source (indexed by document + page)
* or both

cneud · 2022-04-07T16:54:59Z

I am leaning towards

> treated as a separate data source (indexed by document + page)

as only this would allow us the most granular analysis down to page level (e.g. which pages are outliers concerning {insert feature here}), but again, let's discuss this further with the team.

mikegerber · 2022-04-08T15:34:58Z

I'm working on an "altotool", using the same techniques we use in modstool to stuff interesting stuff into a pandas DataFrame. This will be indexed by page, so any analysis to be done document-wise needs to aggregate this in a meaningful way (e.g. merge processing info, build sums of line counts etc.)

For some of this stuff, some aggregation over the page already needs to be done, e.g. mean/median OCR confidence etc. (not that I think this info will be particularly useful, but we shall see.)
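A minimal sketch of what such a document-wise aggregation could look like, assuming a page-indexed DataFrame. The column names (`software`, `line_count`, `string_count`) and all values are hypothetical and only illustrate the idea of summing counts and merging processing info per document:

```python
import pandas as pd

# Hypothetical page-level table: one row per ALTO file (i.e. per page).
pages = pd.DataFrame(
    {
        "document": ["doc_a", "doc_a", "doc_b"],
        "page": [1, 2, 1],
        "software": ["Engine X", "Engine X", "Engine Y"],
        "line_count": [40, 38, 12],
        "string_count": [386, 350, 97],
    }
).set_index(["document", "page"])

# Aggregate to one row per document.
per_document = pages.groupby(level="document").agg(
    pages=("line_count", "size"),          # number of pages seen
    line_count=("line_count", "sum"),      # sums of line counts
    string_count=("string_count", "sum"),
    # Merge processing info: keep every distinct software name once.
    software=("software", lambda s: ", ".join(sorted(s.unique()))),
)
print(per_document)
```

Keeping the page-level table as the primary data and deriving the document view on demand means no page-level detail is lost.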

mikegerber · 2022-04-08T15:36:40Z

> * distribution of ALTO elements over a document

@cneud Please clarify this.

cneud · 2022-04-08T15:40:21Z

There are and have been many use cases for info extracted from ALTO which led me to work on https://github.com/cneud/alto-tools (feel free to reuse what can be).

> Please clarify this.

Think, e.g., of which pages within a document contain certain elements and which pages do not.

cneud · 2022-04-08T15:44:30Z

> mean/median OCR confidence

For example, when certain pages within a document are outliers wrt the confidence scores, this would be useful to identify and investigate.
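One possible way to find such pages, sketched under assumptions: the per-page mean confidence lives in a hypothetical column `wc_mean`, the data is invented, and the interquartile-range rule is just one common outlier heuristic, not something the team has agreed on:

```python
import pandas as pd

# Hypothetical per-page mean word confidence, indexed by document + page.
conf = pd.DataFrame(
    {
        "document": ["doc_a"] * 5,
        "page": [1, 2, 3, 4, 5],
        "wc_mean": [0.82, 0.80, 0.81, 0.35, 0.79],
    }
).set_index(["document", "page"])


def flag_low_confidence_pages(df, column="wc_mean", k=1.5):
    """Return a boolean Series: True where a page lies below
    Q1 - k * IQR of the pages of its own document."""
    grouped = df.groupby(level="document")[column]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    return df[column] < q1 - k * (q3 - q1)


outliers = conf[flag_low_confidence_pages(conf)]
print(outliers)
```

Because the thresholds are computed per document, a document with uniformly low confidence does not flag all of its pages; only pages that deviate from their own document stand out.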

mikegerber · 2022-04-11T16:47:53Z

And yeah, it needs a name. While I am happy with the innovative name of "modstool", with ALTO functionality it's a bit different.

mikegerber · 2022-05-10T11:30:55Z

A first version (branch feat/alto) extracts some of this information, e.g.:

Description_MeasurementUnit                                                                              pixel                  
Description_OCRProcessing_ocrProcessingStep0_processingDateTime                                     2016-08-07
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareCreator                          ABBYY
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareName           ABBYY FineReader Engine
Description_OCRProcessing_ocrProcessingStep0_processingSoftware_softwareVersion                             11
Layout_Page_ID                                                                                           Page1
Layout_Page_PHYSICAL_IMG_NR                                                                                  1
Layout_Page_HEIGHT                                                                                        2436
Layout_Page_WIDTH                                                                                         1404
Layout_Page_Page-count                                                                                       1
Layout_Page_TopMargin-count                                                                                  1
Layout_Page_LeftMargin-count                                                                                 1
Layout_Page_RightMargin-count                                                                                1
Layout_Page_BottomMargin-count                                                                               1
Layout_Page_PrintSpace-count                                                                                 1
Layout_Page_TextBlock-count                                                                                  1
Layout_Page_Shape-count                                                                                      1
Layout_Page_Polygon-count                                                                                    1
Layout_Page_TextLine-count                                                                                  40
Layout_Page_String-count                                                                                   386
Layout_Page_SP-count                                                                                       345
Layout_Page_HYP-count                                                                                        8
alto_file                                                                          alto/734008031/00000035.xml
Layout_Page_GraphicalElement-count                                                                         NaN
Layout_Page_Illustration-count                                                                             NaN
Layout_Page_ComposedBlock-count                                                                            NaN

This includes some counts of elements (*-count) and also selected attribute values (e.g. Layout_Page_HEIGHT), more to come.

A bit of a stumbling block is the diversity of ALTO variants we have, so I am going to rework this not to use a fixed XML namespace.
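A minimal sketch of one way to avoid a fixed namespace, assuming only the Python standard library: read the namespace from the root element and count elements by their local name. The helper names and the tiny sample document are made up for illustration and are not taken from the feature branch:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def split_tag(tag):
    """Split an ElementTree tag '{namespace}local' into (namespace, local)."""
    if tag.startswith("{"):
        namespace, _, local = tag[1:].partition("}")
        return namespace, local
    return "", tag


def count_elements(xml_text):
    """Return the root element's namespace and element counts by local name."""
    root = ET.fromstring(xml_text)
    namespace, _ = split_tag(root.tag)
    counts = Counter(split_tag(el.tag)[1] for el in root.iter())
    return namespace, counts


# Invented, heavily trimmed ALTO-like sample.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="Page1"><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hofes"/><SP/><String CONTENT="Pentling"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

namespace, counts = count_elements(SAMPLE)
print(namespace, dict(counts))
```

The same function works unchanged for files using other ALTO namespace URIs, since nothing in it hard-codes one.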

mikegerber · 2022-05-10T11:44:37Z

From the first runs I estimate about 48h to run this over all of our (5 million?) ALTO files, which is fine with me.

mikegerber · 2022-05-23T17:50:46Z

Latest version in the feature branch now includes descriptive statistics on the word OCR confidence (//alto:String/@WC as an XPath expression):

Layout_Page_//alto:String/@WC-mean                                                                    0.639988
Layout_Page_//alto:String/@WC-median                                                                    0.6355
Layout_Page_//alto:String/@WC-std                                                                     0.137451
Layout_Page_//alto:String/@WC-min                                                                         0.22
Layout_Page_//alto:String/@WC-max                                                                            1
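As an illustration of how such per-page statistics can be derived, here is a standard-library sketch. It assumes Python 3.8+ (for the `{*}` namespace wildcard in ElementTree paths); the function name and the sample values are hypothetical. `statistics.stdev` is the sample standard deviation, which is also what pandas' `Series.std` computes by default:

```python
import statistics
import xml.etree.ElementTree as ET


def word_confidence_stats(xml_text):
    """Descriptive statistics over all String/@WC values of one page."""
    root = ET.fromstring(xml_text)
    # "{*}String" matches String elements in any (or no) namespace.
    values = [
        float(el.get("WC"))
        for el in root.iterfind(".//{*}String")
        if el.get("WC") is not None
    ]
    if not values:
        return None  # page without word confidences
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
    }


# Invented sample page with three words.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="a" WC="0.5"/><SP/>
    <String CONTENT="b" WC="0.7"/><SP/>
    <String CONTENT="c" WC="0.9"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

stats = word_confidence_stats(SAMPLE)
print(stats)
```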

mikegerber · 2022-06-08T16:26:59Z

Latest version now includes the column alto_xmlns, which is/translates to the ALTO version used.

Examples from my test data:

alto/PPN636777308/00000002.xml             http://schema.ccs-gmbh.com/ALTO
alto/734008031/00000020.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000054.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000098.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/734008031/00000106.xml       http://www.loc.gov/standards/alto/ns-v2#
                                                    ...                   
alto/749782137/00000554.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000252.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000004.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/749782137/00000849.xml       http://www.loc.gov/standards/alto/ns-v2#
alto/weird-ns/00000007.xml              http://www.loc.gov/standards/alto/
Name: alto_xmlns, Length: 1314, dtype: object

mikegerber · 2022-06-15T15:31:07Z

NER annotated ALTO at SBB looks like this:

There's an alto:Tags tag that contains the entities (ns0 being ALTO here):

  <ns0:Tags>
    <ns0:NamedEntityTag ID="PER0" LABEL="Pentlings"/>
    <ns0:NamedEntityTag ID="LOC1" LABEL="Pentling"/>
    <ns0:NamedEntityTag ID="LOC2" LABEL="Hamm"/>
    <ns0:NamedEntityTag ID="PER4" LABEL="Hofes Pentling"/>
    <ns0:NamedEntityTag ID="LOC5" LABEL="Hofs Pentling"/>
    <ns0:NamedEntityTag ID="LOC7" LABEL="Hilbeck"/>
    <ns0:NamedEntityTag ID="PER8" LABEL="Hoff"/>
    <ns0:NamedEntityTag ID="PER9" LABEL="L i b e r"/>
    <ns0:NamedEntityTag ID="PER10" LABEL="Jhesu Christi"/>
  </ns0:Tags>

alto:Strings then reference these:

            <ns0:String CONTENT="Hofes" HEIGHT="33" HPOS="914" TAGREFS="PER4" VPOS="1396" WC="0.5019999743" WIDTH="82"/>
            <ns0:SP HPOS="997" VPOS="1398" WIDTH="21"/>
            <ns0:String CONTENT="Pentling" HEIGHT="34" HPOS="1019" TAGREFS="PER4" VPOS="1398" WC="0.5337499976" WIDTH="129"/>
            <ns0:SP HPOS="1149" VPOS="1407" WIDTH="19"/>
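The reference mechanism above can be followed in code roughly like this. This is a sketch only: the function name is hypothetical, the sample is a trimmed and partly invented variant of the snippets above, and it relies on `TAGREFS` being a whitespace-separated list of IDs:

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the token "zu" and the layout nesting are invented.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Tags>
    <NamedEntityTag ID="PER4" LABEL="Hofes Pentling"/>
    <NamedEntityTag ID="LOC7" LABEL="Hilbeck"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hofes" TAGREFS="PER4"/>
    <SP/>
    <String CONTENT="Pentling" TAGREFS="PER4"/>
    <SP/>
    <String CONTENT="zu"/>
    <SP/>
    <String CONTENT="Hilbeck" TAGREFS="LOC7"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def named_entities(xml_text):
    """Return (tag ID, label, referencing tokens) for each referenced entity."""
    root = ET.fromstring(xml_text)
    # Map tag IDs to labels, e.g. "PER4" -> "Hofes Pentling".
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.iterfind(".//{*}NamedEntityTag")
    }
    tokens = {}
    for string in root.iterfind(".//{*}String"):
        for ref in string.get("TAGREFS", "").split():
            if ref in labels:  # skip references to other kinds of tags
                tokens.setdefault(ref, []).append(string.get("CONTENT"))
    return [(ref, labels[ref], toks) for ref, toks in tokens.items()]


entities = named_entities(SAMPLE)
print(entities)
```

Grouping by tag ID keeps the Strings that belong to one entity together, so "Hofes" and "Pentling" end up under the single entity `PER4`.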

mikegerber · 2022-06-17T15:34:32Z

Latest master now counts the above NEs in Tags_NamedEntityTag-count.

mikegerber · 2022-06-17T16:01:05Z

We now count all Strings with TAGREFS in Layout_Page_//alto:String[@TAGREFS]-count (the weird naming comes from the XPath expression used). Some tagged entities span multiple String elements; I am not sure whether or what to do about that.

mikegerber · 2022-06-17T16:04:45Z

TAGREFS is also used in some ALTO files to reference LayoutTags in TextBlock elements (not Strings). So technically these counts could include references to tags that are not NamedEntityTags.

However, I don't think it's currently worth the effort to check whether the TAGREFS actually reference NEs, and I would just leave it this way until we need this check. @labusch @cneud Opinions?
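Should the check become necessary later, it could look roughly like the following sketch, which resolves every `TAGREFS` entry against the children of `Tags` and counts references by the kind of tag they point to. The function names and the sample document are hypothetical:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_tagrefs_by_kind(xml_text):
    """Count TAGREFS entries per (referencing element, referenced tag kind)."""
    root = ET.fromstring(xml_text)
    kind_by_id = {}
    for tags in root.iterfind(".//{*}Tags"):
        for tag in tags:  # e.g. NamedEntityTag, LayoutTag
            kind_by_id[tag.get("ID")] = local_name(tag.tag)
    counts = Counter()
    for el in root.iter():
        for ref in el.get("TAGREFS", "").split():
            kind = kind_by_id.get(ref, "unresolved")
            counts[(local_name(el.tag), kind)] += 1
    return counts


# Invented sample mixing a layout tag and a named entity tag.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Tags>
    <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
    <NamedEntityTag ID="PER4" LABEL="Hofes Pentling"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="layouttag-marginalia"><TextLine>
      <String CONTENT="Hofes" TAGREFS="PER4"/><SP/>
      <String CONTENT="Pentling" TAGREFS="PER4"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

counts = count_tagrefs_by_kind(SAMPLE)
print(dict(counts))
```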

mikegerber · 2022-06-21T10:49:40Z

Language attributes are LANG and the deprecated language:

* https://www.loc.gov/standards/alto/v4/alto-4-3.xsd
* In my test data, only the deprecated `TextBlock/@language` attribute is used

mikegerber · 2022-06-21T11:00:05Z

Moved this to #18.

mikegerber · 2022-06-21T11:07:19Z

<LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>

I've decided to ignore this for now.

mikegerber self-assigned this Apr 11, 2022

mikegerber changed the title ~~Consider integration of ALTO metadata~~ Integration of ALTO metadata Apr 11, 2022

mikegerber mentioned this issue May 24, 2022

Better name for altotool #15

mikegerber closed this as completed Jun 21, 2022
