Integration of ALTO metadata #12
Comments
Or the tool needs a new name? Let's discuss what makes sense, e.g. libraries for parsing vs. integration. Some initial ideas on what would be relevant in ALTO:
|
The other thing we should consider: ALTO and images (+ their metadata) concern document pages, not documents. So this is either
|
I am leaning towards
as only this would allow us the most granular analysis down to page level (e.g. which pages are outliers concerning {insert feature here}), but again, let's discuss this further with the team. |
I'm working on an "altotool", using the same techniques we use in modstool to stuff interesting stuff into a pandas DataFrame. This will be indexed by page, so any analysis to be done document-wise needs to aggregate this in a meaningful way (e.g. merge processing info, build sums of line counts etc.). For some of this, some aggregation needs to be done over the page already, e.g. mean/median OCR confidence etc. (not that I think this info will be particularly useful, but we shall see.) |
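The page-to-document aggregation described above could look roughly like the following. This is only a sketch under assumptions: the column names (`document_id`, `page`, `n_lines`, `mean_wc`) and the sample values are made up for illustration and are not taken from altotool.

```python
import pandas as pd

# Hypothetical page-level table: one row per ALTO page.
# All column names and values here are illustrative assumptions.
pages = pd.DataFrame(
    {
        "document_id": ["doc-a", "doc-a", "doc-a", "doc-b", "doc-b"],
        "page": [1, 2, 3, 1, 2],
        "n_lines": [30, 28, 2, 40, 42],
        "mean_wc": [0.90, 0.80, 0.40, 0.95, 0.85],
    }
)

# Aggregate to one row per document: count the pages, sum the line counts,
# and average the per-page mean confidence.
documents = pages.groupby("document_id").agg(
    n_pages=("page", "count"),
    total_lines=("n_lines", "sum"),
    mean_page_wc=("mean_wc", "mean"),
)
print(documents)
```

Note that the mean of per-page means weights every page equally, regardless of how many words it has. If a word-weighted document mean is wanted, the per-page word count would have to be carried along as well.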
@cneud Please clarify this. |
There are and have been many use cases for info extracted from ALTO, which led me to work on https://github.com/cneud/alto-tools (feel free to reuse whatever you can).
Think, e.g., of which pages within a document certain elements occur on vs. other pages |
When certain pages within a document are outliers wrt the confidence scores, this would be useful to identify and investigate, for example |
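One simple way to flag such outlier pages, sketched here with made-up numbers: compare each page's mean confidence with the document median and report pages that deviate by more than some threshold. The function name `outlier_pages` and the threshold of 0.2 are arbitrary assumptions, not something agreed on in this thread.

```python
from statistics import median

def outlier_pages(page_conf, threshold=0.2):
    """Return the page numbers whose confidence deviates from the
    document median by more than `threshold` (an arbitrary cut-off)."""
    med = median(page_conf.values())
    return [page for page, conf in sorted(page_conf.items())
            if abs(conf - med) > threshold]

# Hypothetical per-page mean OCR confidence for one document.
page_conf = {1: 0.91, 2: 0.89, 3: 0.45, 4: 0.93, 5: 0.90}
flagged = outlier_pages(page_conf)
print(flagged)
```

The median is used instead of the mean so that a single very bad page does not shift the reference value it is compared against.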
And yeah it needs a name. While I am happy with the innovative name of "modstool", with ALTO functionality it's a bit different |
A first version (branch
This includes some counts of elements ( A bit of a stumbling block is the diversity of ALTO variants we have, so I am going to rework this not to use a fixed XML namespace. |
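A minimal sketch of what namespace-independent counting could look like, assuming plain `xml.etree.ElementTree` and comparing only the local part of each tag. The helper names `local_name` and `count_elements` and the tiny sample documents are illustrative; this is not necessarily how the rework will be done.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def count_elements(alto_xml, names=("TextBlock", "TextLine", "String")):
    """Count the given elements, whatever ALTO namespace the file uses."""
    root = ET.fromstring(alto_xml)
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {name: counts[name] for name in names}

# Two made-up sample pages that differ only in their ALTO namespace.
ALTO_V2 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock><TextLine>
      <String CONTENT="Hello"/><String CONTENT="world"/>
    </TextLine></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""
ALTO_V4 = ALTO_V2.replace("ns-v2#", "ns-v4#")

print(count_elements(ALTO_V2))
print(count_elements(ALTO_V4))
```

Matching on the local name avoids keeping a list of known namespace URIs, at the cost of also matching same-named elements from unrelated namespaces, should a file contain any.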
From the first runs I estimate about 48h to run this over all of our (5 million?) ALTO files, which is fine with me. |
Latest version in the feature branch now includes descriptive statistics on the word OCR confidence (
|
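For illustration, descriptive statistics on word confidence could be computed per page roughly as follows. The sketch assumes the confidence is read from the `WC` attribute that the ALTO schema defines on `String` elements; the function name `wc_stats`, the returned keys and the sample page are made up and do not reflect the actual column names in the feature branch.

```python
import statistics
import xml.etree.ElementTree as ET

def wc_stats(alto_xml):
    """Descriptive statistics over the WC attribute of all String elements.

    Returns None if the page has no String with a WC attribute.
    """
    root = ET.fromstring(alto_xml)
    values = [
        float(el.get("WC"))
        for el in root.iter()
        # Compare the local tag name only, so any ALTO namespace works.
        if el.tag.rsplit("}", 1)[-1] == "String" and el.get("WC") is not None
    ]
    if not values:
        return None
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }

# Made-up sample page; the last String has no WC and is skipped.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="a" WC="0.5"/>
    <String CONTENT="b" WC="0.75"/>
    <String CONTENT="c" WC="1.0"/>
    <String CONTENT="d"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

stats = wc_stats(SAMPLE)
print(stats)
```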
Latest version now includes the column Examples from my test data:
|
NER annotated ALTO at SBB looks like this: There's an
|
Latest master now counts the above NEs in |
We now count all Strings with TAGREFS in |
TAGREFS is also used in some ALTO files to reference However, I don't think it's currently worth the effort to check if the |
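To make the distinction concrete, here is a sketch that counts `String` elements carrying a `TAGREFS` attribute and, separately, those whose references resolve to a `NamedEntityTag` in the `Tags` section. The function name, the result keys and the sample file are assumptions for illustration only; the tool itself currently counts all Strings with TAGREFS, as stated above.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def count_tagrefs(alto_xml):
    root = ET.fromstring(alto_xml)
    # Map each tag ID to its element name (NamedEntityTag, LayoutTag, ...).
    tag_kinds = {}
    for el in root.iter():
        if local_name(el.tag) == "Tags":
            for tag in el:
                tag_kinds[tag.get("ID")] = local_name(tag.tag)
    with_refs = with_ne = 0
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        # TAGREFS may hold several whitespace-separated IDs.
        refs = (el.get("TAGREFS") or "").split()
        if refs:
            with_refs += 1
            if any(tag_kinds.get(ref) == "NamedEntityTag" for ref in refs):
                with_ne += 1
    return {"strings_with_tagrefs": with_refs, "strings_with_ne": with_ne}

# Made-up sample, not actual SBB data.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <NamedEntityTag ID="ne-1" LABEL="Berlin" TYPE="LOC"/>
    <LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
  </Tags>
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Berlin" TAGREFS="ne-1"/>
    <String CONTENT="note" TAGREFS="layouttag-marginalia"/>
    <String CONTENT="plain"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

result = count_tagrefs(SAMPLE)
print(result)
```

The extra lookup needs one additional pass over the `Tags` section per file, which is the effort being weighed in the comment above.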
Language attributes are
|
Moved this to #18. |
I've decided to ignore this for now. |
Should this be in here? Or in "codename altotool"?
What info would be relevant? What would be metadata, what would be data (count words?)
Include metadata from the Description section
Include descriptive statistics for the Layout section etc.
When that's done, review the comments below for things we may have missed
Test using all available versions of ALTO
NER annotated ALTO should at least be identifiable
Include ALTO version/namespace
<LayoutTag ID="layouttag-marginalia" LABEL="marginalia"/>
Any language info?
Update README that we now support ALTO
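For the "Include ALTO version/namespace" item, one possible approach is to read the namespace URI straight from the root element's tag, since ElementTree stores it as `{uri}alto`. This is a sketch only; the function name `alto_namespace` is made up.

```python
import xml.etree.ElementTree as ET

def alto_namespace(alto_xml):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(alto_xml)
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None

ns = alto_namespace(
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout/></alto>'
)
print(ns)
```

Storing the full URI rather than a parsed version number keeps non-standard namespace variants distinguishable in later analysis.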