Hyphenated words #22

Open

jbarth-ubhd opened this issue Mar 22, 2024 · 4 comments

@jbarth-ubhd

Dear reader,
does keraslm-rate take hyphenated words into account?

Using this demo file https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

It seems that many of the low-rated words have hyphens:

With hyphenation:

# median: 0.962098 0.622701 ; mean: 0.948695 0.625144, correlation: 0.315179
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.931271 0.489822  # region0002_line0000_word0005 pas-
0.928169 0.491138  # region0000_line0004_word0007 sozia-
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494978  # region0003_line0001_word0005 Lyon,
0.926153 0.495819  # region0003_line0000_word0004 Kon-
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496326  # region0002_line0001_word0004 Rousseaus
0.967390 0.496934  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498529  # region0002_line0017_word0006 Lyon
0.910209 0.499826  # region0002_line0018_word0002 Instinktiv
...

Without hyphenation (manually removed):

# median: 0.962198 0.623943 ; mean: 0.949162 0.628181, correlation: 0.278264
# OCR-D-OCRNOHYP OCR-D-KERNOHYP
0.693236 0.411037  # region0002_line0021_word0003 daf3
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494945  # region0003_line0001_word0005 Lyon,
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496306  # region0002_line0001_word0004 Rousseaus
0.967390 0.496923  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498542  # region0002_line0017_word0006 Lyon
0.910209 0.499822  # region0002_line0018_word0002 Instinktiv
...
@jbarth-ubhd (Author)

keras.csv     :0.927003 0.468318  # region0002_line0029_word0006 Rä-
kerasNOHYP.csv:0.927003 0.573203  # region0002_line0029_word0006 Rädelsführer

@bertsky (Collaborator) commented Mar 22, 2024

No, hyphens are not treated specially in any way. If you are using model_dta_full.h5, that LM was trained on the plaintext edition of the Deutsches Textarchiv Kernkorpus und Ergänzungstexte, which does preserve the original type alignment (Zeilenfall, i.e. line breaks), so the model has "seen" hyphens and newlines. However, these texts are very diverse – some contain almost no hyphens, while others make heavy use of them. So I am not sure how well the model really learned to abstract over line breaks as a general possibility.

I have not specifically and methodically measured the impact of hyphenation myself, as you have. So thanks for the analysis!

In light of this, perhaps the model should indeed be applied with a dedicated rule: if there is a hyphen-like character at the end of a line, then

  • in explicit state-transfer mode (for example with alternative decoding): keep the LM state right before the hyphen and continue with it after the line break
  • in linear mode: remove the hyphen (rating it with a fixed score) and the newline character, and perhaps insert a newline after the token (ignoring its probability output) – see the sketch below
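For the linear-mode variant, a minimal preprocessing sketch could look like this (the helper name, the regex, and the set of hyphen-like characters are my own illustrative assumptions – keraslm-rate does not currently ship such a function):

```python
import re

# Hypothetical helper (not part of keraslm-rate): implements the linear-mode
# rule above by joining words that were broken across lines before the text
# is fed to the LM. Hyphen-like characters covered here are the ASCII
# hyphen-minus, the soft hyphen, and the ¬ sign used as a hyphen in some
# DTA transcriptions.
HYPHEN_BREAK = re.compile(r'[-\u00AD\u00AC]\n')

def dehyphenate(text: str) -> str:
    """Remove an end-of-line hyphen together with the newline that follows,
    e.g. 'pas-\\nsive' -> 'passive'."""
    return HYPHEN_BREAK.sub('', text)

print(dehyphenate("der pas-\nsive Widerstand in Lyon"))
# der passive Widerstand in Lyon
```

Rating the removed hyphen with a fixed score (and re-inserting a newline after the joined token) would then happen on top of this step, outside the sketch.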

@bertsky (Collaborator) commented Mar 22, 2024

...and if inference mode does it this way, then training should explicitly mask all hyphens on the input side, too.

I am not even sure whether I should retain line breaks (newline characters) as such.

@bertsky (Collaborator) commented Apr 4, 2024

Note: meanwhile, I found out that the plaintext version of the DTA produced via dta-tools tei2txt has more problems for our use case:

  • it still contains marginals, footnotes and endnotes (tei:note)
  • catchwords, print signatures, page numbers and running headers (tei:fw) should all be removed, but in past versions this interfered with line breaks
  • line breaks with line identifiers (tei:lb/@n), esp. in poems, used to be printed verbatim
  • it sometimes still contains line breaks with ¬ as the hyphen (the dehyphenation rule does not have 100% coverage)
  • formulae get printed as [FORMEL]
  • it also contains title pages, indexes, tables and figures
  • tab stops are used in replacement rules

Moreover, there are problems with the textual quality of the DTA extended set (Ergänzungstexte):

  • lots of mathematical symbols (often occurring in only one document) because of missing tei:formula markup
  • lots of musical symbols, box-drawing characters and Canadian syllabics – too sparse to learn
  • accidental use of similar-looking glyphs
  • inconsistent use of punctuation symbols (esp. quotes, dashes, brackets, indexes)
  • some use of inverted characters to represent the respective printing errors
  • some use of the Fraktur hyphen although the transcription guidelines require the standard hyphen-minus
  • some use of Fraktur consonant ligatures despite the guidelines
  • rare occurrences of the byte-order mark, the object replacement character and the soft hyphen
  • generally, a long tail of rare symbols which would spoil the LM due to sparseness
  • use of _ as a gap character (which training did not take into account until now)

So I decided to do my own plaintext export from the TEI version, which solves all that – and takes another bold step: it now uses Unicode NFKD normalization, because precomposed characters are much harder to learn (being sparse), esp. with such productive combinations as in polytonic Greek.
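To illustrate the effect of NFKD with the plain Python stdlib (the example character is arbitrary; the actual preprocessing of the retrained model may differ in detail):

```python
import unicodedata

# NFKD decomposes precomposed characters into base letters plus combining
# marks, so the model only has to learn a small set of frequent combining
# codepoints instead of a long tail of rare precomposed variants.
word = "\u1F67"  # ὧ = GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
decomposed = unicodedata.normalize("NFKD", word)
print(len(word), len(decomposed))                 # 1 3
print([unicodedata.name(c) for c in decomposed])
# ['GREEK SMALL LETTER OMEGA', 'COMBINING REVERSED COMMA ABOVE',
#  'COMBINING GREEK PERISPOMENI']
```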

I will make some further code modifications on the training side (coverage of an explicit gap codepoint in the input, a cutoff frequency for implicit gaps in the input) and on the inference side (removal of normal line breaks and dehyphenation, application of NFKD and other string normalization, persisting the string preprocessor in the model config), and then retrain the DTA model.
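One way the string preprocessor could be persisted with the model (keys, file name and layout are purely illustrative – the actual keraslm config format may look different):

```python
import json
import re
import unicodedata

# Purely illustrative: bundle the string-preprocessing choices with the model
# so that inference applies exactly the same normalization as training.
PREPROC_CONFIG = {
    "unicode_normalization": "NFKD",
    "dehyphenate": True,      # drop end-of-line hyphen plus newline
    "strip_newlines": True,   # map remaining line breaks to spaces
    "gap_codepoint": "_",     # explicit gap character used in the training data
}

def preprocess(text: str, cfg: dict = PREPROC_CONFIG) -> str:
    if cfg["dehyphenate"]:
        text = re.sub(r'[-\u00AD\u00AC]\n', '', text)
    if cfg["strip_newlines"]:
        text = text.replace('\n', ' ')
    return unicodedata.normalize(cfg["unicode_normalization"], text)

# hypothetical file name: persist next to the .h5 weights so that the rating
# step can reload and apply the identical preprocessing
with open("model_dta_full.preproc.json", "w", encoding="utf-8") as f:
    json.dump(PREPROC_CONFIG, f, ensure_ascii=False, indent=2)
```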

Until then, the issue will stay open. If you have additional ideas, please comment.
