Hyphenated words #22
Dear reader,

does keraslm-rate take hyphenated words into account? Using this demo file: https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

It seems that many of the low-rated words have hyphens, comparing the ratings with hyphenation against the ratings without (i.e. with manually removed) hyphenation.
No, hyphens are not treated specially in any way. If you are using model_dta_full.h5, that LM was trained on the Deutsches Textarchiv Kernkorpus und Ergänzungstexte plaintext edition, which does retain the original type alignment (Zeilenfall, i.e. line breaks), so the model has "seen" hyphens and newlines. However, these texts are very diverse – some contain nearly no hyphens, others use them heavily. So I am not sure how well the model really learned to abstract over line breaks as a general possibility. I have not measured the impact of hyphenation specifically and methodically myself, as you have, so thanks for the analysis! In light of this, perhaps the model should indeed be applied with a dedicated rule: if there is a hyphen-like character at the end of a line, then remove it together with the line break (i.e. dehyphenate) before rating.
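For illustration, here is a minimal sketch of such a rule in Python (the set of hyphen-like characters and the function name are my own assumptions, not part of keraslm):

```python
import re

# Hyphen-like characters that can mark hyphenation at a line break:
# hyphen-minus, soft hyphen, non-breaking hyphen, and the double
# oblique hyphen (U+2E17) typical of Fraktur typesetting.
HYPHENS = '\u002D\u00AD\u2011\u2E17'

def dehyphenate(text):
    """Remove a hyphen-like character plus the following newline,
    joining the two halves of the word across the line break."""
    return re.sub('[%s]\n' % HYPHENS, '', text)
```

Note that such a naive rule also joins genuinely hyphenated compounds that happen to break at the line end (e.g. "Haupt- und Nebensatz" in German), so its net effect on the ratings would still have to be measured.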
...and if inference mode does it this way, then training should explicitly mask all hyphens on the input side, too. I am not even sure whether I should retain line breaks (the newline character) as such.
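On the training side, that masking could look like the following (a sketch under the assumption of a plain string-to-string preprocessing step; the function is illustrative, not the actual keraslm training code):

```python
import re

def mask_for_training(text, keep_newlines=False):
    """Hypothetical training-side preprocessing: drop line-end hyphens
    together with their newline, and optionally map any remaining
    newlines to spaces so the LM never has to model line breaks."""
    text = re.sub('[\u002D\u00AD\u2011\u2E17]\n', '', text)
    if not keep_newlines:
        text = text.replace('\n', ' ')
    return text
```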
Note: meanwhile, I found out that the plaintext version of DTA produced via dta-tools tei2txt has more problems for our use case. Moreover, there are problems with the textual quality of the DTA extended set (Ergänzungstexte).
So I decided to do my own plaintext export from the TEI version, which solves all that – and takes another bold step: it now uses Unicode NFKD normalization, because precomposed characters are much harder to learn (being sparse), especially with such productive combinations as in polytonic Greek. I will make some further code modifications to the training procedure (coverage of an explicit gap codepoint in the input, a cutoff frequency for implicit gaps in the input) and to the inference side (removal of normal line breaks and dehyphenation, applying NFKD and other string normalization, persisting the string preprocessor in the model config), and then retrain the DTA model. Until then, this issue will stay open. If you have additional ideas, please comment.
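To illustrate why NFKD helps here: it decomposes each precomposed codepoint into a base character plus combining marks, so a rare precomposed polytonic Greek letter becomes a sequence of much more frequent symbols (unicodedata is from the Python standard library; the example character is my own choice):

```python
import unicodedata

# U+1FA7: Greek small omega with dasia, perispomeni and ypogegrammeni,
# a rare precomposed codepoint from polytonic Greek.
precomposed = '\u1fa7'
decomposed = unicodedata.normalize('NFKD', precomposed)
print([hex(ord(c)) for c in decomposed])
# ['0x3c9', '0x314', '0x342', '0x345']: plain omega plus three
# combining marks, each far more frequent on its own than the composite.
```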