lemma of compound words contains only the headword #3
Comments
Yes, the lemma column is a copy of the "base" annotation in the original HDT annotation. I thought we documented this somewhere, but I don't remember where.
Thanks for the quick reply! I think it would be possible to restore the complete lemma from the word form and the headword using a script. Would you consider merging if I did a PR on this? Or should I just create a fork?
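To make the idea concrete, here is a minimal sketch of how such a restoration could work. It is not the script later referenced in this thread. The function name `restore_compound_lemma` and the matching heuristic are assumptions for illustration: it looks for the headword inside the word form and prepends whatever comes before it.

```python
def restore_compound_lemma(form: str, head_lemma: str) -> str:
    """Heuristically rebuild a compound lemma from a word form and its headword.

    Hypothetical helper: assumes the headword lemma appears literally
    (ignoring case) inside the word form.
    """
    # Find the last occurrence of the headword inside the form.
    idx = form.lower().rfind(head_lemma.lower())
    if idx <= 0:
        # Either the form is not a compound (match at position 0) or the
        # headword is not found (e.g. umlaut plurals); keep the headword.
        return head_lemma
    modifier = form[:idx]
    if modifier.endswith("-"):
        # Hyphenated compounds keep the headword's own capitalization.
        return modifier + head_lemma
    # Closed compounds: the headword is lowercased after the modifier.
    return modifier + head_lemma.lower()


print(restore_compound_lemma("Bundesregierungen", "Regierung"))
print(restore_compound_lemma("weltweiten", "weit"))
```

Forms whose stem changes under inflection are left untouched by this sketch, and a purely string-based match can also produce false positives, so the output would need to be checked against the treebank.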
Sorry, I forgot about this issue. I think the idea of creating the lemma from the word form and the base annotation is sensible, and I will have a closer look at the effect of the script in #6. If the script works well enough, we could also use it in the publication pipeline. I made a TODO to look into it next week. In any case, can you add a proper header to the scripts, including the license (i.e. Apache 2.0 or GPLv3 (or later)) and a copyright notice with yourself as the author?
I've noticed that for most compound words only the headword is stored in the lemma. This mainly concerns nouns as in the following examples:
but also adjectives:
However, there are examples where the whole compound is given in the lemma:
Is it an artifact of converting the original treebank to UD format?