lemma of compound words contains only the headword #3

pekoli · 2020-06-08T14:25:30Z

I've noticed that for most compound words only the headword is stored in the lemma. This mainly concerns nouns as in the following examples:

# sent_id = hdt-s10009
7       Leitungsinfrastruktur   Infrastruktur   NOUN    NN      Gender=Fem|Number=Sing|Person=3 2       obj     _       _

# sent_id = hdt-s10011
6       Stellenstreichungen     Streichung      NOUN    NN      Gender=Fem|Number=Plur|Person=3 4       conj    _       _

# sent_id = hdt-s10015
17      Vorstandvorsitzender    Vorsitzender    NOUN    NN      Case=Nom|Gender=Masc|Number=Sing|Person=3       16      nsubj   _

but also adjectives:

# sent_id = hdt-s10005
2       US-amerikanische        amerikanisch    ADJ     ADJA  Degree=Pos|Gender=Neut|Number=Sing      3       amod    _       _

However, there are examples where the whole compound is given in the lemma:

# sent_id = hdt-s10012
14      Geschäftsjahres Geschäftsjahr   NOUN    NN      Case=Gen|Gender=Neut|Number=Sing|Person=3       11      nmod:poss       _       _

Is it an artifact of converting the original treebank to UD format?


akoehn · 2020-06-08T14:32:26Z

Yes, the lemma column is a copy of the "base" annotation in the original HDT annotation. I thought we documented this somewhere, but I don't remember where.

pekoli · 2020-06-08T15:35:42Z

Thanks for the quick reply!
Unless I missed it, the papers linked in the README don't mention this explicitly.

I think it would be possible to restore the complete lemma from the word form and the headword using a script. Would you consider merging if I did a PR on this? Or should I just create a fork?
(Background is training neural lemmatizers - currently, they're forced to learn compound splitting in addition to lemmatisation which doesn't make it easier...)
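For illustration, here is a minimal sketch of how such a restoration could work. It is not the script from the PRs referenced in this thread; the function name `restore_full_lemma` and the `min_overlap` threshold are made up for this example. It assumes the head lemma appears (case-insensitively) as a prefix of some suffix of the word form, and that lowercasing does not change string length, which holds for ordinary German text.

```python
def restore_full_lemma(form: str, head_lemma: str, min_overlap: int = 3) -> str:
    """Guess the full compound lemma from a word form and its head-only lemma.

    Hypothetical helper: finds the position in `form` where the longest
    case-insensitive match with the start of `head_lemma` begins, then
    glues the preceding part of the form onto the head lemma.
    """
    form_l, head_l = form.lower(), head_lemma.lower()
    best_start, best_len = 0, 0
    for start in range(len(form_l)):
        # Length of the common prefix of form_l[start:] and head_l.
        n = 0
        while (start + n < len(form_l) and n < len(head_l)
               and form_l[start + n] == head_l[n]):
            n += 1
        if n > best_len:
            best_start, best_len = start, n
    # No modifier in front of the head, or the match is too weak to trust:
    # keep the existing lemma unchanged.
    if best_start == 0 or best_len < min_overlap:
        return head_lemma
    prefix = form[:best_start]
    if prefix.endswith("-"):
        # Hyphenated compound: the head keeps its own casing.
        return prefix + head_lemma
    # Closed compound: the head is written lowercase inside the word.
    return prefix + head_lemma[0].lower() + head_lemma[1:]


# The (form, lemma) pairs quoted at the top of this issue.
examples = [
    ("Leitungsinfrastruktur", "Infrastruktur"),
    ("Stellenstreichungen", "Streichung"),
    ("Vorstandvorsitzender", "Vorsitzender"),
    ("US-amerikanische", "amerikanisch"),
    ("Geschäftsjahres", "Geschäftsjahr"),
]
for form, lemma in examples:
    print(form, "->", restore_full_lemma(form, lemma))
```

On the five examples above this yields Leitungsinfrastruktur, Stellenstreichung, Vorstandvorsitzender, US-amerikanisch and Geschäftsjahr. Heads whose stem changes in the inflected form (e.g. umlaut plurals) or heads shorter than `min_overlap` are left untouched by this sketch, so a real script would need extra handling and a check against the treebank.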

akoehn · 2020-09-16T07:47:53Z

Sorry, I forgot this issue.

I think creating the lemma from the word form and the base annotation is a sensible idea, and I will have a closer look at the effect of the script in #6. If the script works well enough, we could also use it in the publication pipeline. I made a TODO to look into it next week.

In any case, can you add a proper header to the scripts including the license (i.e. Apache 2.0 or GPLv3 (or later)) and a copyright notice with yourself as the author?

This was referenced Sep 10, 2020:

Restore full lemma #5 (Closed)

script and unit test #6 (Open)
