Tagging of words that end in a digit, e.g. Boeing777 #101

anatoleg · 2019-06-10T12:28:32Z

The tagger treats words that end in a digit as numbers and assigns them the UPOS NUM. This causes incorrect tagging of other words in the phrase and incorrect parsing, especially in languages with case marking, such as Russian. Is there any way to fix this and make the tagger tag such words as NOUN?
Just fixing the tagger's output for this one word does not change the incorrect case features on the other words.

foxik · 2019-06-10T13:00:30Z

Overriding some tags is unfortunately not easy currently. One possibility is to add such words to training data, but that is usually infeasible. The other possibility is to explicitly allow some list of UPOSes for every input word -- you could allow only NOUN UPOS for Boeing777, but that is currently not implemented in UDPipe.
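To make the "allowed list of UPOSes" idea concrete, here is a minimal sketch of how such a constraint could behave for a single word. It is not UDPipe code and UDPipe does not expose anything like it; `ScoredTag` and `best_allowed_upos` are hypothetical names, and the scores stand in for whatever a tagger would assign internally. In a real tagger the constraint would have to be applied during decoding so that neighbouring words benefit too, not as a filter on finished output.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical candidate analysis for one word: a UPOS and the tagger's score.
struct ScoredTag {
    std::string upos;
    double score;
};

// Returns the highest-scoring UPOS that is in `allowed`. If `allowed` is empty
// or none of the candidates is allowed, falls back to the best candidate
// overall. Returns an empty string when there are no candidates at all.
std::string best_allowed_upos(const std::vector<ScoredTag>& candidates,
                              const std::set<std::string>& allowed) {
    const ScoredTag* best = nullptr;      // best candidate inside the allow-list
    const ScoredTag* best_any = nullptr;  // best candidate overall (fallback)
    for (const ScoredTag& c : candidates) {
        if (!best_any || c.score > best_any->score) best_any = &c;
        if (allowed.count(c.upos) && (!best || c.score > best->score)) best = &c;
    }
    if (best) return best->upos;
    return best_any ? best_any->upos : std::string();
}
```

With candidates NUM (0.9) and NOUN (0.1) for Boeing777 and an allow-list containing only NOUN, the function returns NOUN even though NUM scores higher.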

A very hacky solution which you can do currently is to modify the input (e.g., replace Boeing777 with Boeing or Boeing###).
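A minimal sketch of that input rewriting, done before the text reaches the tagger. The function names are made up for illustration and nothing here calls UDPipe. The letter test works on raw bytes and treats every non-ASCII byte as a letter, which is a rough assumption so that UTF-8 words (e.g. Cyrillic) are covered.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Rough letter test on raw bytes: ASCII letters, plus any byte >= 0x80 so that
// UTF-8 encoded letters count too. Heuristic, not a full Unicode classification.
bool is_letter_byte(unsigned char c) {
    return std::isalpha(c) != 0 || c >= 0x80;
}

// If `token` contains a letter followed by a trailing run of ASCII digits,
// returns it with that run replaced by `placeholder` ("Boeing777" -> "Boeing",
// or "Boeing###" with placeholder "###"). Tokens without a digit suffix, or
// without any letter before it, are returned unchanged.
std::string mask_trailing_digits(const std::string& token,
                                 const std::string& placeholder = "") {
    std::size_t end = token.size();
    while (end > 0 && std::isdigit(static_cast<unsigned char>(token[end - 1])))
        --end;
    if (end == token.size()) return token;  // no trailing digits
    for (std::size_t i = 0; i < end; ++i)
        if (is_letter_byte(static_cast<unsigned char>(token[i])))
            return token.substr(0, end) + placeholder;
    return token;  // e.g. "777" or "3.14": leave real numbers alone
}

// Masks every token of a sentence. The caller keeps the original vector so the
// real word forms can be written back into the tagger's output afterwards.
std::vector<std::string> mask_sentence(const std::vector<std::string>& tokens,
                                       const std::string& placeholder = "") {
    std::vector<std::string> masked;
    masked.reserve(tokens.size());
    for (const std::string& t : tokens)
        masked.push_back(mask_trailing_digits(t, placeholder));
    return masked;
}
```

Because the masked and original vectors have the same length and order, the original forms can be restored by position once tagging and parsing are done.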

BTW, we are preparing UDPipe 2.0 with considerably better results, which a) should solve this kind of problem automatically (the current tagger guesses unknown words from prefixes and suffixes -- concentrating on the 777 at the end; the new one will consider the whole word), b) will allow specifying possible analyses for every input word.

anatoleg · 2019-06-11T16:00:19Z

Thank you very much for the response. We are using a similar hack and are eagerly waiting for the next release, when we will be able to discard it.

Point (b) in your response is particularly intriguing, since it could fix a number of current problems. For example, “departs” in “airplane departs from Prague” is tagged as NOUN, which, needless to say, causes wrong parses. If we could tell the tagger that “departs” is a VERB, that should solve this problem. This facility should be extended to the features as well as the UPOS. For example, in Russian, the word “сбит” (shot down) in “самолет сбит ракетой” (airplane shot down by a missile) is correctly tagged as VERB but with the wrong voice: active instead of passive. This makes the airplane “nsubj” instead of an object during parsing.

In general, a statistical system will inevitably make mistakes, and a facility to correct them without creating a new training set would be very welcome. When can we expect UDPipe 2.0?

foxik · 2019-06-11T19:03:17Z

As for the release, unfortunately I cannot make any promises -- I am teaching a lot and doing research, so the software work is currently not high-priority for me. I hope to have an inference-only prototype in the summer, but without changing the API. After that I also want to support training and change the API to support point b) -- but we are talking about Q4 of 2019.

AleksandrsBerdicevskis · 2020-03-20T15:02:47Z

Just a comment on this issue: such words are being tagged as NUM even if I am using my own model, trained on a not-exactly-UD input that does not have the NUM tag at all.

foxik · 2020-03-20T15:56:29Z

Yeah, the NUM is hardcoded, together with PUNCT and SYM:

udpipe/src/trainer/trainer_morphodita_parsito.cpp, lines 629 to 631 in 31f0b8c:

    dictionary_special_tags.number_tag = most_frequent_tag(training, "NUM", use_xpostag, use_feats, combined_tag);
    dictionary_special_tags.punctuation_tag = most_frequent_tag(training, "PUNCT", use_xpostag, use_feats, combined_tag);
    dictionary_special_tags.symbol_tag = most_frequent_tag(training, "SYM", use_xpostag, use_feats, combined_tag);

Should be improved with the (still not released) next version...
