Tagging of words that end in a digit, e.g. Boeing777 #101
Comments
Overriding some tags is unfortunately not easy at the moment. One possibility is to add such words to the training data, but that is usually infeasible. The other possibility is to explicitly allow a list of UPOS tags for every input word -- you could allow only NOUN for Boeing777 -- but that is currently not implemented in UDPipe. A very hacky solution that works today is to modify the input (e.g., replace Boeing777 with Boeing or Boeing###). BTW, we are preparing UDPipe 2.0 with considerably better results, which a) should solve this kind of problem automatically (the current tagger guesses unknown words from prefixes and suffixes, so it concentrates on the 777 at the end; the new one will consider the whole word), and b) will allow specifying possible analyses for every input word.
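For anyone who wants to try the input-modification hack, here is a minimal sketch of how it could look. It is not part of UDPipe: the function names `mask_tokens` and `restore_forms` and the regular expression are hypothetical, and it assumes the input is already tokenized, that the tagger is run on the masked tokens with tokenization disabled, and that the output contains no multiword-token lines. It only restores the FORM column; the LEMMA column is left as the tagger produced it.

```python
import re

# Assumed pattern: one or more letters followed by trailing digits, e.g. "Boeing777".
DIGIT_SUFFIX = re.compile(r"^(?P<stem>[^\W\d_]+)\d+$")


def mask_tokens(tokens):
    """Strip trailing digits from letter+digit tokens.

    Returns (masked_tokens, originals), where originals maps the
    position of each masked token to its original form.
    """
    masked, originals = [], {}
    for i, tok in enumerate(tokens):
        m = DIGIT_SUFFIX.match(tok)
        if m:
            originals[i] = tok
            masked.append(m.group("stem"))
        else:
            masked.append(tok)
    return masked, originals


def restore_forms(conllu, originals):
    """Put the original forms back into the FORM column of CoNLL-U text."""
    out = []
    idx = 0  # running position over word lines, in input order
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Word lines have an integer ID in the first column.
        if line and not line.startswith("#") and cols[0].isdigit():
            if idx in originals:
                cols[1] = originals[idx]
            idx += 1
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

The masked tokens go to the tagger and parser, and `restore_forms` is applied to whatever CoNLL-U comes back. Masking before tagging, rather than patching the tag afterwards, is the point of the hack: the neighbouring words are then tagged in the context of a noun-like token.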
Thank you very much for the response. We are using a similar hack and are eagerly waiting for the next release, when we will be able to discard it. Point (b) in your response is particularly intriguing, since it could fix a number of current problems. For example, “departs” in “airplane departs from Prague” is tagged as NOUN, which, needless to say, causes wrong parses. If we could tell the tagger that “departs” is a VERB, that should solve this problem. This facility should be extended to the features as well as the UPOS. For example, in Russian, the word “сбит” (shot down) in “самолет сбит ракетой” (airplane shot down by a missile) is correctly tagged as VERB but in the wrong voice: active instead of passive. This makes the airplane “nsubj” instead of an object during parsing.
In general, a statistical system will inevitably make mistakes and a facility to correct them without creating a new training set would be very welcome.
When can we expect UDPipe 2.0?
(Sent by email in reply to Milan Straka's comment above, dated Jun 10, 2019, 9:00 AM.)
As for the release, unfortunately I cannot make any promises -- I am teaching a lot and doing research, so the software work is currently not a high priority for me. I hope to have an inference-only prototype in the summer, but without changing the API. After that I also want to support training and change the API to support point b) -- but we are talking about Q4 of 2019.
Just a comment on this issue: such words are tagged as NUM even when I use my own model, trained on not-exactly-UD input that does not have the NUM tag at all.
Yeah, see udpipe/src/trainer/trainer_morphodita_parsito.cpp, lines 629 to 631 in 31f0b8c.
This should be improved with the (still not released) next version...
The tagger treats words that end in a digit as numbers and assigns them UPOS NUM. That causes incorrect tagging of other words in the phrase and incorrect parsing, especially in languages with case marking, such as Russian. Is there any way to fix this and make the tagger tag such words as NOUN?
Just fixing the tagger's output for this word does not correct the wrong case features on the other words.
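To see how often this happens in a corpus, a small diagnostic along these lines may help. It is only a sketch: the function name `suspicious_num_tokens` and the letters-then-digits pattern are my own assumptions, and it expects standard 10-column CoNLL-U output. It reports the affected tokens without changing anything, since patching the tag alone would not repair the neighbouring words.

```python
import re

# Assumed shape of the problem tokens: at least one letter, then trailing digits.
ENDS_IN_DIGIT = re.compile(r"^[^\W\d_]+\d+$")


def suspicious_num_tokens(conllu):
    """Return (token_id, form) pairs for letter+digit words tagged NUM."""
    hits = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        # Column 2 is FORM, column 4 is UPOS.
        if cols[3] == "NUM" and ENDS_IN_DIGIT.match(cols[1]):
            hits.append((cols[0], cols[1]))
    return hits
```

Genuine numerals such as "777" are not reported, because the pattern requires at least one leading letter.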