Tagging of words that end in a digit, e.g. Boeing777 #101

Open
anatoleg opened this issue Jun 10, 2019 · 5 comments

Comments

@anatoleg

The tagger treats words that end in a digit as numbers and assigns them the UPOS NUM. This causes incorrect tagging of other words in the phrase and incorrect parsing, especially in languages with grammatical case, such as Russian. Is there any way to fix this and make the tagger tag such words as NOUN?
Just fixing the tagger's output for this one word does not correct the wrong case features on the other words.

@foxik
Member

foxik commented Jun 10, 2019

Overriding some tags is unfortunately not easy at the moment. One possibility is to add such words to the training data, but that is usually infeasible. The other possibility is to explicitly restrict the allowed UPOS values for every input word -- you could allow only NOUN for Boeing777 -- but that is currently not implemented in UDPipe.

A very hacky workaround that you can use right now is to modify the input (i.e., replace Boeing777 with Boeing or Boeing###).
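The input-rewriting workaround could be sketched roughly as below. This is only an illustration, not part of UDPipe: the helper names `mask_trailing_digits` and `mask_sentence` are hypothetical, the input is assumed to be pre-tokenized UTF-8, and whether the tagger actually handles `Boeing###` better than `Boeing777` has to be checked against your own model.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True for ASCII digits only (avoids locale-dependent std::isdigit).
static bool is_ascii_digit(char c) { return c >= '0' && c <= '9'; }

// True for ASCII letters and for any byte of a multi-byte UTF-8 character
// (e.g. Cyrillic). A rough stand-in for "this byte belongs to a word".
static bool is_word_byte(char c) {
  unsigned char u = static_cast<unsigned char>(c);
  return (u >= 'A' && u <= 'Z') || (u >= 'a' && u <= 'z') || u >= 0x80;
}

// Hypothetical helper: replaces the trailing digits of a word-like token
// with `mask`, e.g. "Boeing777" -> "Boeing###". Tokens with no letter
// directly before the trailing digits ("777", "3.14") are returned unchanged.
std::string mask_trailing_digits(const std::string& token, char mask = '#') {
  std::size_t stem = token.size();
  while (stem > 0 && is_ascii_digit(token[stem - 1])) --stem;
  if (stem == token.size() || stem == 0 || !is_word_byte(token[stem - 1]))
    return token;
  return token.substr(0, stem) + std::string(token.size() - stem, mask);
}

// Hypothetical helper: masks every token of a pre-tokenized sentence.
// The output has the same length and order as the input, so the original
// forms can be copied back by index after tagging.
std::vector<std::string> mask_sentence(const std::vector<std::string>& tokens,
                                       char mask = '#') {
  std::vector<std::string> out;
  out.reserve(tokens.size());
  for (const std::string& t : tokens) out.push_back(mask_trailing_digits(t, mask));
  return out;
}
```

The masked tokens would be passed to the tagger in place of the originals, and the original forms written back into the FORM column afterwards. Any lemma the tagger produces will be based on the masked form, so it may need the same treatment.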

BTW, we are preparing UDPipe 2.0 with considerably better results, which a) should solve this kind of problem automatically (the current tagger guesses unknown words from prefixes and suffixes, so it concentrates on the 777 at the end; the new one will consider the whole word), and b) will allow specifying the possible analyses for every input word.

@anatoleg
Author

anatoleg commented Jun 11, 2019 via email

@foxik
Member

foxik commented Jun 11, 2019

As for the release, I unfortunately cannot make any promises -- I am teaching a lot and doing research, so the software work is currently not a high priority for me. I hope to have an inference-only prototype in the summer, but without changing the API. After that I also want to support training and to change the API to support point b) -- but we are talking about Q4 of 2019.

@AleksandrsBerdicevskis

AleksandrsBerdicevskis commented Mar 20, 2020

Just a comment on this issue: such words are tagged as NUM even when I use my own model, trained on not-exactly-UD input that does not have the NUM tag at all.

@foxik
Member

foxik commented Mar 20, 2020

Yeah, the NUM is hardcoded, together with PUNCT and SYM:

dictionary_special_tags.number_tag = most_frequent_tag(training, "NUM", use_xpostag, use_feats, combined_tag);
dictionary_special_tags.punctuation_tag = most_frequent_tag(training, "PUNCT", use_xpostag, use_feats, combined_tag);
dictionary_special_tags.symbol_tag = most_frequent_tag(training, "SYM", use_xpostag, use_feats, combined_tag);

Should be improved with the (still not released) next version...
