WSD issues resulting in bad lemmatization/PoS tag sequence #1381
I'd like to push back a little on the idea that It's hard to even come up with examples that make sense. Nevertheless, on the walk to work I came up with a few. If we add these to the training data, the models might pick up that
proposed parses for these:
also, as a followup, the CoreNLP lemmatizer already properly handles number_JJR
…s might learn to choose that when appropriate, although it should be pointed out that with 1052 training examples of number_NN, it is unlikely they actually will. Addresses stanfordnlp/CoreNLP#1381
Oh, I wasn't clear. It currently does not assign the word any tag other
than NN, since the overwhelming majority of training examples carry that
tag. What we can do is add a few more examples in which the tag is JJR,
and retrain the models (which may take a while), and then perhaps it will
use that tag instead. I'm not super confident, though, considering how
many NN examples there are.
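
To put rough numbers on that imbalance, here is a minimal sketch. It assumes the 1052 number_NN examples mentioned in the commit message and a hypothetical 3 added JJR examples (the thread only says "a few"). It looks at context-free tag counts only, which is not how the real tagger decides, so treat it as intuition rather than a prediction of the retrained model's behavior.

```python
from collections import Counter

# 1052 comes from the commit message in this thread; the 3 added JJR
# examples are a made-up stand-in for "a few more examples".
counts = Counter({"NN": 1052})
counts["JJR"] += 3


def most_frequent_tag(tag_counts: Counter) -> str:
    """Return the tag a context-free, most-frequent-tag baseline would pick."""
    return tag_counts.most_common(1)[0][0]


def tag_share(tag_counts: Counter, tag: str) -> float:
    """Return the fraction of all examples of the word that carry `tag`."""
    return tag_counts[tag] / sum(tag_counts.values())


print(most_frequent_tag(counts))
print(f"JJR share: {tag_share(counts, 'JJR'):.2%}")
```

Any shift toward JJR would have to come from context features (for example, a following "than"), since the raw counts stay overwhelmingly NN.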
Aha! I understand, thanks for the clarification. Those proposed parses look fine to me; I agree that it's not likely to overcome that degree of word-sense imbalance in the data but it certainly can't hurt to include a few bonus examples for the PoS tagger.
Oh and also, now I'm confused by your comment "also, as a followup, the CoreNLP lemmatizer already properly handles number_JJR" - as best as I can tell, it definitely is not handling that scenario, and is lemmatizing it to "number_NN".
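If by chance you give the lemmatizer number with the tag JJR, it returns the lemma numb. I used it to convert those trees to a UD representation in the commit I made above, for example.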
Aha! Now I understand, thank you. So if the "right" PoS tag is assigned,
the lemmatizer knows what to do with it. That's good to know!
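
The point of this exchange is that the lemma depends on the (word, tag) pair rather than on the word alone. A toy sketch of that dependency follows; `toy_lemma` and its single suffix rule are hypothetical illustrations and are not CoreNLP's lemmatizer or its API.

```python
def toy_lemma(word: str, tag: str) -> str:
    """Toy tag-conditioned lemmatizer (hypothetical, not CoreNLP's rules).

    Only one rule is modeled: a comparative adjective (JJR) ending in
    "er" loses that suffix. Every other tag returns the lowercased word.
    """
    lowered = word.lower()
    if tag == "JJR" and lowered.endswith("er"):
        return lowered[:-2]
    return lowered


# Same surface form, different tag, different lemma.
for tag in ("NN", "JJR"):
    print(tag, toy_lemma("number", tag))
```

So when the tagger outputs NN, the lemma stays "number" regardless of how well the lemmatizer handles JJR. The one-rule suffix stripping is far too naive for real use (it would turn "bigger" into "bigg").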
…-SB
On Tue, Aug 1, 2023 at 3:33 PM John Bauer wrote:
also, as a followup, the CoreNLP lemmatizer already properly handles
number_JJR
If by chance you give the lemmatizer number with the tag JJR, it returns
the lemma numb. I used it to convert those trees to a UD representation
in the commit I made above, for example.
stanfordnlp/handparsed-treebank@c1a405b
Hello! I am running into a word-sense disambiguation issue where CoreNLP seems to be systematically struggling with adverbs that share a surface form with different words. For example, "number" (as in, "the number five") and "number" (as in, "his right side was number than his left"). In both sentences, CoreNLP interprets the token "number" as the noun "number" (NN); in the second sentence, it should be tagging it as an adverbial form of the adjective "numb" (RBR). The lemmatizer is also mapping "number"/RBR to "number" rather than "numb", which seems like it may be part of the issue. And, of course, since the wrong tag ends up being assigned, any downstream annotation is wrong as well (dependencies, etc.).
I've experimented a bit with different syntactic constructions, and have not yet managed to successfully find a formulation that does get CoreNLP to tag "number" as an RBR instead of an NN.
Obviously, the tagger is fundamentally a statistical model and it's gonna do what it's gonna do, but on the other hand this isn't a particularly odd word, nor is it syntactically ambiguous, so I thought I'd see if anybody else had run into this sort of issue or if there was something I could do to change the tagger's behavior. Thanks in advance for any insight you may be able to provide!