Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deinflection of 見とれていたら #1966

Open
Tomalak opened this issue Sep 4, 2024 · 11 comments
Open

Deinflection of 見とれていたら #1966

Tomalak opened this issue Sep 4, 2024 · 11 comments

Comments

@Tomalak
Copy link

Tomalak commented Sep 4, 2024

This is currently (v 1.20.0) deinflected as (< humble or Kansai dialect < continuous < potential < continuous < -tara) from 見る, which seems a little weird, not at least because of the duplicated "continuous",

screenshot of the popup

Deinflecting as (< continuous < -tara) from 見とれる makes more sense. This is only the second hit in the result list - even for itself.

@birtles
Copy link
Member

birtles commented Sep 4, 2024

@enellis Any ideas what went wrong here?

We certainly should fix 見とれる too. I thought we had a rule that prioritized longer matches so I'm surprised that 見とれる comes before 見る.

@Tomalak
Copy link
Author

Tomalak commented Sep 4, 2024

It would probably be useful to take the number of deinflection steps into account for result sorting - in cases where a verb form can be deinflected in more than one way, the simpler deinflection chains should sort higher.

@enellis
Copy link
Contributor

enellis commented Sep 4, 2024

Also, recognizing the continuous form twice doesn't make any sense grammatically, right? So we should probably add a rule to prevent this.

@enellis
Copy link
Contributor

enellis commented Sep 28, 2024

After preventing duplicate reasons in the reason chains, the results for 見とれていたら are now as expected:

Bildschirmfoto 2024-09-28 um 12 37 50

However, when scanning 見とれる, the result for 見る somehow still appears before the result for 見とれる.

EDIT: Hmm, something seems off with the sorting in general. When matching なる, I would expect 成る to be the top result, but it only appears fifth.

@birtles
Copy link
Member

birtles commented Sep 30, 2024

EDIT: Hmm, something seems off with the sorting in general. When matching なる, I would expect 成る to be the top result, but it only appears fifth.

Yeah, there's something off there. I added some logic to prioritize kana-only entries when searching with kana and it seems to be throwing the result off here. I have an alternative implementation that gets this case right so I'll have to look into exactly why. Unfortunately I'm mostly unavailable this week so it might be a few days.

@birtles
Copy link
Member

birtles commented Oct 4, 2024

However, when scanning 見とれる, the result for 見る somehow still appears before the result for 見とれる.

I had a look into this and here's what I worked out so far.

Firstly, the 見とれる/見る case is different to the なる/成る case.

The 見とれる/見る case

This comes about due to the fix for #1722. That is, after deinflecting, we sort within the results by priority.

For the input 見とれる we end up producing deinflection candidates "見とれる, 見とれるる, 見とる, 見る". We look each up to see what matching results we find and then sort the results.

For the issue in #1722, that allows us to sort 進む before 進ぶ when deinflecting 進んでいます。

For this case, however, it will mean we'll sort the result for 見る before the result for 見とれる.

I think it's legitimate to address that by first sorting by the length of the matched text.

I've implemented that locally and so far it seems to work.

(Edit: Now pushed to the fix-sorting branch.)

The なる/成る case

This comes about because of #1610 and #1657. That is, when you look up a kana result, we'll favor words that have kana-only headwords.

In that sense, it's sort-of doing the right thing. The first result is 生る whose only sense is marked as "usually kana" so it gets the special "kana match" treatment.

It's unfortunate, however, because 成る has 11 senses, 10 of which are marked as "usually kana". If all 11 senses were marked as "usually kana" it would have gotten the special treatment.

So I suspect the sorting heuristic needs some tweaking to so that if some portion of the matched senses are marked as "usually kana" we give it the special "kana match" ranking.

birtles added a commit that referenced this issue Oct 4, 2024
Without this, when deinflecting 見とれる we'll end up sorting the
results for 見る before the results for 見とれる itself.

See #1966.
@Tomalak
Copy link
Author

Tomalak commented Oct 4, 2024

This comes about due to the fix for #1722

Ironically, that was reported by me as well...

I think it's legitimate to address that by first sorting by the length of the matched text.

I.e. longer matches sort first? Wouldn't this still lead to ties within candidate matches of equal length? That's where my idea came from to take into account the length of the deinflection chain as well.

birtles added a commit that referenced this issue Oct 5, 2024
Without this, when deinflecting 見とれる we'll end up sorting the
results for 見る before the results for 見とれる itself.

See #1966.
birtles added a commit that referenced this issue Oct 5, 2024
…king up by kana

Without this, when we look up "なる" we'll fail to prioritize the entry
for 成る because 1 of its 11 senses is not marked as "usually kana".

This also ensures we don't consider non-English senses (which don't
have "usually kana" annotations) or unmatched senses.

See #1966.
@birtles
Copy link
Member

birtles commented Oct 5, 2024

I think it's legitimate to address that by first sorting by the length of the matched text.

I.e. longer matches sort first? Wouldn't this still lead to ties within candidate matches of equal length? That's where my idea came from to take into account the length of the deinflection chain as well.

If there are ties within candidate matches of equal length, they will be sorted by priority.

I suspect there are some cases where the length of the deinflection chain makes sense but I haven't found any yet.

birtles added a commit that referenced this issue Oct 5, 2024
Without this, when deinflecting 見とれる we'll end up sorting the
results for 見る before the results for 見とれる itself.

See #1966.
birtles added a commit that referenced this issue Oct 5, 2024
…king up by kana

Without this, when we look up "なる" we'll fail to prioritize the entry
for 成る because 1 of its 11 senses is not marked as "usually kana".

This also ensures we don't consider non-English senses (which don't
have "usually kana" annotations) or unmatched senses.

See #1966.
@enellis
Copy link
Contributor

enellis commented Oct 21, 2024

I just stumbled upon this case: 同じ.

I would expect 同じ to be the first result as it is more common, and then 同じる < masu-stem. Somehow it is the other way around.

@birtles
Copy link
Member

birtles commented Oct 22, 2024

I just stumbled upon this case: 同じ.

I would expect 同じ to be the first result as it is more common, and then 同じる < masu-stem. Somehow it is the other way around.

Oh, yes, that's definitely a regression. I'll have to investigate. Maybe we do need to sort by the length of the deinflection chain as suggested by @Tomalak after all.

@Konstantin-Glukhov
Copy link

Shouldn't the match be determined by a word boundary first and then the frequency (commonness)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants